One thing I have often noticed is that while a No Entry sign mostly suffices to keep people out of a restricted area, it is not foolproof. There will always be some people who disregard the sign and venture into the restricted area anyway. Using a robots.txt file to disallow crawlers from a website is similar. While the instructions in the robots.txt file ask crawlers, spiders and bots not to crawl your website, they do not enforce anything. There is a possibility that some spiders will still crawl your pages. Hence the need to actually block crawlers.
In an earlier article we wrote about How to Disallow Crawlers, Spiders and Bots from Websites. While that method is efficient, it is clearly not sufficient on its own. To resolve this we need a workaround, and that is exactly what I am going to provide. Instead of just disallowing crawlers with instructions in the robots.txt file, we are going to block them.
The methods given below to block crawlers have been tried on Apache 2.4.7 (installed on Ubuntu), and I expect they will work with any Apache 2.4.x release. If you are not able to implement them on your Apache, write to me in the comments section, mentioning your Apache version and server operating system. If you need to share any sensitive information, you can write to me at [email protected].
HTTP Basic Authentication to Block Crawlers
The first method I am going to demonstrate to block crawlers uses HTTP Basic Authentication. You might have come across the authentication box shown in the image below when trying to access certain websites.
This box appears when HTTP Basic Authentication is implemented. To set it up, you have to edit the virtual host configuration file of your domain.
Create a Password File
The first step is to create a password file containing the username and password. Connect to your server over SSH and execute the command below.
htpasswd -c <path_of_the_password_file> <username>
Replace <path_of_the_password_file> with the location where you want to create the file that stores the username and password combination (the password is stored in hashed form). For the sake of explanation, let's assume you provide the path /home/tahseen/Desktop. Replace <username> with the username you want. For demonstration purposes I am going to create the username wisdmlabs, so the command should look something like the one below.
htpasswd -c /home/tahseen/Desktop/password wisdmlabs
After replacing the password file location and username in the above command, hit Enter. It will ask you for the password of the user you want to add; type a password and hit Enter again. After adding the user to the file, it will show the message Adding password for user <username>, where <username> is the username you wanted to add. The image below shows what this looks like.
Note: In the above command we passed the -c option so that it creates the file. If you already have a file where the username-password combination should be saved, you don't need to provide the -c parameter.
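For example, to add a second user to the same file later (the username editor below is just a placeholder), run htpasswd without the -c option:
htpasswd /home/tahseen/Desktop/password editor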
Edit Configuration File
So far we have created the username and password. Now it is time to add this information to the site configuration; this step is what actually blocks crawlers from our website. Let's say you are implementing this for abc.com. The virtual host configuration for that domain will be in the /etc/apache2/sites-available directory, and I am assuming the configuration file for abc.com is abc.com.conf. Open that configuration file for editing using the command below.
sudo nano /etc/apache2/sites-available/abc.com.conf
Append the content below at the end of the VirtualHost block of the configuration file.
<Directory />
    # Allow internal IPs to access the website directly. If you don't have internal IPs, omit the Require ip line.
    Require ip 192.168.2.1/24
    # Replace /var/.password with the file path you provided to the htpasswd command.
    AuthType Basic
    AuthUserFile /var/.password
    AuthName "Authentication Required"
    Require valid-user
    Satisfy Any
</Directory>
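With Apache 2.4, the two Require directives at the same level act as alternatives, so requests from the internal subnet are let straight through while everyone else, including crawlers, must supply valid credentials.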
After adding the above content, save the file and reload Apache with the command below.
sudo service apache2 reload
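If the reload fails, or you just want to catch typos beforehand, you can check the configuration syntax with Apache's built-in test:
sudo apache2ctl configtest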
You are done! Now try to visit the website; it should ask you for a username and password (if you are not visiting from the internal network). If the authentication prompt appears, your attempt to block crawlers has worked!
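You can also verify this from the command line with curl, assuming the abc.com domain and wisdmlabs user from the examples above and running from a machine outside the internal IP range:
curl -I http://abc.com/                          # should return 401 Unauthorized
curl -I -u wisdmlabs:<password> http://abc.com/  # should return 200 OK with valid credentials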
Responding with 403 to Block Crawlers
The second method to block crawlers is to respond to them with a 403 (Forbidden) status. In this method, we detect the user agents of known crawlers and block them. The disadvantage is that if a crawler changes its user agent, it can still crawl the content.
You can add the content given below to the .htaccess file to block crawlers. If it does not work after adding it to the .htaccess file, you will have to edit the virtual host configuration file of the corresponding domain, as we did in the earlier method.
<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} ^.*(googlebot|bingbot|yahoo|AhrefsBot|Baiduspider|Ezooms|MJ12bot|YandexBot|bot|agent|spider|crawler|extractor).*$ [NC]
    RewriteRule .* - [F,L]
</IfModule>
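To check that the rule is working, you can send a request with a crawler-like user agent using curl (again assuming the abc.com domain from earlier); it should receive a 403 response:
curl -I -A "AhrefsBot" http://abc.com/   # should return 403 Forbidden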
If it still does not work, make sure that the rewrite module is enabled. To check, run the command below.
apachectl -M
If it does not show rewrite_module in the output, you will have to enable it for the rules above to take effect. If you don't know how to enable it, refer to the article Enable Rewrite Module; a quick sketch for Ubuntu follows below.
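On Ubuntu, enabling the module usually comes down to the following two commands (the linked article covers this in more detail):
sudo a2enmod rewrite
sudo service apache2 restart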
The above two methods should be sufficient to help you block crawlers from your website. However, if you are still having any difficulties, feel free to get in touch with me through the comments section.