How to Block Crawlers, Spiders and Bots from Websites

I have often noticed that while a No Entry sign usually keeps people out of a restricted area, it is not foolproof. Some people will always disregard the sign and walk in anyway. Using the robots.txt file to disallow crawlers from a website is similar. The instructions in robots.txt ask crawlers, spiders and bots not to crawl your website, but they are advisory and nothing enforces them. Some spiders will still crawl your pages. Hence the need to block crawlers.
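
To see why robots.txt is only a request, here is a minimal Python sketch using the standard library's urllib.robotparser. The rules, the crawler name ExampleBot and the helper is_allowed are illustrative assumptions, not taken from any real site. A polite crawler runs a check like this before requesting a page; nothing on the server forces a crawler to do so.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that disallows the whole site for every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A well-behaved crawler performs this check and stays away when it fails.
# A badly behaved crawler simply skips the check and requests the page anyway.
polite_crawler_may_fetch = is_allowed("ExampleBot", "http://abc.com/private/page.html")
print(polite_crawler_may_fetch)
```

The check lives entirely in the crawler's own code, which is why the rest of this article moves enforcement to the server.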

In an earlier article we wrote about How to Disallow Crawlers, Spiders and Bots from Websites. That method is efficient, but it is clearly not sufficient on its own. To close the gap we need a workaround, and that is what I am going to provide here. Instead of only disallowing crawlers with instructions in the robots.txt file, we are going to block them.

The methods given below have been tried on Apache 2.4.7 (installed on Ubuntu), and I expect them to work with Apache 2.4.x. If you are not able to implement them on your Apache, write to me in the comments section and include your Apache version and server operating system. If you need to share any sensitive information, you can write to me at [email protected].

HTTP Basic Authentication to Block Crawlers

The first method I am going to demonstrate uses HTTP Basic Authentication. You might have come across an authentication box like the one in the image below when trying to access some websites.

[Image: Authentication Pop-up for Website]

The above box appears when HTTP authentication is implemented. To implement it, you have to edit the virtual host configuration file of your domain.

Create a Password File

The first step is to create a password file containing the username and password. Connect to your server using SSH and execute the command below.

htpasswd -c <path_of_the_password_file> <username>

Replace <path_of_the_password_file> with the location of the file that will store the username and password combination. htpasswd stores a hash of the password rather than the plain text. For the sake of explanation, let’s assume the file is /home/tahseen/Desktop/password. Replace <username> with the username you want. For demonstration purposes I am going to create the username wisdmlabs. Your command should now look like the one below.

htpasswd -c /home/tahseen/Desktop/password wisdmlabs

After replacing the password file location and username in the above command, hit Enter. It will ask you for the password of the user you want to add, and then ask you to retype it. Provide the password and hit Enter each time. After adding the user to the file, it shows the message Adding password for user <username>, where <username> is the username you chose. The image below shows the whole exchange.

[Image: Create Password File]

Note: In the above command we passed the -c option so that it creates the file. If the file already exists, omit -c: htpasswd will then add the user to the existing file, whereas -c would overwrite it.
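
Each line that htpasswd writes has the form username:hash. As a quick sanity check after running the command, the sketch below lists the usernames in such a file. It is a minimal illustration in Python; the function name list_htpasswd_users and the sample hash are placeholders invented for the example, not real htpasswd output.

```python
def list_htpasswd_users(text):
    """Return the usernames found in htpasswd-style text (one user:hash per line)."""
    users = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and anything without the user:hash separator.
        if not line or ":" not in line:
            continue
        username, _password_hash = line.split(":", 1)
        users.append(username)
    return users


# Placeholder content; a real file holds a hash produced by htpasswd.
sample = "wisdmlabs:$apr1$placeholder$notarealhashvalue\n"
print(list_htpasswd_users(sample))
```

To run it against the file created above, pass in the file's contents, for example open("/home/tahseen/Desktop/password").read().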

Edit Configuration File

So far, we have created the username and password. Now it is time to add this information to the site configuration. This step is what blocks crawlers from our website. Let’s say you are implementing this for abc.com. The virtual host configuration for that domain will be in the /etc/apache2/sites-available directory. I am assuming that the configuration file for abc.com is abc.com.conf. Open that file for editing using the command below.

sudo nano /etc/apache2/sites-available/abc.com.conf

Append the content below at the end of the VirtualHost block of the configuration file.

<Directory />
  # Replace /var/.password with the file path you gave to the htpasswd command
  AuthType Basic
  AuthName "Authentication Required"
  AuthUserFile /var/.password
  # Apache 2.4 grants access when ANY of the Require lines below matches
  # (implicit RequireAny), so the 2.2-style "Satisfy Any" is not needed.
  # Allow internal IPs to access the site directly. Omit this line if you have none.
  Require ip 192.168.2.0/24
  Require valid-user
</Directory>
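
The intent of this block is that a visitor gets in if the request comes from the internal range or if valid credentials are supplied. The Python sketch below models that decision only; it is not how Apache implements it, and the function name and the example addresses are assumptions made for illustration.

```python
import ipaddress

# The internal range from the configuration. strict=False accepts a host
# address such as 192.168.2.1/24 and normalises it to the 192.168.2.0/24 network.
INTERNAL_NETWORK = ipaddress.ip_network("192.168.2.1/24", strict=False)


def access_granted(client_ip, has_valid_credentials):
    """Grant access if the client is internal OR has authenticated."""
    is_internal = ipaddress.ip_address(client_ip) in INTERNAL_NETWORK
    return is_internal or has_valid_credentials


print(access_granted("192.168.2.57", False))  # internal visitor, no login needed
print(access_granted("203.0.113.9", False))   # outside visitor or crawler without credentials
print(access_granted("203.0.113.9", True))    # outside visitor who logged in
```

A crawler arriving from outside the internal range falls into the second case, which is what keeps it out.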

After adding the above content, save the file and reload Apache by running the command below.

sudo service apache2 reload

You are done! Now try to visit the website. It should ask you for a username and password (unless you are visiting from the internal network). If this authentication pop-up appears, your attempt to block crawlers has worked!
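
If you would rather check from a script than from a browser, the following Python sketch sends a request with or without Basic credentials and reports the HTTP status code. The helper names basic_auth_header and fetch_status are assumptions made for this example, and the URL and credentials you pass in are your own.

```python
import base64
import urllib.error
import urllib.request


def basic_auth_header(username, password):
    """Build the value of the Authorization header for HTTP Basic Authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def fetch_status(url, username=None, password=None):
    """Return the HTTP status code for url, sending credentials if they are given."""
    request = urllib.request.Request(url)
    if username is not None:
        request.add_header("Authorization", basic_auth_header(username, password))
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx responses; the status code is on the exception.
        return error.code
```

Called from outside the internal network without credentials, fetch_status("http://abc.com/") should come back as 401, which is the response a crawler receives. With the username and password you created, it should return whatever status the page normally gives.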

Responding with 403 to Block Crawlers

The second method to block crawlers is to respond to them with 403 Forbidden. In this method, we try to detect the user agents of crawlers and block them. The disadvantage of this method is that a crawler can still crawl the content if it changes its user agent.

You can add the content given below to the .htaccess file to block crawlers. If it does not work from .htaccess (for example, because AllowOverride is set to None for that directory), make the edits in the virtual host configuration file of the corresponding domain, as we did in the earlier method.

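<IfModule mod_rewrite.c>
  RewriteEngine On
  # Match any User-Agent that contains one of these tokens; [NC] ignores case
  RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|yahoo|AhrefsBot|Baiduspider|Ezooms|MJ12bot|YandexBot|bot|agent|spider|crawler|extractor) [NC]
  # Leave the URL unchanged (-) and answer every matching request with 403 Forbidden
  RewriteRule .* - [F,L]
</IfModule>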
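
Before deploying the rule, it can help to see which user agents the pattern would catch. The Python sketch below uses the same alternation as the RewriteCond above. Python's re module is not the regex engine Apache uses, but for a plain alternation like this one the two should agree. The function name and the two sample user-agent strings are assumptions chosen for illustration.

```python
import re

# Same alternation as the RewriteCond; re.IGNORECASE plays the role of [NC].
CRAWLER_PATTERN = re.compile(
    r"(googlebot|bingbot|yahoo|AhrefsBot|Baiduspider|Ezooms|MJ12bot|YandexBot"
    r"|bot|agent|spider|crawler|extractor)",
    re.IGNORECASE,
)


def would_be_blocked(user_agent):
    """Return True if the user agent contains any of the crawler tokens."""
    return CRAWLER_PATTERN.search(user_agent) is not None


crawler_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
browser_ua = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"

print(would_be_blocked(crawler_ua))
print(would_be_blocked(browser_ua))
```

The second string is also what a crawler looks like once it copies a browser's user agent, which is the weakness of this method mentioned earlier.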

If it still does not work, make sure that the rewrite module is enabled. To check, run the command below.

apachectl -M

If rewrite_module does not appear in the output, you will have to enable it before the block can work. On Ubuntu, running sudo a2enmod rewrite and then restarting Apache does this. For more detail, refer to the article Enable Rewrite Module.
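
The module list can be long, so a small script can do the looking for you. This Python sketch scans text in the shape that apachectl -M prints and reports whether a module is listed. The function name is hypothetical and the sample output is a shortened illustration, not a capture from a real server.

```python
def module_is_loaded(apachectl_output, module_name="rewrite_module"):
    """Return True if module_name is listed in `apachectl -M` style output."""
    for line in apachectl_output.splitlines():
        parts = line.split()
        # Module lines look like "rewrite_module (shared)"; the first token is the name.
        if parts and parts[0] == module_name:
            return True
    return False


# Illustrative sample, shortened; real output lists many more modules.
sample_output = """Loaded Modules:
 core_module (static)
 rewrite_module (shared)
"""
print(module_is_loaded(sample_output))
```

To check a live server, feed it the real output, for example subprocess.run(["apachectl", "-M"], capture_output=True, text=True).stdout.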

The above two methods should be sufficient to help you block crawlers from your website. However, if you still have any difficulties, feel free to get in touch with me through the comments section.

Sumit Pore
