Websites must be developed by following the highest coding and testing standards in order to achieve an exemplary result. All aspects concerned with security, coding, revisioning, and testing should be considered in doing so.
The conventional practice we adopt while building a website is to develop and test the website on a staging server. We then give the client a demonstration of the website using the staging site. Following an approval from the client we make the website live for the users on the designated server. In this complete process, staging sites play a vital role. Since the staging sites are to be used only by us and the clients, we don’t want any crawler or search engine to crawl the content of our staging site.
To achieve this, I made a few changes to the Apache configuration and created a robots.txt file in the root folder. The file contains instructions that disallow web crawlers from going any further into the site. When a well-behaved crawler visits a website, it first requests the robots.txt file. It then reads the instructions in that file, finds that it is disallowed, and does not crawl the site any further. Keep in mind that robots.txt is advisory: it only works for crawlers that choose to honor it.
Steps to Create Robots.txt File
You will need SSH access to your staging server. I implemented this on Apache 2.4.7 running on Ubuntu 14.04.
Open your terminal and connect to your server using SSH. Now create a file named ‘robots.txt’ in the /var/www folder. Execute the command below to create this file:
[pre]nano /var/www/robots.txt[/pre]
[space]
This opens the editor in your terminal. If it fails with a permission error, run the same command with sudo:
[pre]sudo nano /var/www/robots.txt[/pre]
[space]
Copy the two lines below in the robots.txt file.
[pre]User-agent: *
Disallow: /[/pre]
[space]
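If you would rather not paste into an editor, the same two lines can be written non-interactively. The following is a minimal sketch, not part of the original setup: `write_robots_txt` is a hypothetical helper name, and the demo writes to a scratch directory so it can be tried without root.

```shell
# Write the two-line "block everything" robots.txt to the path given as $1.
# write_robots_txt is a hypothetical helper name used only for this sketch.
write_robots_txt() {
    printf 'User-agent: *\nDisallow: /\n' > "$1"
}

# Demo against a scratch directory so the sketch runs without root.
# On the staging server the target would be /var/www/robots.txt, written as root.
scratch_dir=$(mktemp -d)
write_robots_txt "$scratch_dir/robots.txt"
cat "$scratch_dir/robots.txt"
```

On the server itself, one way to do the same thing as root is to pipe that printf into `sudo tee /var/www/robots.txt`.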
After adding the above two lines to the robots.txt file, save it. You will now need to change the owner and permissions of the robots.txt file. First, we will change the owner of the file to the user that Apache runs as. Execute the following command to find that user.
[pre]ps -ef | grep apache2 | grep -v `whoami` | grep -v root | head -n1 | awk '{print $1}'[/pre]
[space]
Most of the time it returns ‘www-data’. If it returns something else, note down that username, replace ‘www-data’ with it in the command below, and then run the command to change the owner of robots.txt.
[pre]sudo chown www-data /var/www/robots.txt[/pre]
[space]
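As a cross-check on the username used in the chown command above, Ubuntu's Apache packaging also records the run user in /etc/apache2/envvars. The sketch below reads it from there. It assumes the stock Ubuntu layout, where that file contains a line of the form `export APACHE_RUN_USER=...`, and `apache_run_user` is a hypothetical helper name.

```shell
# Print the value of APACHE_RUN_USER from an envvars-style file given as $1.
# apache_run_user is a hypothetical helper name used only for this sketch.
apache_run_user() {
    sed -n 's/^export APACHE_RUN_USER=//p' "$1" | head -n 1
}

# Stock Ubuntu location (an assumption; other layouts may differ).
envvars=/etc/apache2/envvars
if [ -r "$envvars" ]; then
    apache_run_user "$envvars"
fi
```

If this prints a different name from the ps pipeline, the ps result is the one that reflects the processes actually running.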
Now we have to change the permissions of the robots.txt file using the chmod command, as shown below. Mode 644 gives the owner read and write access and everyone else read-only access.
[pre]sudo chmod 644 /var/www/robots.txt[/pre]
[space]
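To confirm that the chown and chmod commands above took effect, you can print the file's owner and octal mode. This is a small sketch that assumes GNU stat (the version shipped with Ubuntu). `owner_and_mode` is a hypothetical helper name, and the demo runs on a scratch file so it does not need root.

```shell
# Print "<owner> <octal mode>" for the file given as $1 (GNU stat syntax).
# owner_and_mode is a hypothetical helper name used only for this sketch.
owner_and_mode() {
    stat -c '%U %a' "$1"
}

# Demo on a scratch file so the sketch runs without root.
scratch_file=$(mktemp)
chmod 644 "$scratch_file"
owner_and_mode "$scratch_file"
```

On the server, `owner_and_mode /var/www/robots.txt` should report the Apache user followed by 644.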
The robots.txt file is now ready to be used. Next, let’s make a few changes to the Apache configuration so that a request for robots.txt on any site defined as a virtual host is answered with the file we just created.
Steps to Change Apache Configuration
Let’s open apache2.conf, which is in the /etc/apache2 directory. To open the apache2.conf file, execute this command:
[pre]sudo nano /etc/apache2/apache2.conf[/pre]
[space]
Add the line below at the bottom of that file.
[pre]Alias /robots.txt /var/www/robots.txt[/pre]
The above line tells Apache to serve the file /var/www/robots.txt whenever a request for /robots.txt is made. (The Alias directive is described in the Apache mod_alias documentation.)
[space]
Save the file after adding the above line and reload apache. Apache can be reloaded using the following command.
[pre]sudo service apache2 reload[/pre]
[space]
You are done! Now, if you have created a site abc.com as a virtual host, a request for abc.com/robots.txt will show the following content:
[pre]User-agent: *
Disallow: /[/pre]
[space]
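If you host several staging sites, you may want to check the served file automatically. The sketch below reads robots.txt content on standard input and succeeds only if it finds a `Disallow: /` rule under `User-agent: *`. `blocks_all_crawlers` is a hypothetical helper name, and the check is deliberately simple: it ignores comments and does not handle several User-agent lines sharing one group of rules.

```shell
# Succeed (exit 0) if stdin contains "Disallow: /" under "User-agent: *".
# blocks_all_crawlers is a hypothetical helper name used only for this sketch.
blocks_all_crawlers() {
    awk '
        # Normalise each line: drop a trailing CR, lowercase, remove whitespace.
        { sub(/\r$/, ""); line = tolower($0); gsub(/[ \t]+/, "", line) }
        # A User-agent line starts a new group; remember whether it is "*".
        line ~ /^user-agent:/ { in_star = (line == "user-agent:*"); next }
        # A site-wide Disallow inside the "*" group is what we are looking for.
        in_star && line == "disallow:/" { found = 1 }
        END { exit !found }
    '
}

# Demo with the expected file content supplied inline.
printf 'User-agent: *\nDisallow: /\n' | blocks_all_crawlers && echo "crawlers blocked"
```

Against a live site you could feed it the real file instead, for example `curl -s http://abc.com/robots.txt | blocks_all_crawlers`, assuming curl is installed.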
You can also add HTTP Basic Authentication on top of the above method as an additional measure to keep crawlers away from your staging site. With it enabled, visitors are shown a username and password prompt before any content is served.
Thanks for reading 🙂