Robots.txt files are very basic in structure but play an important role: they tell search engine crawlers which content they can and cannot crawl for indexing. The main purpose of this file is to keep certain content within your website’s directories from being crawled and, as a result, from being displayed in search engine results pages.
While there is no governing body that enforces whether crawlers are accessing content they shouldn’t, most reputable crawlers abide by the rules set forth in the robots.txt file.
When a crawler reaches your site, it first looks for this file, which should reside in the root directory of your site and nowhere else. If it finds one, it scans the rules inside and proceeds only where it is allowed.
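As it happens, Python’s standard library ships a parser for exactly these rules, so we can sketch the check a well-behaved crawler performs before fetching a page. The directory names and URLs below are purely illustrative:

```python
# Sketch of the check a polite crawler runs before fetching a URL.
# The rules and URLs here are made-up examples, not a real site's file.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",      # rules apply to every crawler
    "Disallow: /admin/",  # keep crawlers out of /admin/
]

parser = RobotFileParser()
parser.parse(rules)

# The homepage is fair game; anything under /admin/ is off-limits.
print(parser.can_fetch("*", "https://www.example.com/"))        # True
print(parser.can_fetch("*", "https://www.example.com/admin/"))  # False
```

A real crawler would download the file from your site first (that is what `RobotFileParser.set_url()` plus `read()` do), but parsing a list of lines keeps the sketch self-contained.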
Here is an example of how Google’s robots.txt file looks. You’ll see on the very first line some text that says “User-agent: *”. In simple terms, the User-agent line identifies which web crawler the rules apply to, and the asterisk is a kind of “wild card” stating that ANY crawler can access the site for indexing purposes. There are ways within this line to give some crawlers the green light while blocking others, but for today we are just focusing on the simpler method of allowing all crawlers.
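A simplified excerpt along these lines (Google’s actual file is much longer and changes over time, so treat this as illustrative):

```
User-agent: *
Disallow: /search
Allow: /search/about
```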
In order to determine whether your site should be using a robots.txt file, take a look at the remote files and folders for your website. If you see directories such as /admin/ or /includes/, they should probably be blocked from crawlers, since they don’t contain any content relevant to search. Here is how this would look in your robots.txt file:
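Assuming those two directory names, a minimal file might read:

```
User-agent: *
Disallow: /admin/
Disallow: /includes/
```

Each Disallow line blocks one path, and everything not listed remains crawlable.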
Here’s an example of how the URL would look if you wanted to review it in your browser:
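Using a placeholder domain in place of your own, the address is simply your site’s root followed by the filename:

```
https://www.example.com/robots.txt
```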
And that’s it! There are some other elements, however, that you can include in robots.txt, such as your XML sitemap location. This should not be confused with an HTML site map that lists the URLs of your site for easy navigation, though! I’ll be talking more about the importance of an XML sitemap and how it differs from “site maps” in my next blog post.
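For reference, the sitemap location is declared with a single line anywhere in the file (the URL here is a placeholder for your own):

```
Sitemap: https://www.example.com/sitemap.xml
```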