Understanding robots.txt: Format, functionality and best practices

The robots.txt file is a powerful tool in the world of SEO and web development. Despite its simple structure, it plays a crucial role in telling search engine bots and web crawlers how to interact with your website: what they are allowed to crawl and what they should stay away from.

In this article, I will review the format of the robots.txt file, how it works, and best practices to ensure your site is both optimized for search engines and protected from unwanted indexing.

What is robots.txt?

The robots.txt file is a plain text file that you place in the root directory of your website. It is essentially a list of instructions for web crawlers, telling them which parts of your site they can or cannot access. For example, if your website has dashboards or admin pages that shouldn't be indexed and searchable in engines, then specifying a disallow rule for them in the file is important.

And on the other end of things, you can also specify which pages you do allow bots and crawlers to visit. For example, you might have an internal-only folder named '/dashboard/' that includes a single page you do want indexed for SEO reasons. You can specifically instruct bots that they should indeed crawl that page, as shown in the sketch below.
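Here's a minimal sketch of what that might look like; the '/dashboard/' folder and the 'status.html' page are hypothetical paths used purely for illustration:

User-agent: *
Disallow: /dashboard/
Allow: /dashboard/status.html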

How Does robots.txt Work?

When a web crawler visits your site, it first checks for the robots.txt file before beginning its crawling process. Based on the rules defined in this file, the crawler decides which URLs it may scan and index. If no robots.txt file is present, the crawler assumes it has permission to crawl the entire site.

Keep in mind, however, that this file does not actually prevent indexing on its own. Ethical crawlers, such as those from Google and Bing, will abide by the rules set in the robots.txt file, but other unknown third-party bots can still crawl your content without bothering to check for permissions.
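To see how a well-behaved crawler interprets these rules, here is a small sketch using Python's standard urllib.robotparser module. The domain and paths are placeholders reused from the examples later in this article:

from urllib import robotparser

# Download the site's robots.txt, just like a polite crawler does
# before it starts crawling anything.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.site.com/robots.txt")
rp.read()

# Ask whether a given user-agent is allowed to fetch a given URL.
print(rp.can_fetch("*", "https://www.site.com/private/"))
print(rp.can_fetch("Googlebot", "https://www.site.com/private/public.html"))

With the example file shown later in this article, the first check would come back False and the second True.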

The robots.txt File Format

The robots.txt file follows a very simple format consisting of one or more "User-agent" directives followed by "Disallow" or "Allow" directives. Here's a breakdown:

User-agent: Specifies the bot to which the rules apply. You can target specific bots or use * to apply to all bots.

User-agent: *

Disallow: Tells the bot which URLs or directories it should not access.

Disallow: /private/

Allow: Specifically allows certain pages or directories, even if they are under a disallowed directory.

Allow: /private/public.html

Sitemap: You can also include the location of your XML sitemap to help search engines discover your site's pages more effectively.

Sitemap: https://www.site.com/sitemap.xml

Example of a robots.txt File

Here’s an example of a robots.txt file:

User-agent: *
Disallow: /admin/
Disallow: /private/


User-agent: Googlebot
Allow: /private/public.html


Sitemap: https://www.site.com/sitemap.xml

In this example, all bots are disallowed from accessing the /admin/ and /private/ directories. Googlebot, however, is specifically allowed to access /private/public.html despite the disallow rule on the /private/ directory. And lastly, the location of the sitemap is provided to help crawlers.

Best Practices for Using robots.txt

Here are a few helpful guidelines when implementing a robots.txt file:

Keep it simple

In general, most robots.txt files that I've seen were kept on the shorter side, unless the site had a very complex folder structure that mixed both private and public-facing content. In those cases, the files were larger and more difficult to manage, so keep yours as short as your structure allows.

Validate your robots.txt

Ensuring that crawlers can actually visit and read your robots.txt file is important for obvious reasons. Google Search Console includes a robots.txt report that will flag any issues with your file's structure.

That can be found under: Settings -> Crawling -> Open Report
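Outside of Search Console, a quick sanity check is to confirm that the file is publicly reachable at all. Here is a small sketch using Python's standard library; 'www.site.com' is a placeholder domain:

from urllib import request

# Fetch the robots.txt file and confirm it responds with HTTP 200.
# urlopen raises an HTTPError for 4xx/5xx responses, which also surfaces problems.
with request.urlopen("https://www.site.com/robots.txt") as resp:
    print(resp.status)                  # expect 200 if the file is reachable
    print(resp.read().decode()[:200])   # spot-check the first few rules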

Avoid blocking issues

The biggest issue you might face if you implement your robots.txt file incorrectly is that you accidentally prevent the crawling of every page on your website. The directive that does this isn't very complicated, and it can easily go unnoticed.

User-agent: *
Disallow: /

In this example, the wildcard * specifies that the rule applies to every crawler, and the Disallow value of '/' tells each bot not to crawl any page on the site.

Like I said, an easy error that can go unnoticed.
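One way to catch the mistake before it ships, assuming you know a handful of URLs that must always stay crawlable, is to parse the candidate file and test those URLs against it. A rough sketch in Python; the URLs are illustrative:

from urllib import robotparser

# The rules you are about to deploy (here, the accidental full block from above).
rules = """
User-agent: *
Disallow: /
""".splitlines()

# Pages that should always remain crawlable.
critical_urls = [
    "https://www.site.com/",
    "https://www.site.com/blog/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

for url in critical_urls:
    if not rp.can_fetch("*", url):
        print(f"WARNING: {url} would be blocked for all crawlers")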

Common User-Agents

The following is a list of some of the most common bots and crawlers that you can directly target in your robots.txt, with a combined example after the list:

Googlebot
  • Used by Google to crawl sites for indexing
  • User-agent: Googlebot
Bingbot
  • Used by Bing to crawl sites for indexing
  • User-agent: Bingbot
Slurp
  • Used by Yahoo to crawl sites for indexing
  • User-agent: Slurp
DuckDuckBot
  • Used by DuckDuckGo to crawl sites for indexing
  • User-agent: DuckDuckBot
Baiduspider
  • Used by Baidu to crawl sites
  • User-agent: Baiduspider
YandexBot
  • Used by Yandex (Russian search engine) to crawl sites for indexing
  • User-agent: YandexBot
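
As a rough sketch, here is how several of these user-agents might be targeted in one file; the directories are hypothetical and only meant to show the per-bot grouping:

User-agent: Googlebot
Disallow: /drafts/

User-agent: Bingbot
Disallow: /drafts/

User-agent: *
Disallow: /admin/

Each group applies only to the bot named in its User-agent line, and the * group acts as the fallback for every crawler not matched by name.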

Conclusion

The robots.txt file is a very important tool in your SEO toolkit. It offers you control over how search engines interact with your site. By understanding its format and applying best practices, you can enhance your site’s visibility while protecting sensitive content from unwanted indexing.

Walter Guevara is a Computer Scientist, software engineer, startup founder and previous mentor for a coding bootcamp. He has been creating software for the past 20 years.
