Robots.txt for SEO: Everything You Need to Know
Any webmaster knows that SEO is largely out of your hands.
While you can create a site following SEO best practices for the best chance at ranking, search engine crawlers still need to find and crawl your site.
You actually have a measure of control over how web crawlers index your site using robots.txt, even on a page-by-page basis.
In this article you will learn:
- What a Robots.txt file is (and why it’s important for SEO)
- The syntax, explained
- Common mistakes you should avoid
Getting Started with Robots.txt
If you want to have a say in what SEO robots comb through on your site, you’ll need a robots.txt file to do so.
While it doesn’t necessarily have the final say in how Google treats your site, it can have a powerful effect on your SEO results by shaping what crawlers see, and what they skip.
So, if you want to improve your crawl frequency and search performance on Google, how can you create a robots.txt for SEO?
We’re taking it back to the beginning of robots.txt files to break down:
- What they are exactly
- Where to locate them
- How to create your own
- The syntax behind them
- The benefits of using them
- Disallow vs. Noindex
- Mistakes to avoid
Let’s get started by exploring what a robots.txt file is.
What Is a Robots.txt File?
When the internet was still young and brimming with potential, web developers came up with a way to crawl and index new pages on the internet.
These tools were named crawlers, spiders, or robots. You’ve likely heard them all used interchangeably.
A Robots.txt file looks like this:
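A minimal example, with hypothetical paths and URLs, might read:

```
User-agent: *
Disallow: /admin/

Sitemap: https://www.example.com/sitemap.xml
```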
Every now and then, these robots would meander from where they were supposed to be and start crawling and indexing sites that weren’t meant to be indexed – like sites currently under maintenance.
There had to be a solution.
The creator of Aliweb, the world’s first search engine, recommended a “roadmap” solution that would help the robots stay on course. In June 1994, this roadmap was finalized and named the “Robots Exclusion Protocol.”
What does this protocol look like when executed? Like this (courtesy of The Web Robots Pages):
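The canonical example from the protocol is a rule telling every bot to stay away from the entire site:

```
# Applies to all robots; disallows crawling of everything
User-agent: *
Disallow: /
```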
The protocol set in place the guidelines that all bots, including Google’s, are meant to follow. However, some black-hat robots, like spyware or malware, operate outside of these rules.
Want to see what it’s like yourself? Just type in the URL of any website followed by “/robots.txt” at the end. Here’s what Buffer’s file looks like:
Since it’s a relatively small site, there’s not much to it. Try the same with Google’s domain, for example, and you’ll see a lot more.
Where to Find the Robots.txt File
You’ll find your robots.txt file in your site’s root directory. To access it, connect via FTP or open your cPanel file manager, then look in the public_html directory.
There’s not much to these files, so they won’t be very large – a few hundred bytes at most.
Once you get the file opened in your text editor, you’ll see some information about a sitemap and the terms “User-Agent,” “allow,” and “disallow” written up.
You can also just add /robots.txt to the end of most URLs to find it:
How to Create Robots.txt File for SEO
If you need to make your own, know that Robots.txt is a basic text file that’s straightforward enough for a true beginner to create.
Just make sure you have a simple text editor, and then open up a blank sheet that you’ll save as “robots.txt”.
Then, log into your cPanel and find the public_html folder as mentioned above. With the file open and the folder pulled up, drag the file into the folder.
Now, set the correct permissions for the file. You want it set up so that you, as the owner, are the only party with permission to write to and edit that file, while everyone else can only read it. You should see a “0644” permission code.
If you do not see that code, click on the file, then select “file permission.” All done!
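If you prefer the command line, here’s a sketch of the same steps, assuming shell access to the server and that public_html is the document root:

```shell
# Make sure the document root exists (path is an assumption)
mkdir -p public_html

# Create a minimal, permissive robots.txt (placeholder contents)
printf 'User-agent: *\nDisallow:\n' > public_html/robots.txt

# 0644: owner can read/write, everyone else is read-only
chmod 644 public_html/robots.txt

ls -l public_html/robots.txt   # permissions should begin with -rw-r--r--
```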
Robots.txt Syntax Explained
Looking at the robots.txt example above, you’ll see there’s likely some unfamiliar syntax. So what do these words mean? Let’s find out.
The file consists of multiple sections, each a block of directives. Each block begins with a specified user-agent – the name of the specific crawl bot the rules are aimed at.
You have two options here:
- Use a wildcard to address all search engines at one time
- Address each search engine specifically, one by one
When a crawler is sent out to a site, it will be drawn to the section that is speaking to it. Each search engine will handle SEO robots.txt files a bit differently. You can perform some simple research to learn more about how Google or Bing handles things, specifically.
See the “user-agent” portion? This sets apart a bot from the crowd, essentially by calling it by name.
If your goal is to tell one of Google’s crawlers what to do on your site, begin with “User-agent: Googlebot.”
However, the more specific you can get, the better. It’s common to have more than one directive, so call out each bot by name when necessary.
Pro Tip: Most search engines use more than one bot. A little bit of research will tell you the most common bots to target.
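Put together, a file with bot-specific directives might look like this (the paths are hypothetical):

```
# Rules aimed only at Google's main crawler
User-agent: Googlebot
Disallow: /staging/

# Rules for every other bot, via the * wildcard
User-agent: *
Disallow: /tmp/
```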
The host directive, which lets you declare a preferred domain, is currently only supported by Yandex, though you may see some claims that Google supports it.
With this directive, you have the power to determine whether you want to show the www. before your site URL by saying something like this:
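Assuming a Yandex-style host directive and a hypothetical domain, that looks like:

```
User-agent: Yandex
Host: www.example.com
```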
Since we’re only able to confirm that Yandex supports this, it’s not recommended that you rely on it too heavily.
The second line within a section is Disallow. This tool lets you specify which parts of your websites shouldn’t get crawled by bots. If you leave the disallow empty, it essentially tells the bots that it’s a free-for-all, and they can crawl as they please.
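For example, to keep bots out of a couple of folders (hypothetical paths):

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /checkout/

# An empty value, by contrast, allows everything:
# Disallow:
```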
The sitemap directive helps you tell search engines where they can find your XML sitemap, which is a digital map that can help search engines find important pages on your site and learn how often they are updated.
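It’s a single line, usually placed at the top or bottom of the file (hypothetical URL):

```
Sitemap: https://www.example.com/sitemap.xml
```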
You’ll find that search engines can become a bit trigger-happy when crawling, but you can keep them at bay for a while with the crawl-delay directive. Yandex and Bing support it; Google ignores it, so use Search Console’s crawl-rate settings for Googlebot instead.
When you add a line that says “Crawl-delay: 10,” you’re asking bots to wait ten seconds between requests to your site (the exact interpretation varies by search engine).
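For instance, to slow down Bing’s crawler:

```
User-agent: Bingbot
Crawl-delay: 10
```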
Benefits of Using Robots.txt for SEO
Now that we’ve covered the basics of robots.txt files and gone over a couple of directive uses, it’s time to put together your file.
While a robots.txt file isn’t a required aspect of a successful website, there are still many key benefits that you should be aware of:
- Keep Bots Away from Private Files – You can prevent crawlers from looking in your private folders, which makes them much harder to index.
- Maintain Resources – Every time a bot crawls your site, it’s going to use up server resources like bandwidth and more. If your site has a lot of content on it, like an e-commerce site, you’ll be amazed at how quickly these resources can be drained. You can use robots.txt for SEO to make it harder for spiders to access individual aspects, helping to retain your most valuable resources for true site visitors.
- Clarify Sitemap Location – If you want a crawler to go through your sitemap, you want to ensure it knows where to go. Robots.txt files can help with this.
- Protect Duplicate Content from SERPs – By adding a specific rule to your robots, you can keep them from indexing pages on your website that hold duplicate content.
Naturally, you want search engines to work their way through the most critical pages on your site.
If you confine the bots to specific pages, you’ll have better control over which pages are then placed in front of searchers on Google.
Just be sure you never totally block off a crawler from seeing certain pages – it can earn you penalties.
Disallow vs. Noindex
If you don’t want a crawler to access a page, historically you would use either a disallow rule or a noindex directive. However, in 2019, Google announced that it had stopped supporting the robots.txt noindex directive, along with a few other unsupported rules.
For those who still want noindex behavior, there are a few options to choose from instead:
- Noindex Tag – You can implement this as an HTTP response header with an X-Robots-Tag, or you can create a <meta> tag, which you can implement in the <head> section. Just remember that if you blocked bots from this page, they’ll likely never see the tag and may still include the page in SERPs.
- Password Protect – In most cases, if you hide a page behind a password entry, it shouldn’t end up in Google’s index.
- Disallow Rule – When you add specific disallow rules, search engines won’t crawl the page. Just keep in mind that they may still index it based on information they gather from other pages and links.
- 404/410 HTTP Status Codes – The 404 and 410 status codes indicate web pages that no longer exist. Once such a page is fully processed, it will be permanently dropped from Google’s index.
- Search Console Remove URL – This tool won’t solve the indexing problem completely, but it will remove the page temporarily.
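The noindex tag from the first option is a single line placed in the page you want kept out of search results:

```
<!-- Inside the <head> of the page to be excluded -->
<meta name="robots" content="noindex">
```

The equivalent HTTP response header is `X-Robots-Tag: noindex`, set in your server configuration; how you set it depends on your server software.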
So, what’s better? Noindex or the disallow rule? Let’s dive in.
Since Google no longer officially supports noindex in robots.txt, you’ll have to rely on the alternatives listed above or on the tried-and-true disallow rule. Just know that disallow won’t be quite as effective as a proper noindex tag.
While it does block the bots from crawling that page, they can still gather information from other pages as well as both internal and external links, which could lead to that page showing up in SERPs.
5 Robots.txt Mistakes to Avoid
We have now talked about what a robots.txt file is, how to find or create one, and the different ways to use it. But we haven’t talked about the common mistakes that too many people make when using robots.txt files.
When not used correctly, you may run into an SEO disaster. Avoid this fate by steering clear of these common mistakes:
1. Blocking Good Content
You do not want to block any good content that would be of help to site crawlers and users searching for your site on search engines.
If you use a noindex tag or robots.txt file to block good content, you’ll hurt your own SEO results.
If you notice lagging results, check through your pages thoroughly for any disallow rules or noindex tags.
2. Overusing the Crawl-Delay Directive
If you use the crawl-delay directive too often, you’ll limit how many pages the bots can crawl.
While this might not be an issue for huge sites, smaller sites with limited content can hurt their own chances of earning high SERP rankings by overusing these tools.
3. Preventing Content Indexing
If you want to prevent bots from crawling the page directly, disallowing it is your best bet.
However, it won’t always work. If the page is linked to from external sites, it can still end up in the index.
Additionally, illegitimate bots like malware don’t subscribe to these rules, so they’ll crawl the content anyway.
4. Using Improper Cases
It’s important to note that robots.txt files are case-sensitive.
The file itself must be named “robots.txt” in all lowercase, and the paths in your directives must match your URLs exactly.
A rule like “Disallow: /Photo/” will not block pages under “/photo/”.
5. Shielding Malicious Duplicate Content
Sometimes, duplicate content is necessary, and you’ll want to hide it from being indexed.
But other times, Google bots will know when you’re trying to hide something that shouldn’t be hidden. A lack of content, sometimes, can actually draw attention to something fishy going on.
If Google sees that you’re attempting to manipulate rankings in order to get more traffic, they may penalize you.
But you can get around this by rewriting the duplicate content, adding a 301 redirect, or using a rel=”canonical” tag.
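A rel=”canonical” tag goes in the `<head>` of the duplicate page and points at the version you want indexed (hypothetical URL):

```
<link rel="canonical" href="https://www.example.com/original-page/">
```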
Put It All Together
Now that you know everything about robots.txt for SEO, it’s time to use what you’ve learned to create a file and test it out.
It might take you a bit to get the hang of the process and ensure you’ve set everything up as you like it, but once you’re set, you’ll see the difference that comes with taking control over how search bots handle your site.
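One easy way to test a draft of your rules is Python’s standard-library robots.txt parser; the rules and URLs below are illustrative:

```python
# Check which URLs a given set of robots.txt rules would block,
# using Python's built-in parser (no network access required).
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Blocked by the Disallow rule:
print(parser.can_fetch("*", "https://example.com/private/page.html"))  # False
# Not covered by any rule, so crawling is allowed:
print(parser.can_fetch("*", "https://example.com/blog/post.html"))     # True
```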
That’s the power of robots.txt files.