If you work in SEO or anything related to it, the terms robots.txt and noindex are probably already in your vocabulary. If not, you have come to the right place to learn them. Both are generally used to keep search engines from crawling or indexing parts of your site. Why would anyone want that? Here are a few cases where they are useful:
Where can you use them?
You can use them for pages that are not important to search engines and that you would not want shown in search results. For example:
- Admin section pages of a site.
- Non-converting pages, so that search engines focus only on the converting ones.
- Search Results pages.
- Error pages.
The question that arises now is: what is the difference between them? So let's look at what robots.txt and noindex are, and how they differ.
Crawling and Indexing
Before getting into the details, I would like to explain the difference between crawling and indexing, as it makes the rest much easier to understand.
Crawling is when Googlebot (Google's automated crawler) visits a page on your site and reads the content inside it.
Indexing is when Google saves the address of that page in its index (its collection of URLs and the information gathered about them).
Robots.txt is a plain text file that you upload to the root directory of your site. It can be found at www.sitename/robots.txt and contains instructions for search engines to follow. If you use the ‘Disallow’ directive for a particular directory or page, search engines that honor the file will not crawl it. For example:
User-agent: *
Disallow: /wp-admin/
Disallow: /test/abc.html
In the above case, search engines will not crawl the wp-admin directory or the page abc.html. By "crawl" I mean that they will not read the page, but they might still index it. For example, if some other page links to abc.html, search engines might show it in search results in rare cases (when there is no other relevant result to show). It will then appear only as a bare URL without a description, because the page was never crawled and the search engine has no information about it. So robots.txt assures you that your page will not be read by search engines, but it does not guarantee that the page is kept out of the index. Matt Cutts has explained this behavior in a video.
Also note that paths in robots.txt are case-sensitive: Disallow: /thispage.html does not block /ThisPage.html.
A rule only blocks URLs whose path begins with exactly those characters, in the same case. So if you have canonical issues (the same content under variant URLs, including case differences), chances are that robots.txt will not block all of the variants.
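If you want to check how a set of rules treats specific URLs before uploading the file, you can try them out locally. Here is a minimal sketch using Python's standard urllib.robotparser module. The example.com URLs and the is_crawlable helper are made up for illustration, and this parser only approximates how any particular search engine reads the file.

```python
from urllib.robotparser import RobotFileParser

# The example rules from this article.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /test/abc.html
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url.

    Hypothetical helper for illustration; it is not part of any SEO tool.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain.
    for path in ("/wp-admin/options.php", "/test/abc.html",
                 "/test/ABC.html", "/index.html"):
        url = "https://example.com" + path
        verdict = "allowed" if is_crawlable(ROBOTS_TXT, url) else "blocked"
        print(path, "->", verdict)
```

With these rules, /wp-admin/options.php and /test/abc.html come out as blocked, while /test/ABC.html and /index.html are allowed, which shows the case sensitivity described above.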
So if robots.txt cannot deindex your pages, how do you do that? There are two ways: one is a URL removal request and the other is noindex.
Request removal of an entire page:
- Go to the Google public URL removal tool.
- Click New Removal Request.
- Type the URL of the webpage you want removed (not the Google search results URL or cached page URL). The URL is case-sensitive—use exactly the same characters and capitalization that the site uses.
- Click Continue.
- Click Remove this page.
Noindex is a meta tag that you put in the head section of a page. Unlike robots.txt, noindex lets search engines read the page but instructs them not to keep it in the index, and to drop it if it was indexed before. That means when a search engine reaches a page with the noindex meta tag, it still reads the content inside it, including the links (so link juice is passed), but it does not index the page. For example:
<meta name="robots" content="noindex" />
This line of code in the head section will prevent search engines from indexing the page. The drawback is that you need to put it on every page you want to deindex, so it becomes difficult to manage when the number of pages grows. The good thing is that it is supported by all the major search engines.
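Because the tag has to be present on every page, it helps to be able to verify that a page really carries it. Here is a minimal sketch using Python's standard html.parser module. The RobotsMetaParser class and the has_noindex function are names I made up for this example, and the check only looks at the meta tag in the HTML you pass in.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma-separated; compare them in lowercase.
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return "noindex" in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex" /></head></html>'
    print(has_noindex(page))
```

You could run a check like this over the HTML of each page you intend to deindex to catch the ones where the tag was forgotten.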
Which to use and when?
I would suggest using the noindex meta tag instead of robots.txt if you want to deindex a page or directory from search engine records. There are two reasons for this. First, the page will be deindexed by the search engines themselves the next time it is crawled, so you do not need to do it manually (for example by sending a URL removal request). Second, it does not waste your PageRank: a noindex page is still read by search engines, so PageRank passes through its links, which is not the case for a page blocked by robots.txt.
The issue with noindex is that it has to be applied page by page, so it becomes difficult to manage, while robots.txt is easier because everything lives in a single file. Keep in mind that a crawler has to read a page to see its noindex tag, so do not block a noindex page in robots.txt as well. Think over both options and go with the one that suits your requirements.
Things to Keep in Mind
1) A page blocked by robots.txt will not be read by search engines, so the links on that page will not be followed. Link juice cannot pass through it, and the PageRank is wasted.
2) A page with the noindex tag, on the other hand, will be read by search engines, and link juice will pass to the pages it links to (if the links are dofollow). The PageRank is put to use even though the page itself is not indexed.
3) If you use robots.txt together with a URL removal request to Google, that will work and the page will be deindexed, but Google will never crawl that page again and therefore will not follow any of the links on it. Because you are blocking the crawler, your site will not be crawled as thoroughly, which means pages can be missed and a lower percentage of your pages will be indexed.
4) Disallowing a URL in robots.txt does NOT mean it will magically be removed from the index. That is what the URL removal request tool is for.
Some Related Stuff
- NOINDEX tag tells Google not to index a specific page
- NOFOLLOW tag tells Google not to follow the links on a specific page
- NOARCHIVE tag tells Google not to store a cached copy of your page
- NOSNIPPET tag tells Google not to show a snippet (description) under your Google listing; it will also not show a cached link in the search results
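These directives can be combined in a single robots meta tag by separating them with commas. Here is a small sketch of a helper that builds such a tag. The robots_meta_tag function and its parameter names are my own invention for illustration, not part of any library.

```python
def robots_meta_tag(noindex=False, nofollow=False,
                    noarchive=False, nosnippet=False):
    """Build a robots meta tag from the directives switched on.

    Hypothetical helper: returns an empty string when no directive is set.
    """
    flags = [
        ("noindex", noindex),
        ("nofollow", nofollow),
        ("noarchive", noarchive),
        ("nosnippet", nosnippet),
    ]
    directives = [name for name, enabled in flags if enabled]
    if not directives:
        return ""
    return '<meta name="robots" content="{}" />'.format(", ".join(directives))


if __name__ == "__main__":
    print(robots_meta_tag(noindex=True, nofollow=True))
    # prints: <meta name="robots" content="noindex, nofollow" />
```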