Robots Exclusion Standard

From Just Solve the File Format Problem
Latest revision as of 02:42, 2 July 2019

File Format
Name Robots Exclusion Standard
Ontology
Extension(s) .txt
Wikidata ID Q80776

The Robots Exclusion Standard is a method by which webmasters can specify which parts of their site they don't want robots to scan, index, or retrieve. This is done with a file named robots.txt in the root directory of the site. Well-behaved robots read this file before taking any action on a site, which is why web access logs show requests for this filename even when no such file exists. Less-well-behaved robots such as spambots and malware ignore the file: it is a voluntary standard with no means of enforcement, so it is useful only for instructing reasonable robots such as Googlebot. It does, however, effectively remove a site "retroactively" from the Internet Archive Wayback Machine, because the Wayback Machine refuses to display pages (even ones captured in past scans) in domains or directories that are currently excluded from robots by a robots.txt file.
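Because the file always lives in the root directory of the site, a robot can work out where to look from any page URL. The following is a minimal sketch of that step; the function name `robots_txt_url` and the example.com address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path with /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("http://www.example.com/a/b.html?x=1#top"))
```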

The file format is plain text, probably ASCII (the standard does not specify a character encoding). The standard explicitly allows any of the common line break conventions (CR, LF, or CR+LF). Everything following a # character on a line is a comment, as is any whitespace immediately preceding the #. Lines containing nothing but a comment are ignored entirely, so they do not count as blank lines for the purpose of ending a section of the file.
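The line-break and comment rules can be sketched in a few lines of Python. This is an illustration of the rules as described above, not a reference implementation; the function name `logical_lines` is hypothetical.

```python
import re


def logical_lines(text):
    """Split robots.txt text into lines with comments removed.

    Comment-only lines are dropped; truly blank lines are kept as ""
    so that a caller can still use them as section separators.
    """
    lines = []
    # Accept CR+LF, bare CR, or bare LF as a line break.
    for raw in re.split(r"\r\n|\r|\n", text):
        if "#" in raw:
            # Drop the comment and any whitespace in front of it.
            content = raw.split("#", 1)[0].rstrip()
            if not content.strip():
                continue  # comment-only line: ignored, not a blank line
        else:
            content = raw
        lines.append(content.strip())
    return lines
```

Keeping blank lines in the result (rather than filtering them out) is a deliberate choice here, since blank lines carry meaning as section terminators.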

The only standard commands in the file are "User-agent" and "Disallow". A command starts at the beginning of a line and is followed by a colon (:) and then its parameter value. Several nonstandard commands are also sometimes used.
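A command line can be split into its name and value at the first colon. The sketch below assumes that command names are matched case-insensitively and that stray leading whitespace is tolerated; both are leniency choices made for this example, and `parse_command` is a hypothetical name.

```python
def parse_command(line):
    """Split 'Name: value' into (name, value), or return None if malformed."""
    # Split on the first colon only, so values that contain colons
    # (such as URLs) stay intact.
    name, sep, value = line.partition(":")
    if not sep or not name.strip():
        return None
    return name.strip().lower(), value.strip()


if __name__ == "__main__":
    print(parse_command("User-agent: *"))
    print(parse_command("Disallow: /cgi-bin/"))
```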

One such extended command is "Sitemap", which specifies the location of a sitemap formatted in accordance with the sitemap standard. Its value is the URL of the sitemap, which can be in the same domain as the site or in a different one.
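As one way to read this command in practice, Python's standard library module `urllib.robotparser` collects Sitemap values while parsing (the `site_maps()` method requires Python 3.8 or later). The robots.txt content and the example.com and example.net URLs below are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, given as a list of lines.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "",
    "Sitemap: http://example.com/sitemap.xml",
    "Sitemap: http://example.net/other-sitemap.xml",
]


def sitemap_urls(lines):
    """Return the sitemap URLs listed in robots.txt lines (possibly empty)."""
    parser = RobotFileParser()
    parser.parse(lines)
    # site_maps() returns None when no Sitemap command is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    for url in sitemap_urls(ROBOTS_LINES):
        print(url)
```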

To keep robots out of your cgi-bin directory you can use:

User-agent: *
Disallow: /cgi-bin/

The asterisk means the rule applies to all user agents. It's also possible to identify specific robots by their user-agent strings and exclude them from particular paths without affecting other robots. A User-agent line applies to all following Disallow commands until a blank line is reached.
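The effect of per-robot sections can be checked with Python's standard library parser, `urllib.robotparser`. The sketch below uses a made-up robot name ("BadBot") and example.com URLs: one section shuts that robot out entirely, and a second, separated by a blank line, applies the cgi-bin rule to everyone else.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with two sections separated by a blank line.
ROBOTS_LINES = [
    "User-agent: BadBot",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow: /cgi-bin/",
]


def build_parser(lines):
    """Parse robots.txt lines without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


if __name__ == "__main__":
    rp = build_parser(ROBOTS_LINES)
    for agent, url in [
        ("BadBot", "http://example.com/index.html"),
        ("Googlebot", "http://example.com/index.html"),
        ("Googlebot", "http://example.com/cgi-bin/script"),
    ]:
        print(agent, url, rp.can_fetch(agent, url))
```

This library matches Disallow values as path prefixes, which is how the original convention describes them; other parsers may add extensions such as wildcards.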

Robots meta tag values such as "noindex" and "nofollow" can be used in HTML for related effects.
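These values normally appear in the content attribute of a meta element whose name is "robots". As a rough sketch of how a robot might read them, the following uses Python's standard `html.parser`; the class and function names are hypothetical, and splitting the content on commas is an assumption about the common form.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are assumed to be comma-separated.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives
```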

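Standards

  • "Official" site: http://www.robotstxt.org/ (actually a non-binding consensus, not a formal standard)

Sample Files

  • Google's robots.txt: http://www.google.com/robots.txt
      ◦ And they also have a killer-robots.txt: http://www.google.com/killer-robots.txt
  • IBM's robots.txt: http://www.ibm.com/robots.txt

Utilities

  • Google's robots.txt parser, now open-source: https://github.com/google/robotstxt
  • Perl robots.txt parser: https://github.com/randomstring/ParseRobotsTXT

Specific search engine / robot policies

  • Google / Googlebot: https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
  • Yahoo / Slurp: https://help.yahoo.com/kb/search/slurp-crawling-page-sln22600.html?impressions=true
  • Bing / Bingbot: http://www.bing.com/blogs/site_blogs/b/webmaster/archive/2012/05/03/to-crawl-or-not-to-crawl-that-is-bingbot-s-question.aspx

Other links and references

  • Robots Exclusion Standard (Wikipedia): http://en.wikipedia.org/wiki/Robots_exclusion_standard
  • Robots.txt generator/tutorial: http://www.mcanerin.com/EN/search-engine/robots-txt.asp
  • ROBOTS.TXT DISALLOW: 20 Years of Mistakes To Avoid: http://www.beussery.com/blog/index.php/2014/06/robots-txt-disallow-20/
  • Robots.txt article from the Archive Team website: http://archiveteam.org/index.php?title=Robots.txt
  • Google's robots.txt Parser is Now Open Source: https://opensource.googleblog.com/2019/07/googles-robotstxt-parser-is-now-open.html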
