Robots Exclusion Standard

File Format
Name	Robots Exclusion Standard
Ontology	Electronic File Formats > Web > Robots Exclusion Standard
Extension(s)	.txt

Revision as of 12:56, 5 December 2012

The Robots Exclusion Standard is a convention by which webmasters can specify which parts of their site they don't want robots to scan, index, or retrieve. The rules go in a file named robots.txt in the root directory of the site. Well-behaved robots request this file before taking any other action on a site, which is why web access logs show attempted accesses for this filename even when no such file exists. Less well-behaved robots, such as spambots and malware, don't heed the file. It is a voluntary standard with no means of enforcement, so its use is limited to instructing cooperative robots such as Googlebot.
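Because the file always lives at the root of the site, a robot can work out where to look from any page address. The following is a minimal sketch of that step in Python, not taken from any particular crawler; the function name robots_txt_url and the example.com address are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host (including any port); the path is
    # always /robots.txt, and query string and fragment are dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page address used only for illustration.
print(robots_txt_url("http://www.example.com/cgi-bin/search?q=robots"))
# prints http://www.example.com/robots.txt
```

A robot would fetch this address first and only then decide which other pages on the site it may request.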

To keep robots out of your cgi-bin directory you can use:

User-agent: *
Disallow: /cgi-bin/

The asterisk means the rule applies to all user agents. It's also possible to address specific robots by their user-agent names and exclude them from particular paths without affecting other robots.
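One way to see how a cooperative robot might interpret such rules is Python's standard urllib.robotparser module. The sketch below is illustrative only: the robot name BadBot, the example.com addresses, and the helper build_parser are assumptions, and the rules are parsed from a string rather than fetched from a live site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one robot is shut out entirely, and every
# other robot is kept out of /cgi-bin/ only.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content supplied as a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = build_parser(ROBOTS_TXT)

# Googlebot has no section of its own, so the "*" rules apply to it.
print(rules.can_fetch("Googlebot", "http://www.example.com/cgi-bin/search"))  # False
print(rules.can_fetch("Googlebot", "http://www.example.com/index.html"))      # True

# BadBot matches its own section and is excluded from the whole site.
print(rules.can_fetch("BadBot", "http://www.example.com/index.html"))         # False
```

Nothing in this check prevents a robot from requesting a disallowed page anyway; it only tells a robot that chooses to ask.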

Related effects can be achieved inside individual HTML pages with the robots meta tag, using values such as "noindex" and "nofollow".
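As a rough illustration of how a robot could read those values from a page, the sketch below uses Python's standard html.parser module. The class RobotsMetaParser, the function robots_directives, and the sample page are assumptions made for this example, not part of any standard.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the values of <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None when written without a value.
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html: str) -> set:
    """Return the set of robots meta values found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


# Hypothetical page that asks robots neither to index it nor to follow its links.
PAGE = (
    '<html><head><meta name="robots" content="noindex, nofollow"></head>'
    "<body></body></html>"
)
print(sorted(robots_directives(PAGE)))  # ['nofollow', 'noindex']
```

As with robots.txt, these values are requests that only cooperative robots honor.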

References

Robots Exclusion Standard (Wikipedia): http://en.wikipedia.org/wiki/Robots_exclusion_standard
Standards document (actually a non-binding consensus, not a formal standard): http://www.robotstxt.org/wc/norobots.html
Robots.txt generator/tutorial: http://www.mcanerin.com/EN/search-engine/robots-txt.asp
