Robots Exclusion Standard
Edited by Dan Tobias
Revision as of 12:56, 5 December 2012
The Robots Exclusion Standard is a convention by which webmasters can specify which parts of their site they don't want robots to scan, index, or retrieve. This is done with a plain-text file named robots.txt in the root directory of the site. Well-behaved robots fetch this file before doing anything else on a site, which is why web access logs show requests for robots.txt even when no such file exists. Less well-behaved robots, such as spambots and malware, ignore the file: the standard is voluntary and has no enforcement mechanism, so it is useful only for instructing reasonable robots such as Googlebot.
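Because the file always lives at the root of the site, a robot can work out where to look from any page address. The following is a minimal Python sketch of that step; the function name robots_txt_url and the example.com address are illustrative assumptions, not part of the standard.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that hosts page_url.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); replace the path
    # with /robots.txt and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# example.com is a placeholder host.
print(robots_txt_url("http://www.example.com/cgi-bin/search?q=robots"))
# prints http://www.example.com/robots.txt
```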
To keep robots out of your cgi-bin directory you can use:
    User-agent: *
    Disallow: /cgi-bin/
The asterisk means the rule applies to all user agents. It's also possible to address specific robots by their user-agent names and give them their own rules, without affecting other robots.
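To see how a well-behaved robot might interpret such rules, here is a small sketch using Python's standard urllib.robotparser module. The robot name "BadBot", the host example.com, and the helper allowed() are assumptions made up for this illustration; the rules are parsed from a string rather than fetched from a real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one named robot entirely,
# and keep every other robot out of /cgi-bin/ only.
RULES = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Ask the parser whether the given robot may fetch the given path."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


print(allowed("Googlebot", "/index.html"))      # True
print(allowed("Googlebot", "/cgi-bin/search"))  # False
print(allowed("BadBot", "/index.html"))         # False
```

This only models a robot that chooses to cooperate; nothing in the file stops a robot that never performs the check.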
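There are also meta tags, such as <meta name="robots" content="noindex, nofollow">, that can be placed in an HTML page for related per-page effects: "noindex" asks robots not to index the page, and "nofollow" asks them not to follow its links.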
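As a rough illustration of how a robot could read these tags, the sketch below uses Python's standard html.parser module to collect the directives from a page. The class RobotsMetaParser, the function robots_directives, and the sample page are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        content = attr.get("content") or ""
        for item in content.split(","):
            item = item.strip().lower()
            if item:
                self.directives.add(item)


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Sample page invented for this example.
PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head><body></body></html>'
directives = robots_directives(PAGE)
print("noindex" in directives, "nofollow" in directives)  # True True
```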
References
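- Robots Exclusion Standard (Wikipedia): http://en.wikipedia.org/wiki/Robots_exclusion_standard
- Standards document (actually a non-binding consensus, not a formal standard): http://www.robotstxt.org/wc/norobots.html
- Robots.txt generator/tutorial: http://www.mcanerin.com/EN/search-engine/robots-txt.asp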