|
The Robots Exclusion Protocol² is a convention that allows
website administrators to indicate to visiting robots
which parts of their site should not be visited. If there is
no robots.txt file on a website, robots are free to crawl all
content.
² http://www.robotstxt.org/wc/norobots.html
The format of the Robots Exclusion Protocol is described
in [12]. A file named robots.txt with Internet Media
Type text/plain is placed under the root directory of a
Web server. Each line in the robots.txt file has the format:
<field>:<optionalspace><value><optionalspace>.
There are three case-insensitive
tags for the <field> to specify the rules: User-Agent,
Allow and Disallow. Another, unofficial directive, Crawl-Delay,
is also used by many websites to limit the frequency
of robot visits.
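The line format above can be illustrated with a minimal sketch. The names `parse_line`, `KNOWN_FIELDS` and `LINE_RE` are hypothetical and do not come from the protocol description; the sketch assumes that lines with an unrecognised field are simply ignored.

```python
import re

# Fields named in the text, plus the unofficial Crawl-Delay directive.
# Stored in lowercase because field names are case-insensitive.
KNOWN_FIELDS = {"user-agent", "allow", "disallow", "crawl-delay"}

# <field>:<optionalspace><value><optionalspace>
LINE_RE = re.compile(r"^\s*([^:\s]+)\s*:\s*(.*?)\s*$")


def parse_line(line):
    """Return (field, value) with the field lowercased, or None.

    None is returned for lines that do not match the format or whose
    field is not one of KNOWN_FIELDS (an assumption of this sketch).
    """
    match = LINE_RE.match(line)
    if match is None:
        return None
    field = match.group(1).lower()
    value = match.group(2)
    if field not in KNOWN_FIELDS:
        return None
    return field, value
```

For instance, `parse_line("USER-AGENT:Googlebot  ")` and `parse_line("user-agent: Googlebot")` yield the same pair, which reflects the case-insensitivity of the field name and the optional whitespace around the value.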
The robots.txt file starts with one or more User-Agent
fields, specifying which robots the rules apply to, followed
by a number of Disallow: and/or Allow: fields indicating
the actual rules to regulate the robot. Comments are
allowed anywhere in the file. A comment consists of optional
whitespace followed by the comment character #, and is
terminated by the line break.
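The structure described above (User-Agent fields followed by Allow/Disallow rules, with # comments) can be sketched as follows. The helper names `strip_comment` and `parse_records` are hypothetical, and the grouping rule is a simplifying assumption: a User-Agent line that appears after at least one rule is taken to start a new group.

```python
def strip_comment(line):
    """Drop everything from the first '#' to the end of the line."""
    return line.split("#", 1)[0].strip()


def parse_records(text):
    """Group a robots.txt body into (agents, rules) pairs.

    agents is a list of User-Agent values; rules is a list of
    (field, value) pairs where field is "allow" or "disallow".
    """
    records = []
    agents, rules = [], []
    for raw in text.splitlines():
        line = strip_comment(raw)
        if ":" not in line:
            continue  # blank, comment-only, or malformed line
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()  # field names are case-insensitive
        if field == "user-agent":
            if rules:
                # A User-Agent line after rules starts a new group.
                records.append((agents, rules))
                agents, rules = [], []
            agents.append(value)
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        records.append((agents, rules))
    return records
```

Rules that appear before any User-Agent line are dropped in this sketch, since there is no robot they could apply to.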
A sample robots.txt is listed below (this robots.txt file is
from BotSeer³):
It shows that Googlebot cannot visit /robotstxtanalysis
and /uastring. BotSeer can visit any directory and
file on the server. All the other robots should follow the
rules under User-Agent: * and cannot visit the directories
and files matching /robots/, /src/, /botseer,
/uastring, /srcseer, /robotstxtanalysis, /whois.
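The rules just described can be checked with Python's standard `urllib.robotparser` module. The file contents below are a reconstruction from the prose description only, not the original BotSeer file; in particular, expressing "BotSeer can visit anything" as an empty Disallow line is an assumption, and `allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Reconstructed from the description in the text (assumption: the real
# file may differ in ordering, comments, and how BotSeer is granted access).
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /robotstxtanalysis
Disallow: /uastring

User-agent: BotSeer
Disallow:

User-agent: *
Disallow: /robots/
Disallow: /src/
Disallow: /botseer
Disallow: /uastring
Disallow: /srcseer
Disallow: /robotstxtanalysis
Disallow: /whois
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if the named robot may fetch the given path."""
    return parser.can_fetch(agent, path)
```

Under these assumptions, `allowed("Googlebot", "/uastring")` is false, `allowed("BotSeer", "/robots/")` is true, and a robot with no record of its own, such as a hypothetical "OtherBot", falls under the User-Agent: * rules.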
|