2. Robots Exclusion Protocol

The Robots Exclusion Protocol (http://www.robotstxt.org/wc/norobots.html) is a convention that allows website administrators to indicate to visiting robots which parts of their site should not be visited. If a website has no robots.txt file, robots are free to crawl all of its content.

The format of the Robots Exclusion Protocol is described in [12]. A file named “robots.txt” with Internet Media Type “text/plain” is placed under the root directory of a Web server. Each line in the robots.txt file has the format <field>:<optionalspace><value><optionalspace>. Three case-insensitive tags are available for <field> to specify the rules: User-Agent, Allow and Disallow. An unofficial directive, Crawl-Delay, is also used by many websites to limit the frequency of robot visits.
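The line format can be illustrated with a minimal Python sketch. The names `KNOWN_FIELDS`, `LINE_RE` and `parse_line` are hypothetical, and the regular expression is only one possible reading of the <field>:<optionalspace><value><optionalspace> pattern, not a reference implementation.

```python
import re

# Tags named in the text; Crawl-Delay is the unofficial extension.
KNOWN_FIELDS = {"user-agent", "allow", "disallow", "crawl-delay"}

# <field>:<optionalspace><value><optionalspace>
LINE_RE = re.compile(r"^\s*([^:\s]+)\s*:\s*(.*?)\s*$")


def parse_line(line):
    """Return (field, value) for a recognised rule line, else None."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    field = match.group(1).lower()  # tags are case-insensitive
    if field not in KNOWN_FIELDS:
        return None
    return field, match.group(2)


print(parse_line("Disallow: /src/ "))
print(parse_line("USER-AGENT:*"))
```

Lower-casing the field before the lookup is what makes the tags case-insensitive; the value is left untouched because paths are case-sensitive on many servers.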

The robots.txt file starts with one or more User-Agent fields, specifying which robots the rules apply to, followed by a number of Disallow and/or Allow fields giving the actual rules that regulate the robot. Comments are allowed anywhere in the file. A comment consists of optional whitespace followed by the comment character ‘#’ and is terminated by the line break.
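The record structure described above can be sketched as follows. The function name `parse_records` is hypothetical, and the sketch assumes that a new record begins at the first User-Agent line that follows one or more rules; real parsers may also treat blank lines as record separators.

```python
def parse_records(text):
    """Group robots.txt lines into (agents, rules) records."""
    records = []
    agents, rules = [], []
    for raw in text.splitlines():
        # A comment runs from '#' to the end of the line.
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue  # blank line, comment-only line, or malformed line
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if rules:
                # A User-Agent after rules closes the current record.
                records.append((agents, rules))
                agents, rules = [], []
            agents.append(value)
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        records.append((agents, rules))
    return records
```

Each record pairs the list of robot names with the ordered list of rules that apply to them, so several consecutive User-Agent lines share one rule set.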

A sample robots.txt is listed below (this robots.txt file is from BotSeer3):

It shows that Googlebot cannot visit “/robotstxtanalysis” and “/uastring”. BotSeer can visit any directory and file on the server. All the other robots should follow the rules under “User-Agent: *” and cannot visit the directories and files matching “/robots/”, “/src/”, “/botseer”, “/uastring”, “/srcseer”, “/robotstxtanalysis”, “/whois”.
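This interpretation can be checked with Python's standard `urllib.robotparser` module. The file contents in `SAMPLE` are an assumption: they are reconstructed from the description above, and the actual BotSeer file may differ in ordering and detail. The helper `allowed` is a hypothetical name.

```python
from urllib.robotparser import RobotFileParser

# Assumed reconstruction of the sample file from the prose description;
# the real BotSeer robots.txt may differ.
SAMPLE = """\
User-Agent: Googlebot
Disallow: /robotstxtanalysis
Disallow: /uastring

User-Agent: BotSeer
Disallow:

User-Agent: *
Disallow: /robots/
Disallow: /src/
Disallow: /botseer
Disallow: /uastring
Disallow: /srcseer
Disallow: /robotstxtanalysis
Disallow: /whois
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())


def allowed(agent, path):
    """Return True if the named robot may fetch the given path."""
    return parser.can_fetch(agent, path)


print(allowed("Googlebot", "/uastring"))    # False
print(allowed("BotSeer", "/uastring"))      # True
print(allowed("SomeOtherBot", "/src/a.c"))  # False
```

An empty Disallow value, as in the assumed BotSeer record, means that nothing is disallowed for that robot.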