1. Introduction

Without robots, there would probably be no search engines.
Web search engines, digital libraries, and many other
web applications such as offline browsers, internet marketing
software and intelligent searching agents heavily depend
on robots to acquire documents. Robots, also called
“spiders”, “crawlers”, “bots” or “harvesters”, are autonomous
agents that navigate the hyperlinks of the Web around the clock,
harvesting topical resources at no cost in human management
[3, 4, 14]. Because robots are highly automated, rules must be made
to regulate their crawling activities in order to prevent undesired
impact on server workload or access to non-public information.

(A greatly abbreviated version of this paper appeared as a poster in the
Proceedings of the 16th International World Wide Web Conference, 2007.)

The Robots Exclusion Protocol has been proposed [12]
to provide advisory regulations for robots to follow. A file
called robots.txt, which contains robot access policies, is
deployed at the root directory of a website and accessible
to all robots. Ethical robots read this file and obey the rules
during their visit to the website. The robots.txt convention
has been adopted by the community since the late 1990s,
and has continued to serve as one of the predominant means
of robot regulation. However, despite the importance of the
robots.txt convention for both content providers and harvesters,
little work has been done to investigate its usage
in detail, especially at the scale of the Web.
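
As a minimal sketch of how an ethical robot might consult robots.txt before fetching a page, the example below uses the `urllib.robotparser` module from the Python standard library. The robot name `ExampleBot`, the host `www.example.com`, and the rules themselves are hypothetical and chosen only for illustration; a real robot would download the file from the root of the site it is visiting.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as it might be served from
# http://www.example.com/robots.txt (illustrative only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, user_agent: str, urls: list[str]) -> list[str]:
    """Return only the URLs that the given robot is permitted to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    parser = make_parser(ROBOTS_TXT)
    candidates = [
        "http://www.example.com/index.html",
        "http://www.example.com/private/report.html",
    ]
    # An ethical robot fetches only what the policy allows.
    print(allowed_urls(parser, "ExampleBot", candidates))
```

Because the protocol is advisory, nothing in this check is enforced by the server; compliance depends entirely on the robot performing it.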

More importantly, websites may favor or disfavor certain
robots by assigning them different access policies. This
bias can lead to a “rich get richer” situation whereby
some popular search engines are granted exclusive access to
certain resources, which in turn could make them even more
popular. Because users often prefer a search
engine with broad (if not exhaustive) information coverage,
this “rich get richer” phenomenon may strongly influence
users’ choice of search engines, which will
eventually be reflected in search engine market share.
On the other hand, since it is often believed (although this
is an exaggeration) that “what is not searchable does not
exist,” this phenomenon may also introduce a biased view
of the information on the Web.
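
To make the notion of favoring a robot concrete, the following sketch counts how many of a few sample paths each robot may fetch under a single robots.txt file. The robot names `FavoredBot` and `OtherBot`, the rules, and the sample paths are all hypothetical, and the simple count is only an illustration, not a bias measure taken from this paper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: one named robot may fetch everything,
# while every other robot is kept out of two directories.
BIASED_ROBOTS_TXT = """\
User-agent: FavoredBot
Disallow:

User-agent: *
Disallow: /archive/
Disallow: /images/
"""

# Hypothetical sample of paths on the site.
SAMPLE_PATHS = ["/index.html", "/archive/2006.html", "/images/logo.png"]


def access_by_robot(robots_txt: str, user_agents: list[str], paths: list[str]) -> dict[str, int]:
    """Count, for each robot, how many of the paths it is allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        agent: sum(1 for path in paths if parser.can_fetch(agent, path))
        for agent in user_agents
    }


if __name__ == "__main__":
    counts = access_by_robot(BIASED_ROBOTS_TXT, ["FavoredBot", "OtherBot"], SAMPLE_PATHS)
    # FavoredBot may fetch all three paths; OtherBot only /index.html.
    print(counts)
```

A gap between the counts for two robots is one simple sign that a site treats them differently.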