|
Search engines largely rely on robots (i.e., crawlers or
spiders) to collect information from the Web. Such crawling
activities can be regulated from the server side by deploying
the Robots Exclusion Protocol in a file called robots.txt.
Ethical robots will follow the rules specified in robots.txt.
Websites can explicitly specify an access preference for each
robot by name. Such biases may lead to a "rich-get-richer"
situation, in which a few popular search engines ultimately
dominate the Web because they have preferred access to resources
that are inaccessible to others. This issue is seldom
addressed, although the robots.txt convention has become a
de facto standard for robot regulation and search engines
have become an indispensable tool for information access.
We propose a metric to evaluate the degree of bias to which
specific robots are subjected. We have investigated 7,593
websites covering education, government, news, and business
domains, and collected 2,925 distinct robots.txt files.
Content and statistical analyses of the data confirm
that the robots of popular search engines and information
portals, such as Google, Yahoo, and MSN, are generally
favored by most of the websites we have sampled. The
results also show a strong correlation between search
engine market share and the bias toward particular search
engine robots.
|
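
To make the per-robot access preferences described above concrete, the following is a minimal sketch of how a robots.txt file can favor one named robot over all others, and how a compliant crawler would interpret it. It uses Python's standard `urllib.robotparser` module. The rules in `ROBOTS_TXT`, the robot name `ExampleBot`, and the helper `allowed_robots` are illustrative assumptions, not data or code from the study, and the sketch does not implement the bias metric itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the robot named "Googlebot" may fetch everything
# (an empty Disallow line means "nothing is disallowed"), while every other
# robot is excluded from the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def allowed_robots(robots_txt, robot_names, path="/"):
    """Return a dict mapping each robot name to whether it may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {name: parser.can_fetch(name, path) for name in robot_names}


# "ExampleBot" is a made-up robot name that falls under the "*" rule.
access = allowed_robots(ROBOTS_TXT, ["Googlebot", "ExampleBot"], "/index.html")
print(access)
```

Under these assumed rules the named robot is granted access while the unnamed one is denied, which is the kind of per-robot asymmetry that a bias measure would need to capture across many sites.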