|
A 1999 study of robots.txt usage [10] in UK universities
and colleges investigated 163 websites and 53
robots.txt files. These files were examined in terms of file
size and the use of the Robots Exclusion Protocol within the
UK university domains. In 2002, Drott [7] studied the usage
of robots.txt as an aid for indexing to protect information.
Sixty samples from Fortune Global 500 company websites
were manually examined in this work, which concluded that
"robots.txt files are not widely used by the sampled group
and for most of the sites on which they appear, they are redundant.
...they exclude robots from directories which are
locked anyway." Our investigation shows a contrary result,
which may be due to differences in sample size, domain
and time. Other work addresses the legal aspects of obeying
robots.txt [2, 8], and an overview of Web robots and
robots.txt usage is given in [5].
None of the aforementioned work investigates the content
of robots.txt in terms of biases towards different robots.
In addition, the sample sizes of previous studies have tended
to be relatively small considering the size of the Web. In this
paper, we present the first quantitative study of such biases,
and conduct a more comprehensive survey of robots.txt usage
on the Web. By implementing our own specialized
robots.txt crawler, we collect real-world data from a considerable
number of unique websites with different functionalities,
covering the domains of education, government,
news, and business. We investigate the following questions:
Does a robot bias exist?
How should such a bias be measured quantitatively?
What are the implications of such a bias?
Our contributions are:
We propose a quantitative metric to automatically measure
robot biases.
By applying the metric to a large sample of websites,
we present our findings about the most favored and disfavored
robots.
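To make the notion of a robot bias concrete, the following minimal sketch shows how a single robots.txt file can grant different access to different robots. It is not the crawler or the bias metric of this paper (the metric is defined in Section 3). The robots.txt content, the robot names AlphaBot, BetaBot, and GammaBot, and the helper access_by_robot are hypothetical and were invented for illustration. Parsing relies on Python's standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: AlphaBot
Disallow:

User-agent: BetaBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def access_by_robot(robots_txt, robot_names, path):
    """Return {robot name: True/False} for whether each robot may fetch `path`.

    Hypothetical helper: it only reports per-robot access and does not
    compute any bias score.
    """
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return {name: parser.can_fetch(name, path) for name in robot_names}


robots = ["AlphaBot", "BetaBot", "GammaBot"]
private_access = access_by_robot(SAMPLE_ROBOTS_TXT, robots, "/private/data.html")
public_access = access_by_robot(SAMPLE_ROBOTS_TXT, robots, "/index.html")

for name in robots:
    print(name, "private:", private_access[name], "public:", public_access[name])
```

In this made-up file, AlphaBot is allowed everywhere, BetaBot is excluded from the whole site, and any other robot (here GammaBot) falls under the default rule. Such unequal treatment of named robots is the kind of difference a bias metric would need to quantify.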
The rest of the paper is organized as follows. In Section
2 we briefly introduce the Robots Exclusion Protocol. In
Section 3 we propose a bias metric and demonstrate how
it is applied to measure the degree of robot bias. In Section
4 we present our data collection for this study. In Section 5
we present our observations on robots.txt usage and discuss
the implications. In Section 6 we conclude our paper with
plans for future work.
|