1.1 Related Work

A 1999 study of the usage of robots.txt [10] in UK universities and colleges investigated 163 websites and 53 robots.txt files. The files were examined in terms of file size and the use of the Robots Exclusion Protocol within the UK university domains. In 2002, Drott [7] studied the usage of robots.txt as an aid for indexing to protect information. Sixty samples from Fortune Global 500 company websites were manually examined in that work, which concluded that “robots.txt files are not widely used by the sampled group and for most of the sites on which they appear, they are redundant. ...they exclude robots from directories which are locked anyway.” Our investigation shows a contrary result, which may be due to differences in sample size, domain, and time. Other work addresses the legal aspects of obeying robots.txt [2, 8], and an overview of Web robots and robots.txt usage is given in [5].

None of the aforementioned work investigates the content of robots.txt in terms of biases towards different robots. In addition, the sample sizes of previous studies have tended to be relatively small considering the size of the Web. In this paper, we present the first quantitative study of such biases and conduct a more comprehensive survey of robots.txt usage on the Web. By implementing our own specialized “robots.txt” crawler, we collect real-world data from a considerable number of unique websites with different functionalities, covering the domains of education, government, news, and business. We investigate the following questions:

  • Does a robot bias exist?
  • How should such a bias be measured quantitatively?
  • What are the implications of such a bias?

Our contributions are:

  • We propose a quantitative metric to automatically measure robot biases.
  • By applying the metric to a large sample of websites, we present our findings about the most favored and disfavored robots.

The rest of the paper is organized as follows. In Section 2 we briefly introduce the Robots Exclusion Protocol. In Section 3 we propose a bias metric and demonstrate how it is applied to measure the degree of robot bias. In Section 4 we present our data collection for this study. In Section 5 we present our observations on robots.txt usage and discuss the implications. In Section 6 we conclude the paper with plans for future work.