3.1 The GetBias Algorithm

We define a favored robot as a robot that is allowed to access more directories than the universal robot according to the robots.txt file of the website. The universal robot is any robot that does not match any of the specific User-Agent names in the robots.txt file. In other words, the universal robot represents all robots that do not appear by name in the robots.txt file.

Let F be the set of robots.txt files in our dataset. Given a robots.txt file f ∈ F, let R denote the set of robots named in f. For each named robot r ∈ R, we define the GetBias(r, f) algorithm as specified in Algorithm 1. GetBias measures the degree to which a named robot r is favored or disfavored in a given robots.txt file f.

Let DIR be the set of all directories that appear in the robots.txt file f of a specific website. DIR serves as an estimate of the website's actual directory structure, because the Robot Exclusion Protocol treats any directory that does not match a directory in the robots.txt file as allowed by default. D_u ⊆ DIR is the set of directories that the universal robot “*” is allowed to visit. If no rules are specified for User-Agent: *, the universal robot can access everything by default. D_r ⊆ DIR is the set of directories that a given robot r is allowed to visit. |D_u| and |D_r| are the numbers of directories in D_u and D_r, respectively.
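The construction of DIR, D_u, and D_r can be sketched as follows. This is a minimal illustration rather than the implementation used in this work. It assumes that only User-agent, Allow, and Disallow lines matter, that agent names are compared exactly (case-insensitively), and that the longest matching path prefix decides whether a directory is allowed, with Allow winning ties. The function names, the robot "examplebot", and the EXAMPLE file are hypothetical.

```python
def parse_groups(text):
    """Map each lowercased User-agent name to its list of (kind, path) rules."""
    groups = {}
    current = []       # agents named by the group currently being read
    in_rules = False   # True once the current group has seen a rule line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                      # a rule line closed the last group
                current, in_rules = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow") and current:
            in_rules = True
            if value:                         # an empty Disallow restricts nothing
                for agent in current:
                    groups[agent].append((field, value))
    return groups


def is_allowed(rules, path):
    """Assumed matching policy: longest prefix wins, Allow wins ties."""
    best_len, verdict = -1, True              # no matching rule means allowed
    for kind, prefix in rules:
        if path.startswith(prefix):
            longer = len(prefix) > best_len
            tie_allow = len(prefix) == best_len and kind == "allow"
            if longer or tie_allow:
                best_len, verdict = len(prefix), kind == "allow"
    return verdict


def directory_sets(text, robot):
    """Return (DIR, D_u, D_r) for one robots.txt file and one named robot."""
    groups = parse_groups(text)
    dir_set = {path for rules in groups.values() for _, path in rules}
    universal = groups.get("*", [])           # no "*" group: everything allowed
    named = groups.get(robot.lower(), universal)
    d_u = {d for d in dir_set if is_allowed(universal, d)}
    d_r = {d for d in dir_set if is_allowed(named, d)}
    return dir_set, d_u, d_r


# Hypothetical robots.txt, used only to exercise the sketch.
EXAMPLE = """
User-agent: *
Disallow: /private/
Disallow: /tmp/

User-agent: examplebot
Disallow: /tmp/
"""

dir_set, d_u, d_r = directory_sets(EXAMPLE, "examplebot")
print(sorted(dir_set), sorted(d_u), sorted(d_r))
```

In this hypothetical file the universal robot is barred from both listed directories, while "examplebot" may still visit "/private/", so D_u is empty and D_r contains one directory.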

For a given robot r, the algorithm first counts how many directories in DIR are allowed for r. It then calculates the bias score of r as the difference between the number of directories in DIR that are allowed for r and the number of directories that are allowed for the universal robot. In the GetBias algorithm, the bias of the universal robot is treated as the reference point 0 (GetBias returns 0). The bias scores that GetBias returns for favored robots are positive, and a higher score means that the robot is more favored. Conversely, the bias scores that GetBias returns for disfavored robots are negative, which is consistent with our bias definition. Thus, the bias of a robot in a robots.txt file can be represented by a categorical variable with three categories: favored, disfavored, and no bias.

As an example, consider the robots.txt file in http://BotSeer.ist.psu.edu from Section 2: DIR = {“/robots/”, “/src/”, “/botseer”, “/uastring”, “/srcseer”, “/robotstxtanalysis”, “/whois”}. According to the algorithm, we have D_u = ∅, D_botseer = {“/robots/”, “/src/”, “/botseer”, “/uastring”, “/srcseer”, “/robotstxtanalysis”, “/whois”}, and D_googlebot = {“/robots/”, “/src/”, “/botseer”, “/srcseer”, “/whois”}. Thus, |D_u| = 0, |D_botseer| = 7, and |D_googlebot| = 5. According to Algorithm 1, bias_u = |D_u| − |D_u| = 0, bias_botseer = |D_botseer| − |D_u| = 7, and bias_googlebot = |D_googlebot| − |D_u| = 5. Thus, the robots “googlebot” and “botseer” are favored by this website, and they are categorized as favored. All other robots are categorized as no bias.
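The scoring step and the example above can be reproduced with a short sketch. It follows the prose description of GetBias, not the Algorithm 1 listing itself, and takes the directory sets as given instead of parsing a robots.txt file. The function names get_bias and bias_category are hypothetical.

```python
def get_bias(dir_set, d_r, d_u):
    """Bias score |D_r| - |D_u|, counting only directories that are in DIR."""
    return len(d_r & dir_set) - len(d_u & dir_set)


def bias_category(score):
    """Map a bias score to one of the three categories used in the text."""
    if score > 0:
        return "favored"
    if score < 0:
        return "disfavored"
    return "no bias"


# Directory sets copied from the BotSeer example in the text.
DIR = {"/robots/", "/src/", "/botseer", "/uastring",
       "/srcseer", "/robotstxtanalysis", "/whois"}
D_u = set()                       # the universal robot is allowed none of DIR
D_botseer = set(DIR)              # "botseer" is allowed every directory in DIR
D_googlebot = {"/robots/", "/src/", "/botseer", "/srcseer", "/whois"}

for name, d_r in [("*", D_u), ("botseer", D_botseer), ("googlebot", D_googlebot)]:
    score = get_bias(DIR, d_r, D_u)
    print(name, score, bias_category(score))
```

Running the sketch gives scores of 0, 7, and 5 for the universal robot, "botseer", and "googlebot", matching the values derived in the example.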