3.2 Measuring Overall Bias

Based on the bias score for each file, we propose the ΔP(r) favorability to evaluate the degree to which a specific robot is favored or disfavored across a set of robots.txt files. Let N = |F| be the total number of robots.txt files in the dataset. The ΔP(r) favorability of a robot r is defined as:

ΔP(r) = P_favor(r) − P_disfavor(r) = (N_favor(r) − N_disfavor(r)) / N

where N_favor(r) and N_disfavor(r) are the numbers of robots.txt files in which robot r is favored and disfavored, respectively. P_favor(r) = N_favor(r) / N is the proportion of the robots.txt files in which robot r is favored; P_disfavor(r) = N_disfavor(r) / N is the proportion in which robot r is disfavored.
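As a minimal sketch of this definition, the snippet below computes P_favor(r), P_disfavor(r) and ΔP(r) for a single robot. It assumes each robots.txt file has already been reduced to one of three outcomes for that robot; the string labels "favor", "disfavor" and "none", the function name delta_p, and the toy data are all hypothetical and are not taken from the dataset.

```python
from collections import Counter


def delta_p(labels):
    """Return (P_favor, P_disfavor, delta_P) for one robot.

    labels holds one entry per robots.txt file: "favor", "disfavor" or
    "none" (a hypothetical encoding of the per-file bias outcome).
    """
    n = len(labels)  # N = |F|, the number of robots.txt files
    if n == 0:
        raise ValueError("need at least one robots.txt file")
    counts = Counter(labels)
    p_favor = counts["favor"] / n        # N_favor(r) / N
    p_disfavor = counts["disfavor"] / n  # N_disfavor(r) / N
    return p_favor, p_disfavor, p_favor - p_disfavor


# Toy data: 10 files, of which 2 favor the robot, 3 disfavor it, 5 show no bias.
toy = ["favor"] * 2 + ["disfavor"] * 3 + ["none"] * 5
p_f, p_d, dp = delta_p(toy)
print(f"P_favor={p_f:.2f}  P_disfavor={p_d:.2f}  delta_P={dp:+.2f}")
```

In the toy data the robot is favored in 2 of 10 files and disfavored in 3, so ΔP is negative even though P_favor is nonzero.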

The proportions of robots.txt files that favor or disfavor a specific robot are simple survey statistics. In our dataset, however, neither proportion in isolation accurately reflects the overall bias, because each file yields one of three outcomes (favor, disfavor, or no bias) rather than two. Consequently, P_favor(r) + P_disfavor(r) is generally less than 1, so one proportion cannot be inferred from the other and each captures only one aspect of the bias. For example, a robot named “ia archiver” is favored by 0.24% of the websites in our dataset, while “momspider” is favored by 0.21%. In contrast, the proportions of sites that disfavor “ia archiver” and “momspider” are 1.9% and 0%, respectively. If we considered only the favored proportion, we would conclude that “ia archiver” is more favored than “momspider”, even though it is disfavored far more often.

ΔP(r) is the difference between the proportions of sites that favor and disfavor a specific robot, and thus accounts for both cases at once. For the above example, ΔP(ia archiver) is −1.66% and ΔP(momspider) is 0.21%, so “momspider” is more favored than “ia archiver”. For any no-bias robot r, ΔP(r) is 0. The measure thus eliminates such misleading cases while remaining intuitive: favored robots have positive values and disfavored robots have negative values.
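The arithmetic in this example can be checked with a short sketch. It uses only the four percentages quoted above; the dictionary layout and variable names are illustrative assumptions, and values are rounded to two decimals to avoid floating-point noise.

```python
# (favored %, disfavored %) for the two robots, as quoted in the text.
proportions = {
    "ia archiver": (0.24, 1.9),
    "momspider": (0.21, 0.0),
}

# delta_P(r) = P_favor(r) - P_disfavor(r), in percentage points.
delta = {robot: round(fav - dis, 2) for robot, (fav, dis) in proportions.items()}

# Ranking by the favored proportion alone versus ranking by delta_P.
by_favor = sorted(proportions, key=lambda r: proportions[r][0], reverse=True)
by_delta = sorted(delta, key=delta.get, reverse=True)

print(delta)     # per-robot delta_P in percentage points
print(by_favor)  # order when only P_favor is considered
print(by_delta)  # order when favor and disfavor are combined
```

The two rankings disagree: the favored proportion alone puts “ia archiver” first, whereas ΔP puts “momspider” first.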