|
Based on the bias score for each file, we propose the ΔP(r)
favorability measure to evaluate the degree to which a specific
robot is favored or disfavored across a set of robots.txt files.
Let N = |F| be the total number of robots.txt files in the
dataset. The ΔP(r) favorability of a robot r is defined
as:
\[
\Delta P(r) = P_{favor}(r) - P_{disfavor}(r) = \frac{N_{favor}(r) - N_{disfavor}(r)}{N}
\]
where N_favor(r) and N_disfavor(r) are the numbers
of times robot r is favored and disfavored, respectively.
P_favor(r) is the proportion of the robots.txt files in which
a robot r is favored; P_disfavor(r) is the proportion of the
robots.txt files in which a robot r is disfavored.
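As a minimal sketch of the definition above, the following Python function computes ΔP(r) for one robot from per-file bias labels. The label strings ("favor", "disfavor", "none") and the function name are assumptions made for illustration; the text does not prescribe a data format.

```python
from collections import Counter


def delta_p(labels):
    """Return P_favor - P_disfavor for one robot.

    `labels` holds one entry per robots.txt file in the dataset:
    "favor", "disfavor", or "none" (hypothetical label names).
    """
    counts = Counter(labels)
    n = sum(counts.values())  # N = |F|, total number of robots.txt files
    if n == 0:
        raise ValueError("dataset contains no robots.txt files")
    p_favor = counts["favor"] / n        # N_favor(r) / N
    p_disfavor = counts["disfavor"] / n  # N_disfavor(r) / N
    return p_favor - p_disfavor


# One file favors the robot, two disfavor it, one shows no bias.
example = delta_p(["favor", "none", "disfavor", "disfavor"])
```

Files with no bias count toward N but toward neither numerator, which is why the two proportions need not sum to 1.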
The proportions of robots.txt files that favor or disfavor
a specific robot are simple survey statistics; however, in our
dataset neither proportion in isolation accurately reflects
the overall bias in our sample, since there are three possible
events (favor, disfavor, and no bias). This means that
P_favor(r) + P_disfavor(r) < 1, so neither proportion
determines the other.
Each event reflects only one aspect of the bias. For example,
a robot named ia_archiver is favored by 0.24% of the
websites in our dataset, and the proportion of sites that favor
momspider is 0.21%. In contrast, the proportions of
sites that disfavor ia_archiver and momspider are 1.9%
and 0%, respectively. If we consider only the favored proportion,
we reach the conclusion that ia_archiver is more favored
than momspider, even though far more sites disfavor it.
ΔP(r) is the difference between the proportions of sites that
favor and disfavor a specific robot, and thus accounts for both
cases together. For the above example, ΔP(ia_archiver) is
-1.66% and ΔP(momspider) is 0.21%. Thus, momspider
is more favored than ia_archiver. For any no-bias
robot r, ΔP(r) is 0. The bias measure eliminates such
misleading cases while remaining intuitive (favored
robots have positive values and disfavored robots
have negative values).
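The example can be checked with a short script. The sketch below uses only the four percentages quoted in the text; the dictionary layout and variable names are assumptions for illustration. It ranks the two robots first by favored proportion alone and then by ΔP, showing that the order reverses.

```python
# Proportions in percent, taken from the example in the text.
robots = {
    "ia_archiver": {"p_favor": 0.24, "p_disfavor": 1.9},
    "momspider": {"p_favor": 0.21, "p_disfavor": 0.0},
}


def delta_p(p_favor, p_disfavor):
    """Difference between the favored and disfavored proportions."""
    return p_favor - p_disfavor


# Ranking by the favored proportion alone puts ia_archiver first.
by_favor = sorted(robots, key=lambda r: robots[r]["p_favor"], reverse=True)

# Ranking by Delta P puts momspider first.
by_delta = sorted(robots, key=lambda r: delta_p(**robots[r]), reverse=True)

for name in by_delta:
    print(f"{name}: {delta_p(**robots[name]):+.2f}%")
```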
|