|
We define a favored robot as a robot that is allowed to access
more directories than the universal robot according to
the robots.txt file of the website. The universal robot is any
robot that does not match any of the specific User-Agent
names in the robots.txt file. In other words, the universal
robot represents all the robots that do not appear by name in
the robots.txt file.

3 http://botseer.ist.psu.edu/robots.txt

Let $F$ be the set of robots.txt files in our dataset. Given
a robots.txt file $f \in F$, let $R$ denote the set of named robots
in $f$. For each named robot $r \in R$,
we define the GetBias($r$, $f$) algorithm as specified in Algorithm
1. GetBias measures the degree to which a named
robot $r$ is favored or disfavored in a given robots.txt file $f$.
Let $DIR$ be the set of all directories that appear in the
robots.txt file $f$ of a specific website. $DIR$ is used as
an estimate of the actual directory structure of the website,
because the Robot Exclusion Protocol treats any
directory in the website that does not match a directory
listed in robots.txt as allowed by default.
$D_u \subseteq DIR$ is the set of directories that the universal robot
"*" is allowed to visit. If no rules are specified for
User-Agent: *, the universal robot can access everything
by default. $D_r \subseteq DIR$ is the set of directories that
a given robot $r$ is allowed to visit. $|D_u|$ and $|D_r|$ are the
numbers of directories in $D_u$ and $D_r$, respectively.
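As a rough illustration of these definitions, the following Python sketch derives $DIR$, $D_u$ and $D_r$ from a pre-parsed robots.txt file. It is not the paper's implementation: the `Rules` representation, the function names and the sample file are hypothetical, and the matching is deliberately simplified (only Disallow prefixes are considered; Allow directives, wildcards and case-insensitive User-Agent matching are ignored).

```python
from typing import Dict, List, Set

# Hypothetical pre-parsed robots.txt: User-Agent name -> Disallow path prefixes.
Rules = Dict[str, List[str]]


def all_directories(rules: Rules) -> Set[str]:
    """DIR: every non-empty path that appears in any rule of the file."""
    return {path for paths in rules.values() for path in paths if path}


def allowed_directories(rules: Rules, robot: str) -> Set[str]:
    """Directories in DIR that `robot` may visit under the simplified matching.

    A robot without its own section falls back to the "*" section; if there
    is no "*" section either, everything is allowed by default.
    """
    disallowed = rules.get(robot, rules.get("*", []))
    return {
        d
        for d in all_directories(rules)
        # An empty Disallow value blocks nothing, so empty prefixes are skipped.
        if not any(p and d.startswith(p) for p in disallowed)
    }


# Made-up example file, not taken from the dataset.
rules: Rules = {
    "*": ["/private/", "/tmp/"],
    "goodbot": ["/tmp/"],
    "badbot": ["/"],
}

d_u = allowed_directories(rules, "*")
d_goodbot = allowed_directories(rules, "goodbot")
d_badbot = allowed_directories(rules, "badbot")
```

In this made-up file, `goodbot` is allowed into one more directory than the universal robot, while `badbot` is allowed into none.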
For a given robot $r$, the algorithm first counts how many
directories in $DIR$ are allowed for $r$. It then calculates the
bias score of $r$ as the difference between the number
of directories in $DIR$ that are allowed for $r$ and
the number of directories that are allowed for the universal
robot, that is, $|D_r| - |D_u|$. In the GetBias algorithm, the bias of the universal
robot is treated as the reference point 0 (GetBias returns 0).
The bias scores that GetBias returns for favored robots are
positive; a higher score means that the robot is
more favored. Conversely, the bias scores that GetBias returns
for disfavored robots are negative, which is
consistent with our bias definition. Thus, the bias of a robot
in a robots.txt file can be represented by a categorical variable
with three categories: favored, disfavored, and no bias.
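The scoring and categorization described above can be sketched in Python as follows. This is a minimal sketch based on the prose description, not a transcription of Algorithm 1; the function names `get_bias` and `bias_category` and the sample directories are assumptions made for illustration.

```python
from typing import Set


def get_bias(dirs: Set[str], allowed_r: Set[str], allowed_u: Set[str]) -> int:
    """Bias score of a robot r: |D_r| - |D_u|, counted over the directories in DIR."""
    count_r = sum(1 for d in dirs if d in allowed_r)  # directories allowed for r
    count_u = sum(1 for d in dirs if d in allowed_u)  # directories allowed for "*"
    return count_r - count_u


def bias_category(score: int) -> str:
    """Map a bias score to one of the three categories."""
    if score > 0:
        return "favored"
    if score < 0:
        return "disfavored"
    return "no bias"


# Hypothetical directories, for illustration only.
dirs = {"/a/", "/b/", "/c/"}
universal = {"/a/"}

favored_score = get_bias(dirs, {"/a/", "/b/"}, universal)
disfavored_score = get_bias(dirs, set(), universal)
neutral_score = get_bias(dirs, universal, universal)
```

Passing the universal robot's own set as `allowed_r` yields 0, which matches its role as the reference point.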
As an example, consider the robots.txt file of
http://BotSeer.ist.psu.edu from Section 2: $DIR$ =
{/robots/, /src/, /botseer, /uastring, /srcseer,
/robotstxtanalysis, /whois}. According to the algorithm,
we have $D_u = \emptyset$, $D_{\mathrm{botseer}}$ = {/robots/, /src/,
/botseer, /uastring, /srcseer, /robotstxtanalysis,
/whois} and $D_{\mathrm{googlebot}}$ = {/robots/, /src/, /botseer,
/srcseer, /whois}. Thus, $|D_u| = 0$, $|D_{\mathrm{botseer}}| = 7$,
and $|D_{\mathrm{googlebot}}| = 5$. According to Algorithm 1, $bias_u =
|D_u| - |D_u| = 0$, $bias_{\mathrm{botseer}} = |D_{\mathrm{botseer}}| - |D_u| = 7$
and $bias_{\mathrm{googlebot}} = |D_{\mathrm{googlebot}}| - |D_u| = 5$. Thus, the robots
googlebot and botseer are favored by this website, and
they are categorized as favored. All other robots are
categorized as no bias.
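The arithmetic of this example can be checked with a few lines of Python. The directory sets are copied from the example; the variable names and the dictionary layout are only an illustrative choice.

```python
# Directory sets as listed in the example.
DIR = {"/robots/", "/src/", "/botseer", "/uastring", "/srcseer",
       "/robotstxtanalysis", "/whois"}

allowed = {
    "*": set(),                # universal robot: no directory in DIR is allowed
    "botseer": set(DIR),       # all seven directories
    "googlebot": {"/robots/", "/src/", "/botseer", "/srcseer", "/whois"},
}

# Bias of each robot relative to the universal robot: |D_r| - |D_u|.
bias = {robot: len(dirs) - len(allowed["*"]) for robot, dirs in allowed.items()}

print(bias)  # {'*': 0, 'botseer': 7, 'googlebot': 5}
```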
|