Table 4. Navigability related properties of audit offices’ websites.

                   Directed                      Normalized              %age of Unreachable Pairs
Country  Diameter  Average   Median    Diameter  Average   Median    Whole Site  Navigable Content
                   Distance  Distance            Distance  Distance
CA       12        4.7       5         2.92      1.14      1.22      53%         50%
CZ       6         3.8       4         2.02      1.28      1.35      65%         29%
NZ       8         3.3       3         2.74      1.13      1.03      48%         10%
UK       22        5.4       5         6.10      1.50      1.39      66%         46%
US       23        5         5         5.36      1.16      1.16      48%         30%

doc files) should be excluded because these are by definition in the OUT component. Another question is how strongly the size of a website influences the size of the SCC. For smaller sites, it is relatively easy to ensure an SCC of 100%, which may explain why CZ and NZ have such large SCCs after filtering. In contrast, this is a much bigger challenge for a site like the audit office of the US, with almost 20,000 pages. It seems clear that some normalization is needed to account for the size of a website. However, we are uncertain what form this normalization should take.

4.1.2 Distance

While the strongly connected component indicates what percentage of the site can be accessed by navigating the link structure, it does not reveal the number of clicks needed to move from one node to another. It is therefore interesting to measure the distance, in number of hyperlinks followed, between two randomly selected nodes of a website.

For the following calculations we establish the longest shortest path between a given node and all other nodes. We do this for every node in the network, which gives, for each node, the longest shortest path a user could follow to get from that node to another node.

A worst-case measure of distance is the directed diameter of the site, which is defined as the longest of all calculated shortest paths. Perhaps a more useful measure is the average distance, which is defined as the average over the longest shortest paths of all nodes. We also quote the median, as this measure is less susceptible to extreme outliers. Table 4 lists these values for each dataset. The path length does not change significantly between the whole site and the navigable content alone, because pairs with no connecting path are ignored in these calculations. The percentage of unreachable pairs, however, can shrink dramatically. It is interesting to observe that the percentage of unreachable pairs is almost equal to the percentage of the OUT component. This indicates that there are almost no paths between pairs of nodes in the OUT component, i.e. the reachable pairs consist almost entirely of nodes in the SCC. This implies that once a user enters the OUT component, there is not only no way back, but also very little chance of navigating to another page.

The average directed distances of the websites in our sample are much lower than those that Albert et al. [35] calculated for the Internet, indicating that there is indeed a higher degree of connectivity within a managed website than for the Internet as a whole. However, as with the SCC, it is clear that a small website will tend to have a smaller diameter and a smaller average distance than a large website. Therefore, in order to compare the values meaningfully, they should be normalized. We decided to apply a normalization factor equal to the logarithm of the size of the website. This is based on the models of [35] and [39], where the diameter grows linearly with the logarithm of the size of the network.

Looking at Table 4, we observe that the normalized average distance for CA and US is now comparable with that of NZ. The normalized average distance for CZ has worsened, while the UK’s normalized average distance remains the worst.

The average distance does not reveal the distribution of longest shortest paths. Figure 1 shows the cumulative distribution of path lengths for each dataset. The asymptotic value of each curve has been normalized to reflect the different percentage of nodes that are unreachable for each dataset.

Figure 1. Cumulative sum of the number of pairs of pages that have a path between them of less than a certain length. The x-axis represents the path length. The y-axis represents the fraction of all possible pairs of pages in the website that are connected by a path of length less than x.

Figure 1 shows that for those nodes that are reachable, the majority can be reached by six or fewer hyperlinks. While this value might seem rather small, Huberman et al. [23-25] found that an average user follows only 4 links within a site. For a path length of four or fewer, we see that (i) for NZ, most of the reachable nodes are accessible, (ii) for CZ, over 50% of reachable nodes are accessible, (iii) for CA and US, somewhat over a third of reachable nodes are accessible, and (iv) for the UK, about a third of reachable nodes are accessible. This is particularly poor for the UK, since not only does it have the lowest accessibility for path lengths of four or fewer, but this is compounded by the fact that the percentage of reachable nodes is also the lowest.
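As an illustration, the distance measures of this section and the cumulative curve of Figure 1 could be computed along the following lines. This is a minimal sketch and not the tooling used for the study: the adjacency-list representation, the function names bfs_distances, distance_measures and fraction_within, and the toy site are all hypothetical. It assumes that pairs of pages are ordered, that pages from which nothing is reachable are left out of the per-node longest shortest paths, and that the normalization uses a base-10 logarithm of the number of pages. The text does not state the base, although the values in Table 4 appear consistent with base 10 for a site of almost 20,000 pages.

```python
from collections import deque
from math import log10
from statistics import mean, median


def bfs_distances(graph, source):
    """Number of clicks needed to reach each page reachable from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        page = queue.popleft()
        for target in graph.get(page, ()):
            if target not in dist:
                dist[target] = dist[page] + 1
                queue.append(target)
    return dist


def distance_measures(graph):
    """Diameter, average and median of the per-node longest shortest paths."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    n = len(pages)
    if n < 2:
        raise ValueError("need at least two pages")
    longest = []       # longest shortest path starting at each page
    path_lengths = []  # shortest path length of every reachable ordered pair
    for page in pages:
        lengths = [d for d in bfs_distances(graph, page).values() if d > 0]
        path_lengths.extend(lengths)
        if lengths:  # assumption: pages that reach nothing are ignored
            longest.append(max(lengths))
    total_pairs = n * (n - 1)
    factor = log10(n)  # assumption: base-10 logarithm of the site size
    return {
        "diameter": max(longest),
        "average": mean(longest),
        "median": median(longest),
        "normalized_diameter": max(longest) / factor,
        "normalized_average": mean(longest) / factor,
        "normalized_median": median(longest) / factor,
        "unreachable": 1 - len(path_lengths) / total_pairs,
        "path_lengths": path_lengths,
        "total_pairs": total_pairs,
    }


def fraction_within(path_lengths, total_pairs, clicks):
    """Fraction of all ordered pairs joined by a path of at most `clicks` links."""
    return sum(1 for d in path_lengths if d <= clicks) / total_pairs


# Hypothetical five-page site; "doc" has no outgoing links (OUT component).
toy_site = {
    "home": ["a", "b"],
    "a": ["c"],
    "b": ["home"],
    "c": ["home", "doc"],
    "doc": [],
}
measures = distance_measures(toy_site)
print(measures["diameter"], measures["average"], measures["median"])
print(fraction_within(measures["path_lengths"], measures["total_pairs"], 4))
```

Evaluating fraction_within for increasing click counts gives one cumulative curve of the kind plotted in Figure 1. Its plateau lies below 1 by exactly the fraction of unreachable pairs. Running one breadth-first search per page is adequate for sites of this size, but it is a design choice of the sketch rather than a statement about how the reported values were obtained.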