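| Country | CA | CZ | NZ | UK | US |
|---|---|---|---|---|---|
| Degree: average (std. dev.) | 20.0 (105.0) | 7.7 (18.5) | 19.9 (47.6) | 13.3 (57.3) | 23.8 (277.9) |
| Degree: median | 3 | 2 | 14 | 2 | 10 |
| Closeness centralization: whole site | 0.341 | 0.427 | 0.607 | 0.288 | 0.509 |
| Closeness centralization: navigable content | 0.341 | 0.507 | 0.789 | 0.229 | 0.515 |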


Many links do not necessarily guarantee a small average distance. Although the Web is a sparsely connected graph, it has been shown to exhibit the characteristics of a small world [35][27][37], i.e., most nodes can be reached from a random node with a comparatively small number of clicks. This is possible because links are unevenly distributed across nodes: a small number of nodes have a very high number of links, while a large number of nodes have only a few. The distribution of links follows a power law. For the Internet, Albert et al. [35] calculated a Zipf exponent of 2.1 for the indegree distribution and of 2.72 for the outdegree distribution.

Figure 2 plots the distribution of indegree and outdegree for the websites in our sample. In the log-log plot, a straight line from the upper left corner to the lower right corner indicates a power law distribution. The power-law pattern is more pronounced for larger sites, but even for the small sites the indegree distribution roughly follows a power law. This is not the case for the outdegree distribution.
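The straight-line criterion on a log-log plot can be checked numerically. The following is a minimal sketch, not the procedure used for Figure 2: it fits a least-squares line to log(count) against log(degree) and returns the slope, whose negative is a rough estimate of the power-law exponent. The function name `loglog_slope` and the synthetic degree list are illustrative assumptions.

```python
import math
from collections import Counter


def loglog_slope(degrees):
    """Least-squares slope of log(count) versus log(degree).

    Pages with degree 0 are skipped because log(0) is undefined.
    A slope near -k suggests a power law with exponent k.
    """
    freq = Counter(d for d in degrees if d > 0)
    if len(freq) < 2:
        raise ValueError("need at least two distinct positive degrees")
    xs = [math.log(d) for d in freq]
    ys = [math.log(c) for c in freq.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic (hypothetical) indegree data: count(d) = 1024 / d**2,
# i.e. an exact power law with exponent 2.
synthetic_counts = {1: 1024, 2: 256, 4: 64, 8: 16, 16: 4, 32: 1}
synthetic_degrees = [d for d, c in synthetic_counts.items() for _ in range(c)]
slope = loglog_slope(synthetic_degrees)
```

A simple regression of this kind is only a visual-style diagnostic; it is sensitive to the sparse tail of the distribution, so it should be read as indicative rather than as a rigorous estimate.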

It is important to note that we only consider the distribution of links that are internal to the site, that is, both the link source and the link target are pages within the website of the audit office. In order to meaningfully compare our distribution with distributions found for the Web as a whole, we would have to consider the external links as well.

Figure 2. Distribution of indegree and outdegree for internal links in audit office websites, plotted on a double logarithmic scale.

Although there is considerable variation, the inlink distributions of the five sites follow the general model of the Web, with a few nodes acting as internal authorities. While a few pages receive the majority of links, the websites differ in the degree to which their organization is centralized.

We use the closeness centralization coefficient, denoted C_{d}(G) [17], to compare the centralization of sites. Let G be an undirected graph with n nodes representing our dataset, where V(G) and E(G) are the sets of vertices (pages) and edges (links), respectively. Let d(v,i) be the distance from node v to node i, and let S_{n}^{*} be a star network of n nodes. The closeness centrality of a node v, denoted c_{d}(v), is defined as:

Table 5. Average degree per page (sum of incoming and outgoing links), standard deviation from average and median of degree. Degree closeness centralization C_{d}(G) both for the whole site crawled and for navigable content only.


$$c_{d}(v) = \frac{n-1}{\sum_{i \in V(G)} d(v,i)}$$
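As an illustration of the node-level definition, the sketch below computes closeness centrality as (n - 1) divided by the sum of shortest-path distances from v, using breadth-first search on an undirected graph. The adjacency-dictionary representation, the function name `closeness`, and the five-node star example are assumptions made for this example; the sketch also assumes the graph is connected.

```python
from collections import deque


def closeness(adj, v):
    """Closeness centrality of v: (n - 1) / sum of distances d(v, i).

    adj maps each node to the set of its neighbours (undirected graph).
    Assumes a connected graph with at least two nodes.
    """
    dist = {v: 0}
    queue = deque([v])
    while queue:  # breadth-first search yields shortest distances in clicks
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    n = len(adj)
    if n < 2 or len(dist) != n:
        raise ValueError("graph must be connected and have at least two nodes")
    return (n - 1) / sum(dist.values())


# Hypothetical star network with five nodes: one hub linked to four leaves.
star = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
    "d": {"hub"},
}
hub_score = closeness(star, "hub")   # every leaf is one click away
leaf_score = closeness(star, "a")    # hub at distance 1, other leaves at 2
```

In the star network the hub reaches every other node in one step, so it attains the maximum closeness of 1, while each leaf scores 4/7. This gap between the most central node and the rest is what makes the star a natural reference case for a maximally centralized network.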

For the closeness centralization of a whole network we first calculate the overall variation in closeness centrality scores for all nodes, denoted v_{d}(G):

4.1.3 Average Degree, Degree Distribution and Centralization

The previously discussed measures all rely on the existence of links between documents. One could argue that the more links exist, the more likely it is that there exists a path between two randomly selected pages. In order to measure how densely connected a network is, we can measure the average degree of a node, that is, the average number of links pointing to or from a node in a network. The higher the average degree, the more links exist in a network. As we use the average, this measure can be compared across networks of different size. In the same way, we can measure the indegree and outdegree distributions. However, as we analyze a website and its internal links only, every link going out from one page is received by another page within our graph. Therefore, the average indegree equals the average outdegree. In accordance with our earlier findings, the values in Table 5 underline that sites with a smaller normalized average distance tend to have more links.
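The degree statistics described here can be sketched in a few lines. The example below assumes the crawl is available as a list of (source, target) pairs for internal links; the function name `degree_stats` and the toy link list are hypothetical. Degree is taken as the sum of incoming and outgoing links, as in the caption of Table 5, and the population standard deviation is an assumption, since the text does not say which variant was used.

```python
from collections import Counter
from statistics import mean, median, pstdev


def degree_stats(links):
    """Summarize degrees for an iterable of (source, target) internal links."""
    indeg, outdeg = Counter(), Counter()
    pages = set()
    for src, dst in links:
        outdeg[src] += 1   # one more link leaving src
        indeg[dst] += 1    # the same link arrives at dst
        pages.update((src, dst))
    degree = [indeg[p] + outdeg[p] for p in pages]
    return {
        "avg_in": mean(indeg[p] for p in pages),
        "avg_out": mean(outdeg[p] for p in pages),
        "avg_degree": mean(degree),
        "std_degree": pstdev(degree),  # population std. dev. (assumption)
        "median_degree": median(degree),
    }


# Hypothetical toy site with four pages and six internal links.
toy_links = [
    ("home", "a"), ("home", "b"), ("home", "c"),
    ("a", "home"), ("b", "home"), ("c", "a"),
]
stats = degree_stats(toy_links)
```

Because every internal link is counted once as an outgoing link and once as an incoming link, `avg_in` and `avg_out` coincide for any link list, which is why only the combined degree is informative when comparing sites.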