Table 2. Sizes of audit offices’ websites.
Table 2 enumerates the size of the crawl for each website, together with the estimated size of these sites as provided by Google, Yahoo and MSN. It can be seen that Google’s estimates are usually larger than MSN’s, which are usually larger than Yahoo’s. The source of this discrepancy is unknown but is probably connected to the size of the underlying index as well as the associated estimation technique.5 Reliable figures for the actual size of e-government websites are difficult to obtain. We are unaware of any government figures pertaining to this. While the number of pages we crawled is on the low side, personal correspondence with contacts at the UK audit office suggests that (at least for the UK) our crawl was exhaustive.
Web a path exists only for 25% of all pairs of nodes, i.e. in 75% of cases it is not possible to navigate between two random pages.
Broder et al. also described the structure of the Web graph. At the centre is a strongly connected component (SCC) with a path between every pair of nodes. The IN component contains nodes that have a path to nodes in the SCC but cannot be reached from the SCC. In the same way, nodes that are reachable via a path from nodes in the SCC but have no path back to it form the OUT component.
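The decomposition described above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the study: it assumes the site is given as a list of (source, target) link pairs and that a seed page known to lie in the central SCC (for example the top page) is available. The names `reachable`, `bow_tie` and the example URLs are hypothetical.

```python
from collections import defaultdict, deque


def reachable(adj, start):
    """Return the set of nodes reachable from start by following edges in adj."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(edges, seed):
    """Split the nodes connected to seed into SCC, IN and OUT components.

    Assumption: seed lies in the central strongly connected component.
    """
    forward_links = defaultdict(set)   # page -> pages it links to
    backward_links = defaultdict(set)  # page -> pages linking to it
    for src, dst in edges:
        forward_links[src].add(dst)
        backward_links[dst].add(src)

    forward = reachable(forward_links, seed)    # pages the seed can reach
    backward = reachable(backward_links, seed)  # pages that can reach the seed
    scc = forward & backward
    return {"SCC": scc, "IN": backward - scc, "OUT": forward - scc}


# Toy site: two mutually linked pages, one report that links nowhere,
# and one page that links into the core but is not linked from it.
links = [
    ("home", "about"),
    ("about", "home"),
    ("about", "report.pdf"),
    ("orphan", "home"),
]
parts = bow_tie(links, "home")
```

In this toy graph, `home` and `about` form the SCC, `report.pdf` falls into OUT, and `orphan` falls into IN.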
Table 3. Percentage of pages in the strongly connected component (SCC) and in the OUT component, for both the entire site and the “navigable” site (i.e. without .pdf, .doc and image files), and percentage of documents removed by this filtering operation.
4. EXPERIMENTAL RESULTS
We analyzed the five datasets (CA, CZ, NZ, UK, US) for the audit offices of Canada, the Czech Republic, New Zealand, the UK and the USA, respectively. We examined both the internal structure and the external connectivity. Section 4.1 summarizes the internal structure, which is related to navigability, and Section 4.2 summarizes external connectivity, which we interpret as a measure of an institution’s nodality.
4.1 Internal Structure
The internal structure of a website can have a significant effect on its navigability when users only use the hyperlinks to navigate the site. For the five datasets, we examined (i) the size of the connected components, (ii) the average distance between randomly selected pairs of nodes and their distribution, and (iii) the distribution of links within a site. These three properties can clearly affect navigability and are discussed in detail below.
4.1.1 Connected Components
A major advantage of hyperlinked environments is that they permit the user to navigate from one document to another and reach related documents. Which documents a user can reach on a website is primarily determined by the links that are included in each Web page. The drawback of not having any links from a page becomes obvious when we realize that a substantial proportion of users arrive via a search engine and therefore cannot use the back button to continue exploring the site. This seems to be supported by preliminary results of our user studies, which show users starting navigation not from the top page but from deep inside a site (after arriving from a search engine) and then performing non-trivial navigation.
In terms of a user navigating a site via its links, any node contained in the SCC is reachable from any other node in the SCC, although the path from one node to another may be long (see Section 4.1.2). In contrast, nodes in the OUT component “trap” a user since it is not possible to reach the nodes in the SCC from there. None of our datasets contain an IN component because the crawls started at the top page of the site that is part of the SCC. Table 3 summarizes the sizes of the SCC and OUT components in our datasets. Since our crawler does not parse .pdf and .doc files for hyperlinks, these documents are always in the OUT component. Thus, websites that provide many reports in pdf or doc format may appear to have a small SCC. We therefore repeated our analysis of the datasets without pdf, doc and image pages so that only navigable pages were included in the analysis.
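The comparison between the entire site and the navigable site can be illustrated as follows. This is a sketch under stated assumptions rather than the analysis code behind Table 3: the extension list `NON_NAVIGABLE`, the helper names and the toy pages are all hypothetical, and the SCC is taken to be the component containing the seed (top) page.

```python
from collections import defaultdict, deque

# Assumed list of extensions treated as non-navigable (pdf, doc and images).
NON_NAVIGABLE = (".pdf", ".doc", ".gif", ".jpg", ".png")


def _reach(adj, start):
    """Breadth-first search: all nodes reachable from start via adj."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def navigable(pages):
    """Keep only pages whose URL does not end in a non-navigable extension."""
    return {p for p in pages if not p.lower().endswith(NON_NAVIGABLE)}


def scc_share(pages, edges, seed):
    """Percentage of pages that lie in the SCC containing seed."""
    pages = set(pages)
    fwd, bwd = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        # Ignore links that touch pages excluded from this analysis.
        if src in pages and dst in pages:
            fwd[src].add(dst)
            bwd[dst].add(src)
    scc = _reach(fwd, seed) & _reach(bwd, seed)
    return 100.0 * len(scc) / len(pages)


# Toy site: two HTML pages that link to each other and to two reports.
pages = {"/", "/about", "/r1.pdf", "/r2.pdf"}
edges = [("/", "/about"), ("/about", "/"), ("/about", "/r1.pdf"), ("/", "/r2.pdf")]

entire = scc_share(pages, edges, "/")
filtered = scc_share(navigable(pages), edges, "/")
removed = 100.0 * (len(pages) - len(navigable(pages))) / len(pages)
```

On this toy site the SCC covers half of the entire site but all of the navigable site, and the filter removes half of the documents, which mirrors how a site with many reports can appear to have a small SCC.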
We expect that sites with a large SCC will be easier to navigate than sites with a small SCC because a user will not get trapped on a page that does not provide any link back to the central core of the site. Our analysis of the entire site reveals that approximately 50% of the US, NZ and CA sites are strongly connected compared with only 34% for the UK and CZ.
Since a user can enter a site at an arbitrary location provided by a search engine, a very simple assumption is that a good website should make it possible to navigate to any other page on the site via its hyperlinks, in other words that there is a path between any two pages. In fact, Broder et al. established that for the
When only navigable content is considered, the SCC increases according to the number of pdf and Word documents on the site. New Zealand’s ranking improves dramatically, with almost 90% of its site forming part of the SCC. The good performance of the US is especially noteworthy as it is the largest site of all.
There are several potential drawbacks to comparing these numbers. One open question is whether some documents that could not be parsed for links by our crawler (for example pdf and