The Web Structure of E-Government - Developing a Methodology for Quantitative Evaluation






However, if we combine the three rankings that include all five countries in our sample (awarding 5 points for each 1st place, 4 for each 2nd place, and so on, then summing), the US and Canada emerge at the top (with 13 and 12 points respectively, out of a possible 15), followed by the UK and New Zealand (with 9 and 8). The Czech Republic occupies last place (3). Consequently, while we may not be able to discriminate between the US and Canada, we are interested in determining whether structural metrics reveal distinctions between the US and Canada on the one hand and the less well-performing countries on the other.
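The point scheme described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name `rank_points` and the orderings of the placeholder countries A to E are hypothetical and are not the rankings used in this paper.

```python
from collections import defaultdict

def rank_points(rankings, n_places=5):
    """Sum points over several rankings.

    Each ranking is a list ordered from best to worst; 1st place earns
    n_places points, 2nd place n_places - 1, and so on.
    """
    totals = defaultdict(int)
    for ranking in rankings:
        for position, country in enumerate(ranking):
            totals[country] += n_places - position
    return dict(totals)

# Hypothetical orderings for illustration only (NOT the paper's data).
example_rankings = [
    ["A", "B", "C", "D", "E"],
    ["B", "A", "C", "D", "E"],
    ["A", "B", "D", "C", "E"],
]
scores = rank_points(example_rankings)
print(sorted(scores.items(), key=lambda item: -item[1]))
```

With three rankings of five countries, the best possible total is 15 and the totals across all countries always sum to 45, which is consistent with the scores reported above (13 + 12 + 9 + 8 + 3).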

2.2 Previous Web Metric Work

There have been many studies concerned with website usability from a user perspective. We do not review this literature here, as it is outside the scope of this paper, which is concerned only with the link structure of the sites and their neighborhood.

The idea of a link as an endorsement, inspired by bibliometrics, has been successfully applied to a wide range of problems from ranking algorithms [26][20][30][31], through focused crawling [2] to web page classification and clustering [30][32][33].

There have also been extensive studies investigating the structure of the Web [24][27][35], as well as proposals for its generative models [10][27][28][36], all of which noted the scale-free structure of the network.

Usually, the study of hyperlink structure has focused on academic networks [3][19]. Studies have benefited greatly from the methods developed for social network analysis (see for example [17]) and in recent years researchers from various areas have tried to apply these methods to the Internet by interpreting the relation between actors through the hyperlink connections of their websites [18].

The application of computer science methods to the study of politics on the web, and e-government in particular, is not yet very common, although there are some notable exceptions. For example, Hindman et al. [10] studied the communities surrounding political sites and showed that (i) the number of incoming links is highly correlated with the number of actual users and (ii) online communities are usually dominated by a few sites, the winners who take all the attention. Overall, applications of computer science, and especially Web metrics, to the quantitative evaluation of e-government have not been reported.


For purposes of clarity, Section 3.1 briefly defines the vocabulary used to describe the Web. Section 3.2 then describes the datasets used in our evaluation and how these datasets were acquired.

3.1 Definition of Terms

Networks have been studied in a variety of different fields, including computer science and social studies. The diversity of these disciplines has led to a diversity of vocabulary, so a brief definition of terms seems useful.

A network consists of a set of nodes and a set of directed links that connect pairs of nodes. In our case, the nodes are documents retrieved from the Internet and the links are hyperlinks that can be used to navigate between these documents. Every link has a link source, which is the node from which it originates, and a link target, which is the node to which it points. A node can be described in terms of its links: every node has an indegree, which represents the number of links for which the node is a link target, and an outdegree, which represents the number of links for which the node is the link source. The sum of the indegree and outdegree is the degree of a node. A node with non-zero indegree is receiving links or has inlinks, while a node with non-zero outdegree has outlinks or is pointing to another node.
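As a minimal sketch of these definitions, the following Python fragment computes indegree, outdegree and degree from a list of directed links. The function name `degrees` and the three-node example network are assumptions made for illustration; they are not part of the datasets described in this paper.

```python
from collections import Counter

def degrees(links):
    """Return {node: (indegree, outdegree, degree)} for directed links.

    links is an iterable of (source, target) pairs.
    """
    indegree = Counter()
    outdegree = Counter()
    for source, target in links:
        outdegree[source] += 1   # the source gains an outlink
        indegree[target] += 1    # the target gains an inlink
    nodes = set(indegree) | set(outdegree)
    return {n: (indegree[n], outdegree[n], indegree[n] + outdegree[n])
            for n in nodes}

# Hypothetical network: a links to b and c, and b links to c.
example_links = [("a", "b"), ("a", "c"), ("b", "c")]
node_degrees = degrees(example_links)
print(node_degrees["c"])  # (indegree, outdegree, degree) of node c
```

In this example node c has inlinks but no outlinks, while node a has outlinks but no inlinks.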

We distinguish different entities on the Web that are ranked in a hierarchy. The smallest entity is a document on the Web, identified by a Uniform Resource Locator (URL). A set of documents constitutes a website. A website is primarily conceptual and depends very much on the perception of the user or the author of the site. It is difficult to identify websites automatically.1 In this paper, we will assume that a website is identified by a host, i.e. everything between http:// and the next slash, stripped of the generic ‘www’ prefix and any trailing port number. Although not perfect, this automated approach seems sufficient [21]. Hosts can be classified according to domain level. Reading from the right, every combination of letters preceded by a dot constitutes a level of domain.
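The host-based definition of a website can be sketched with Python's standard `urllib.parse` module. The helper names `site_host` and `domain_levels` are hypothetical, and the port number in the first example URL is added only to show the stripping step. This is one plausible reading of the rule stated above, not the code used in the study.

```python
from urllib.parse import urlsplit

def site_host(url):
    """Return the host of a URL without a leading 'www.' or a port."""
    host = urlsplit(url).hostname or ""   # hostname excludes the port
    if host.startswith("www."):
        host = host[len("www."):]
    return host

def domain_levels(host):
    """List the domain levels of a host, reading from the right."""
    return list(reversed(host.split(".")))

cs_host = site_host("http://www.cs.ucl.ac.uk:80/index.html")
spp_host = site_host("http://www.ucl.ac.uk/spp/")
print(cs_host, spp_host, domain_levels(cs_host))
```

Under this rule the Department of Computer Science is treated as its own website (`cs.ucl.ac.uk`), whereas the School of Public Policy is folded into `ucl.ac.uk`, which illustrates why the approach is described as imperfect.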

3.2 Datasets

As noted above, e-government sites can be very large. Furthermore, it is very difficult to define what e-government encompasses. For example, should the military, local government, broadcasting or health services be included? These issues have to be resolved before analysis at the level of a whole government can take place. For this preliminary study, we therefore selected a pilot agency, the government audit office of each country, on which to test our methodology. These audit offices have well-defined and comparable roles and responsibilities, and all operate sites that are relatively small compared with the whole government domain. For our research we selected the audit offices of Canada (CA), the UK and the USA as the three major English-speaking countries, as well as New Zealand (NZ) and the Czech Republic (CZ), both to include smaller sites in the sample and to show that the methodology is language independent.

During October 2005, all documents and associated links were collected from the websites using the Nutch 0.7.1 crawler2. We started each crawl from the home page of the associated audit office and collected all pages up to a depth of 18. The crawl was restricted to the audit office domain and the crawler was configured to allow for complete site acquisition.3 The composition of our crawl, according to page type, is shown in Table 2. In addition to crawling, we acquired the URLs pointing to these websites using Google’s reverse lookup capability. It should be noted that the link data provided by Google is limited to the top 1000 results, which affects the highly linked pages in our dataset. Google inlink information has also been shown previously to be of limited reliability4.
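The crawl strategy described above (start at the home page, follow links breadth-first, stay within the site, stop at a fixed depth) can be sketched as follows. This is not Nutch and does not fetch anything from the network: the `outlinks` mapping is a hypothetical stand-in for fetching and parsing pages, the `audit.example` URLs are invented, and both the convention that the home page has depth 0 and the use of an exact host match to restrict the crawl are assumptions.

```python
from collections import deque
from urllib.parse import urlsplit

def crawl(seed, outlinks, max_depth=18):
    """Breadth-first traversal from seed, limited to the seed's host.

    outlinks maps a URL to the list of URLs it links to. Returns a dict
    mapping each collected URL to the depth at which it was first found.
    """
    allowed_host = urlsplit(seed).hostname
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if depth[url] == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in outlinks.get(url, []):
            if urlsplit(target).hostname != allowed_host:
                continue  # skip links that leave the site
            if target not in depth:
                depth[target] = depth[url] + 1
                queue.append(target)
    return depth

# Hypothetical miniature site, for illustration only.
pages = {
    "http://audit.example/": ["http://audit.example/a", "http://other.example/x"],
    "http://audit.example/a": ["http://audit.example/b"],
    "http://audit.example/b": ["http://audit.example/c"],
}
collected = crawl("http://audit.example/", pages, max_depth=2)
print(collected)
```

With `max_depth=2` the page `/c` is not collected, because it is only reachable at depth 3, and the external link is ignored.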


1 For example, the following institutions at University College London (http://www.ucl.ac.uk) use different forms of URL:
http://www.cs.ucl.ac.uk/ (Department of Computer Science)
http://www.ucl.ac.uk/spp/ (School of Public Policy)




3 The non-default parameters were a 5s delay between requests to the same host, and 10,000 attempts to retrieve pages that failed with a soft error. Our crawler followed http links to files, skipping unparseable content.

