
Second, existing high-availability techniques such as redundancy almost completely masked hardware and software failures in Content and ReadMostly. Networks, however, are a surprisingly significant cause of problems, largely because they often provide a single point of failure. ReadMostly, for example, was the only service that used redundant first- and second-level switches within clusters. Also, consolidation in the collocation and network provider industries has increased the likelihood that “redundant” networks will actually share a physical link or switch fairly close (in terms of Internet routing topology) to the collocation site. And because most services use a single network operations center, they sometimes cannot see the network problems between collocation sites or between collocation sites and customers; these problems can be the root of mysterious performance problems or outages.

Finally, an overarching difficulty in diagnosing and fixing problems in Internet services is that multiple administrative entities must coordinate their problem-solving efforts. These entities include the service’s network operations staff, its software developers, the operators of the collocation facilities that the service uses, the network providers between the service and its collocation facilities, and sometimes the customers. Today this coordination is handled almost entirely manually via telephone calls or e-mails between operators.

Tools for determining the root cause of problems across administrative domains are rudimentary (traceroute, for example), and they generally cannot distinguish between types of problems, such as end-host failures and problems on the network segment where a node is located. Moreover, the administrative entity that owns the broken hardware or software controls the tools for repairing a problem, not the service that determines the problem exists or is most affected by it. This results in increased mean time to repair.
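To make the distinction between an end-host failure and a network-segment problem concrete, here is a minimal sketch of one possible heuristic. It is an illustration only, not a tool used in the study: the function name `classify_failure`, its inputs, and the labels it returns are all hypothetical, and it assumes the service can probe other hosts on the same segment as the failed node, which is often not the case across administrative boundaries.

```python
def classify_failure(target_reachable, neighbors_reachable):
    """Guess where a fault lies from reachability probes.

    target_reachable: True if the probed node answered.
    neighbors_reachable: list of booleans, one per other host probed
        on the same network segment (hypothetical input; gathering it
        is the hard part in practice).
    """
    if target_reachable:
        return "no-failure"
    if not neighbors_reachable:
        # Without any neighbor probes the two cases look identical,
        # which is the limitation described in the text.
        return "unknown"
    if any(neighbors_reachable):
        # The segment still carries traffic, so suspect the node itself.
        return "end-host-failure"
    # Nothing on the segment answers, so suspect the shared network path.
    return "segment-problem"


if __name__ == "__main__":
    print(classify_failure(False, [True, True]))
    print(classify_failure(False, [False, False]))
```

Even this simple rule needs probe data from inside another entity's domain, which is why better cross-domain diagnosis depends on cooperation as much as on tooling.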

Clearly, tools and techniques for problem diagnosis and repair across administrative boundaries need to be improved. This issue is likely to become even more important in the era of composed network services built on top of emerging platforms such as Microsoft’s .Net and Sun’s SunOne.


The research we describe here opens a number of avenues for future work by members of the dependability community. Studying additional services and problem reports would increase the statistical significance of our findings with respect to failure causes and service architectures. Studying additional metrics such as Mean Time to Repair and/or the number of users affected by a failure would provide a view of service unavailability that more accurately gauges user impact than does the number-of-outages-caused metric we use here. Finally, one might analyze failure data with an eye toward identifying the causes of correlated failure, to indicate where additional safeguards might be added to existing systems.
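The difference between counting outages and weighing them by repair time or user impact can be sketched in a few lines. The records below are invented purely for illustration and are not data from the study; the field layout and the `summarize` helper are likewise assumptions.

```python
from collections import defaultdict

# Hypothetical outage records: (cause, minutes_to_repair, users_affected).
# All values are made up for illustration.
OUTAGES = [
    ("operator", 30, 1000),
    ("operator", 90, 5000),
    ("network", 240, 20000),
    ("software", 15, 200),
]


def summarize(outages):
    """Return, per cause, the outage count, MTTR, and user-minutes lost."""
    acc = defaultdict(lambda: {"count": 0, "repair_minutes": 0, "user_minutes": 0})
    for cause, minutes, users in outages:
        entry = acc[cause]
        entry["count"] += 1
        entry["repair_minutes"] += minutes
        entry["user_minutes"] += minutes * users
    return {
        cause: {
            "count": e["count"],
            "mttr_minutes": e["repair_minutes"] / e["count"],
            "user_minutes": e["user_minutes"],
        }
        for cause, e in acc.items()
    }


if __name__ == "__main__":
    for cause, stats in summarize(OUTAGES).items():
        print(cause, stats)
```

In this made-up data set, the cause with the most outages is not the one with the most user-minutes lost, which is the kind of divergence that additional metrics would expose.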



David Oppenheimer is a PhD candidate in computer science at the University of California at Berkeley. He received a BSE degree in electrical engineering from Princeton University in 1996 and an MS degree in computer science from the University of California at Berkeley in 2002. His research interests include techniques to improve and evaluate the dependability of cluster and geographically distributed services.

David A. Patterson joined the faculty at the University of California at Berkeley in 1977, where he now holds the Pardee Chair of Computer Science. He is a member of the National Academy of Engineering and is a fellow of both the ACM and the IEEE.

Readers can contact the authors at {davidopp,patterson}@cs.berkeley.edu.



