Architecture and Dependability of Large-Scale Internet Services
Global Deployment of Data Centers
An analysis of the architectures and causes of failure at three large-scale Internet services can help developers plan reliable systems offering maximum availability.
In the past few years, the popularity of large-scale Internet infrastructure services such as AOL, Google, and Hotmail has grown enormously. The scalability and availability requirements of these services have led to system architectures that diverge significantly from those of traditional systems like desktops, enterprise servers, or databases. Given the need for thousands of nodes, cost necessitates the use of inexpensive personal computers wherever possible, and efficiency often requires customized service software. Likewise, addressing the goal of zero downtime requires human operator involvement and pervasive redundancy within clusters and between globally distributed data centers.
Despite these services’ success, their architectures — hardware, software, and operational — have developed in an ad hoc manner that few have surveyed or analyzed [1,2]. Moreover, the public knows little about why these services fail or about the operational practices used in an attempt to keep them running 24/7. Existing failure studies have examined hardware and software platforms not commonly used for running Internet services, or in operational environments unlike those of Internet services. J. Gray, for example, studied fault-tolerant Tandem systems [3]; D. Kuhn studied the public telephone network [4]; B. Murphy and T. Gant examined VAX systems [5]; and J. Xu and his colleagues studied failures in enterprise-scale Windows NT networks [6].
As a first step toward formalizing the principles for building highly available and maintainable large-scale Internet services, we are surveying existing services’
David Oppenheimer and David A. Patterson University of California at Berkeley
IEEE INTERNET COMPUTING
SEPTEMBER • OCTOBER 2002