Table 1. Comparison of Internet service application characteristics with those of traditional applications.

Traditional desktop applications
  Example applications: productivity applications, games
  Typical hardware platform: PC
  Software model: off-the-shelf, standalone
  Software release frequency: months
  Networking environment: none (standalone)
  Typical physical environment: home or office
  Key metrics: functionality, interactive latency

Traditional high-availability applications
  Example applications: database, enterprise messaging
  Typical hardware platform: fault-tolerant server or failover cluster
  Software model: off-the-shelf, multitier
  Software release frequency: months
  Networking environment: enterprise-scale
  Operations staff: corporate IT staff
  Typical physical environment: corporate machine room
  Key metrics: availability, throughput

Large-scale Internet service applications
  Example applications: e-mail, search, news, e-commerce, data storage
  Typical hardware platform: clusters of hundreds to thousands of cheap PCs, often geographically distributed
  Software model: customized, multitier
  Software release frequency: days or weeks
  Networking environment: within cluster, on the Internet between data centers, and to and from user clients
  Operations staff: service operations staff, data center/collocation site staff
  Typical physical environment: collocation facility
  Key metrics: availability, functionality, scalability, manageability, throughput
architectures and dependability. This article describes our observations to date.
A New Application Environment
The dominant software platform is evolving from one of shrink-wrapped applications installed on end-user PCs to one of Internet-based application services deployed in large-scale, globally distributed clusters. Although they share some characteristics with traditional desktop and high-availability software, services comprise a unique combination of building blocks, with different requirements and operating environments. Table 1 summarizes the differences between the three types of applications.
Users increasingly see large-scale Internet services as essential to the world’s communications infrastructure and demand 24/7 availability. As Table 1 suggests, however, these services present an availability challenge because they

- typically comprise many inexpensive PCs that lack expensive reliability features;
- undergo frequent scaling, reconfiguration, and functionality changes;
- often run custom software that has undergone limited testing;
- rely on networks within service clusters, between geographically distributed collocation facilities, and between collocation facilities and end-user clients; and
- aim to be available 24/7 for users worldwide, making planned downtime undesirable or impossible.
On the other hand, service architects can exploit some features of large-scale services to enhance availability:

- Plentiful hardware allows for redundancy.
- Geographic distribution of collocation facilities allows control of environmental conditions and resilience to large-scale disasters.
- In-house development means that operators can learn the software’s inner workings from its developers, and can thus identify and correct problems more quickly than IT staffs running shrink-wrapped software that they see only as a black box.
We examine some techniques that exploit these characteristics.
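The payoff from plentiful, cheap hardware can be made concrete with a back-of-the-envelope calculation (our illustration, not a figure from the services studied): if any one replica can serve requests and replica failures are independent, an idealizing assumption, then redundancy multiplies nines.

```python
def system_availability(node_availability: float, replicas: int) -> float:
    """Availability of a service that is up whenever at least one replica
    is up, assuming independent replica failures (an idealization)."""
    return 1.0 - (1.0 - node_availability) ** replicas

# A single cheap PC that is up 99 percent of the time...
print(round(system_availability(0.99, 1), 6))  # 0.99
# ...but three such replicas together exceed "five nines":
print(round(system_availability(0.99, 3), 6))  # 0.999999
```

In practice correlated failures (shared switches, power, or software bugs) keep real systems well below this bound, which is one reason the services combine in-cluster redundancy with geographic distribution.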
Maintainability is closely related to availability. Not only do services rely on human operators to keep them running, but the same issues of scale, complexity, and rate of change that lead to failures also make the services difficult to manage. Yet, we found that existing tools and techniques for configuring systems, visualizing system configurations and the changes made to them, partitioning and migrating persistent data, and detecting, diagnosing, and fixing problems are largely ad hoc and often inadequate.
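One basic building block behind automated problem handling is routing requests around replicas that fail a health probe. The sketch below is our minimal illustration, not code from any of the services studied; the replica names and the probe callable are assumptions.

```python
class ReplicaPool:
    """Round-robin dispatcher that skips replicas failing a health probe.

    Hypothetical sketch: replica identifiers and the probe function are
    illustrative assumptions, not details from the surveyed services.
    """

    def __init__(self, replicas, probe):
        self.replicas = list(replicas)
        self.probe = probe  # callable: replica -> bool (True if healthy)
        self.next = 0

    def dispatch(self):
        """Return the next healthy replica in round-robin order."""
        for _ in range(len(self.replicas)):
            replica = self.replicas[self.next % len(self.replicas)]
            self.next += 1
            if self.probe(replica):
                return replica
        raise RuntimeError("no healthy replicas")

down = {"pc-07"}  # simulate one failed node
pool = ReplicaPool(["pc-01", "pc-07", "pc-12"], probe=lambda r: r not in down)
print(pool.dispatch())  # pc-01
print(pool.dispatch())  # pc-07 fails the probe, so pc-12
```

Even this simple pattern shows why management stays hard: the dispatcher masks a failure but does nothing to diagnose or repair it, which is exactly the gap the ad hoc tools we observed leave to human operators.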
Architectural Case Studies
To demonstrate the general principles commonly used to construct large-scale Internet services, we examine the architectures of three representative
SEPTEMBER • OCTOBER 2002
IEEE INTERNET COMPUTING