This testing and deployment methodology integrates the operations team from the very beginning, treating operators as first-class users of the software, just as end users are.
Deployment

Deploying software to large clusters requires automation for both efficiency and correctness. The three services use tools developed in-house to deploy and upgrade production software and configurations for applications and operating systems. They all use rolling upgrades to deploy software to their clusters.1 All three use version control during software development, and Online and ReadMostly use it to manage configuration as well. Online always keeps two versions of the service software installed on every machine, allowing administrators to revert to an older version in less than five minutes if a new installation goes awry.
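The rolling-upgrade and two-version rollback scheme described above can be sketched as follows. This is a minimal illustration, not the services' actual in-house tooling; the `Node` class, the health-check callback, and the version strings are all hypothetical.

```python
class Node:
    """A cluster node that keeps two software versions installed side by side."""

    def __init__(self, name):
        self.name = name
        self.versions = []   # installed versions, newest last (at most two)
        self.active = None

    def install(self, version):
        # Keep only the two most recent versions on disk.
        self.versions = (self.versions + [version])[-2:]
        self.active = version

    def rollback(self):
        # Revert to the previously installed version. Because the old bits
        # are already on disk, this takes minutes rather than a full redeploy.
        if len(self.versions) == 2:
            self.active = self.versions[0]


def rolling_upgrade(nodes, version, healthy):
    """Upgrade one node at a time; on a failed health check, roll that node
    back and abort the rollout so the rest of the cluster is untouched."""
    for node in nodes:
        node.install(version)
        if not healthy(node):
            node.rollback()
            return False
    return True
```

Upgrading node by node keeps most of the cluster serving traffic at the old version while each new installation is validated, which is the property that makes rolling upgrades attractive for always-on services.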
Daily Operation

Service monitoring, problem diagnosis, and repair are significant challenges for large-scale Internet services. Reasons for this include:

- the frequency of software and configuration changes,
- the scale of the services,
- the unacceptability of taking the service down to localize and repair problems, and
- complex interactions among application software and configuration; operating system software and configuration; collocation site networks; networks connecting customer sites to the main service site; networks connecting collocation sites; and operators at collocation sites, Internet service providers, and the service's operations center.
Back-end data partitioning presents a key challenge to maintainability. Data must be distributed among back-end storage nodes to balance load and to avoid exceeding server storage capacity. As the number of back-end machines grows, operators generally repartition back-end data to incorporate the new machines. Currently, humans make partitioning and repartitioning decisions, and the repartitioning process is automated by tools that are at best ad hoc.
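Consistent hashing is one standard technique for this problem — not necessarily what the surveyed services used — that bounds how much data must move when machines are added: incorporating a new node relocates only roughly 1/N of the keys instead of forcing a full repartition. A minimal sketch, with hypothetical node names:

```python
import bisect
import hashlib


def _hash(key):
    # Stable hash so key placement is deterministic across processes.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class ConsistentHashRing:
    """Map data keys to back-end nodes via positions on a hash ring."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes   # virtual nodes per server, for load balance
        self._keys = []        # sorted positions on the ring
        self._map = {}         # position -> node name
        for n in nodes:
            self.add_node(n)

    def add_node(self, node):
        for i in range(self.vnodes):
            pos = _hash(f"{node}#{i}")
            bisect.insort(self._keys, pos)
            self._map[pos] = node

    def lookup(self, key):
        # The first ring position clockwise from the key's hash owns the key.
        i = bisect.bisect(self._keys, _hash(key)) % len(self._keys)
        return self._map[self._keys[i]]
```

With four nodes, adding a fifth moves only about a fifth of the keys; the rest stay where they are, which is exactly the data-movement property an operator-driven repartition struggles to achieve by hand.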
IEEE INTERNET COMPUTING
Large-Scale Internet Services

The scale, complexity, and speed of Internet service evolution lead to frequent installations, upgrades, problem diagnosis, and system component repair. These tasks require a deeper understanding of the internals of the application software than is required to install and operate slowly evolving, thoroughly documented enterprise or desktop software. The Internet services we studied encourage frequent dialogue between operators and developers during upgrades, problem diagnosis, and repair — for example, when distinguishing between an application crash caused by a software bug (generally the responsibility of software developers) and one caused by a configuration error (generally the responsibility of the operations team). As a result, these services intentionally blur the line between “operations” and “software development” personnel.
Despite automated tasks such as monitoring and deploying software to a cluster, human operators are key to service evolution and availability. Moreover, because an operator is as much a user of Internet service software as is an end user, developers should integrate operator usability throughout the software development process. This is especially true because many operational tasks are not fully automated or not automated at all — such as configuring and reconfiguring software and network equipment, identifying a problem's root cause, and fixing hardware, software, and networking issues. Conversely, much opportunity exists for more sophisticated and standardized problem diagnosis and configuration management tools, as all of the services we surveyed used some combination of their own customized tools and manual operator action for these tasks.
Our architectural and operational study reveals pervasive use of hardware and software redundancy at the geographic and node levels, and 24/7 operations centers staffed by personnel who monitor and respond to problems. Nonetheless, outages at large-scale Internet services are relatively frequent.
Because we are interested in why and how large-scale Internet services fail, we studied individual problem reports rather than aggregate availability statistics. The operations staff of all three services uses problem-tracking databases to record information about component and service failures. Online and Content gave us access to these databases; ReadMostly gave us access to their problem postmortem reports.
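A problem-tracking database of the kind described can be modeled, at its simplest, as a list of failure records that supports the sort of aggregate analysis performed in this study. The record fields below are illustrative assumptions, not the schema of any of the three services' databases:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ProblemReport:
    """One entry in a problem-tracking database (fields are hypothetical)."""
    component: str        # e.g. "front-end", "back-end", "network"
    cause: str            # e.g. "software bug", "configuration error"
    caused_outage: bool   # did the component failure become a service failure?


def failures_by_component(reports):
    """Count component failures, and service-visible outages, per component."""
    failures = Counter(r.component for r in reports)
    outages = Counter(r.component for r in reports if r.caused_outage)
    return failures, outages
```

Separating component failures from the subset that became service failures is what lets an analysis like this one ask which components fail most often versus which failures actually cost availability.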
Component and Service Failures

We define a component failure as a failure of some service component to behave as expected — a hard drive failing to spin up when it is power-cycled, a
SEPTEMBER • OCTOBER 2002