Global Deployment of Data Centers
[Figure 4: bar chart of the number of incidents per failure category (for example, node hardware), with one series for component failures and one for system failures.]
Figure 4. Component failures and resulting service failures for Content site. For all categories but node operator, 11 percent or fewer component failures led to a service failure.
Failure type             Content   ReadMostly
Node operator                 45            5
Network operator               5           14
Node hardware                  0            0
Network hardware               0           10
Node software                 25            5
Network software               0           19
Unknown node error             0            0
Unknown network error         15           33
Environment                    0            0
Table 4. Failure cause by type for Content and ReadMostly (in percentages).
software crash, an operator misconfiguring a network switch, and so on. A component failure causes a service failure if it theoretically prevents an end user from accessing the service or a part of the service (even if the user receives a reasonable error message) or if it significantly degrades a user-visible aspect of system performance. We say “theoretically” because we infer a component failure’s impact from various aspects of the problem report, such as the components that failed and any customer complaint reports, combined with knowledge of the service architecture.
SEPTEMBER • OCTOBER 2002
We studied 18 service failures from Online (based on 85 component failures), 20 from Content (based on 99 component failures), and 21 from ReadMostly. The problems corresponded to the service failures during four months at Online, one month at Content, and six months at ReadMostly. A detailed analysis of this data appears elsewhere [7]; here we highlight a few data attributes from Content and ReadMostly.
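As a rough worked example, the sample sizes above can be turned into per-month failure counts and into the fraction of examined component failures that became service failures. This is only a sketch: the `StudySample` class and its method names are hypothetical, and it reads "based on N component failures" as the number of component failures examined at that service. No component-failure count is given for ReadMostly, so that field is left empty rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StudySample:
    """Counts quoted in the text for one service (class name is hypothetical)."""
    service: str
    service_failures: int
    component_failures: Optional[int]  # None where the text gives no count
    months: int

    def service_failures_per_month(self) -> float:
        return self.service_failures / self.months

    def escalation_fraction(self) -> Optional[float]:
        """Share of component failures that became service failures."""
        if self.component_failures is None:
            return None
        return self.service_failures / self.component_failures


SAMPLES = [
    StudySample("Online", 18, 85, 4),
    StudySample("Content", 20, 99, 1),
    StudySample("ReadMostly", 21, None, 6),
]

for s in SAMPLES:
    frac = s.escalation_fraction()
    frac_text = "n/a" if frac is None else f"{frac:.1%}"
    print(f"{s.service}: {s.service_failures_per_month():.1f} service failures/month, "
          f"{frac_text} of component failures escalated")
```

Under this reading, roughly one in five component failures at Online and at Content turned into a service failure.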
We attributed a service failure to the first component that failed in the chain of events leading to the failure. We categorized this root-cause component by location (front end, back end, network, unknown) and type (node hardware, node software, network hardware, network software, operator, environment, overload, and unknown). Front-end nodes are initially contacted by end user clients and, at Content, by client proxy nodes. Front-end nodes do not store persistent data, but they may cache data. Back-end nodes store persistent data. We include the “business logic” of traditional three-tier system terminology in our front-end definition because these services integrate their service logic with the code that receives and replies to user client requests.
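The attribution rule above can be sketched as a small data model. The location and type values come from the text; everything else (the class names, the `timestamp` field, and the sample chain of events) is a hypothetical illustration, not the tooling used in the study.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Location(Enum):
    FRONT_END = "front end"
    BACK_END = "back end"
    NETWORK = "network"
    UNKNOWN = "unknown"


class FailureType(Enum):
    NODE_HARDWARE = "node hardware"
    NODE_SOFTWARE = "node software"
    NETWORK_HARDWARE = "network hardware"
    NETWORK_SOFTWARE = "network software"
    OPERATOR = "operator"
    ENVIRONMENT = "environment"
    OVERLOAD = "overload"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class ComponentFailure:
    timestamp: float        # hypothetical: seconds since the start of the incident
    component: str
    location: Location
    failure_type: FailureType


def root_cause(chain: List[ComponentFailure]) -> ComponentFailure:
    """Return the first component that failed in the chain of events."""
    if not chain:
        raise ValueError("a service failure needs at least one component failure")
    return min(chain, key=lambda f: f.timestamp)


# Invented chain of events, for illustration only.
chain = [
    ComponentFailure(42.0, "web server process", Location.FRONT_END,
                     FailureType.NODE_SOFTWARE),
    ComponentFailure(0.0, "misconfigured switch", Location.NETWORK,
                     FailureType.OPERATOR),
]
cause = root_cause(chain)
print(cause.location.value, "/", cause.failure_type.value)
```

In this invented chain the later front-end software crash is a consequence, so the service failure is attributed to the operator error in the network.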
Using Content data, we observe that the rate at which component failures turn into service failures depends on the reason for the original problem, as Figure 4 shows. We list only those categories for which we classified at least three component failures (operator error related to node operation, node hardware failure, node software failure, and network failure of unknown cause).
Most network failures have no known cause because they involve problems with Internet connections, between collocation facilities, or between customer proxy sites and collocation facilities. For all but the node operator case, 11 percent or fewer component failures became service failures. Fully half of the 18 operator errors resulted in service failure, suggesting that operator errors are significantly more difficult to mask using existing high-availability techniques such as redundancy. Table 4 lists service failures by type.
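The comparison above is simple arithmetic, sketched here with exact fractions. Only the figures stated in the text are used (18 operator errors, half of which became service failures, and the 11 percent bound for the other categories); the function and constant names are hypothetical.

```python
from fractions import Fraction

# "11 percent or fewer" bound quoted for every category except node operator.
MASKING_BOUND = Fraction(11, 100)


def escalation_rate(component_failures: int, service_failures: int) -> Fraction:
    """Fraction of component failures that became service failures."""
    if component_failures <= 0:
        raise ValueError("need at least one component failure")
    if not 0 <= service_failures <= component_failures:
        raise ValueError("service failures must be between 0 and component failures")
    return Fraction(service_failures, component_failures)


def max_service_failures_within_bound(component_failures: int,
                                      bound: Fraction = MASKING_BOUND) -> int:
    """Largest service-failure count that keeps the rate at or below the bound."""
    return (component_failures * bound.numerator) // bound.denominator


# Half of the 18 operator errors resulted in service failure.
operator_rate = escalation_rate(18, 9)
ratio = operator_rate / MASKING_BOUND

print(f"operator escalation rate: {float(operator_rate):.0%}")
print(f"at least {float(ratio):.1f}x the rate of any other listed category")
print("service failures allowed by the bound for 18 component failures:",
      max_service_failures_within_bound(18))
```

Put differently, a category with 18 component failures could produce at most one service failure and still stay within the 11 percent bound, whereas operator error produced nine.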
Observations
Our analysis of the failure data so far leads us to several observations. First, operator error is the largest single root cause of failures for Content, and the second largest for ReadMostly. In general, failures arose when operators made changes to the system, for example, scaling or replacing hardware or reconfiguring, deploying, or upgrading software. Most operator errors occurred during normal maintenance, not while an operator was in the process of fixing a problem. Despite the huge contribution of operator error to service failures, developers almost completely overlook these errors when designing high-dependability systems and the tools to monitor and control them.
IEEE INTERNET COMPUTING