service). The service further load balances each cluster internally using layer-4 load balancing.
A general principle in service building is to use multiple levels of load balancing: among geographically distributed sites, among front ends in a cluster, and possibly among subsets of a cluster.
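As a purely illustrative sketch of these levels, the snippet below first picks a site for a client and then a front end within that site's cluster. The site names, the hash-based site choice, and the round-robin policy are assumptions for the example, not the surveyed services' actual mechanisms.

```python
import hashlib

# Hypothetical two-level load balancer: level 1 chooses among
# geographically distributed sites; level 2 chooses a front end
# inside the chosen site's cluster.

SITES = {
    "us-east": ["fe1", "fe2", "fe3"],
    "us-west": ["fe4", "fe5"],
}

def pick_site(client_ip: str) -> str:
    """Level 1: DNS-style choice among sites (here, a hash of the client IP)."""
    digest = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
    names = sorted(SITES)
    return names[digest % len(names)]

def pick_front_end(site: str, request_id: int) -> str:
    """Level 2: layer-4-style round-robin within the chosen cluster."""
    cluster = SITES[site]
    return cluster[request_id % len(cluster)]

site = pick_site("203.0.113.7")
fe = pick_front_end(site, request_id=42)
```

A third level, among subsets of a cluster, would simply repeat the same pattern one step down.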
Front end. Once the service makes the load-balancing decision for a request at the cluster level, it sends the request to a front-end machine. The front-end servers run stateless code that answers incoming user requests, contacts one or more back-end servers to obtain the data necessary to satisfy the request, and in some cases returns that data to the user (after processing it and rendering it into a suitable presentation format). We say “in some cases” because some services, such as Content, return data to clients directly from the back-end server. The exact ratio of back-end to front-end servers depends on the type of service and the performance characteristics of the back-end and front-end systems. The ratio ranged from 1:10 for Online to 50:1 for ReadMostly.
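The statelessness of such a front end can be sketched as follows. The back-end “services” here are stand-in stubs, and the string concatenation is a placeholder for real presentation logic; nothing about a user survives between calls.

```python
# Minimal sketch of a stateless front end: it holds no per-user state,
# contacts one or more back ends for the data a request needs, and
# renders the result. Real services would use RPC over TCP instead of
# these stub lambdas.

BACKENDS = {
    "mail": lambda user: f"inbox for {user}",
    "prefs": lambda user: f"prefs for {user}",
}

def handle_request(user: str, wants: list) -> str:
    parts = [BACKENDS[name](user) for name in wants]  # contact back ends
    return " | ".join(parts)  # render into a presentation format
```

Because the handler keeps no state, any front-end machine in the cluster can serve any request, which is what makes the level-2 load balancing above safe.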
Depending on its needs, the service might partition the front-end machines by functionality and/or data. Content and ReadMostly use neither form of partitioning; all front-end machines access data from all back-end machines. Online, however, does both. It partitions for functionality by using three kinds of front-end machines: front ends for stateful services, front ends for stateless services, and Web proxy caches, which are essentially front ends to the Internet because they act as Internet content caches. For data, Online further partitions the front ends to stateful services into service groups.
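Online's two-way partitioning might be sketched like this. The group count and the user-ID-modulo assignment rule are hypothetical, since the article does not specify how users map to service groups.

```python
# Hypothetical router for Online-style partitioning: requests are first
# split by functionality (stateful, stateless, proxy cache), and
# stateful requests are further split by data into service groups.

NUM_SERVICE_GROUPS = 4  # illustrative; the real count is not given

def route(kind: str, user_id: int = 0) -> str:
    """Route by functionality first, then (for stateful) by data."""
    if kind == "proxy":
        return "web-proxy-cache"
    if kind == "stateless":
        return "stateless-front-end"
    if kind == "stateful":
        group = user_id % NUM_SERVICE_GROUPS  # data partition
        return f"stateful-front-end-group-{group}"
    raise ValueError(f"unknown request kind: {kind}")
```

The point of the second split is locality: every request for a given user's state lands on the same group of front ends and, behind them, the same back-end filer.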
In all cases, the front-end machines were inexpensive PCs or workstations. Table 3 summarizes the three services’ internal architectures. All three use customized front-end software rather than off-the-shelf Web servers for functionality or performance reasons.
Back end. Back-end servers provide services with the data they need to satisfy a request. These servers store persistent data, such as files (Content and ReadMostly) and e-mail, news articles, and user preferences (Online). The back end exhibits the greatest variability in node architecture: Content and ReadMostly use inexpensive PCs, while Online uses file servers designed especially for high availability. Online uses Network Appliance’s Write Anywhere File Layout (WAFL) filesystem,
IEEE INTERNET COMPUTING
Large-Scale Internet Services
[Figure 3 diagram: user requests flow through paired switches to dozens of stateless Web front ends and on to stateful storage back ends (100s to 1,000s), with a paired backup site.]
Figure 3. Architecture of a ReadMostly site. Web front ends direct requests to the appropriate back-end storage server(s). Commodity PC-based storage servers store persistent state, which is accessed via a custom protocol over TCP. A redundant pair of network switches connects the cluster to the Internet and to a twin backup site via a leased network connection.
Content uses a custom filesystem, and ReadMostly uses a Unix filesystem.
In the back end, data partitioning is key to overall service scalability, availability, and maintainability. Scalability takes the form of scale-out — that is, services add back-end servers to absorb growing demands for storage and aggregate throughput. Online, Content, and ReadMostly represent a spectrum of back-end redundancy schemes for improving availability and performance.
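A minimal sketch of scale-out in a partitioned back end: data is spread across servers by key, and adding a server grows both storage and aggregate throughput. The modulo placement rule is an assumption for illustration; the article does not describe the services' actual placement functions.

```python
# Hypothetical key-to-server placement for a partitioned back end.
# Real services may use directories or consistent hashing instead.

def placement(key: int, num_servers: int) -> int:
    """Map a data key to the back-end server that stores it."""
    return key % num_servers

# Growing from 4 to 5 servers spreads the same keys over more
# machines, absorbing increased demand for storage and throughput.
before = {placement(k, 4) for k in range(100)}
after = {placement(k, 5) for k in range(100)}
```

One known cost of this simple rule is that adding a server remaps most keys; that is one reason a real service might prefer a directory or consistent hashing.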
Each Online service group consists of approximately 65,000 users and is assigned to a single back-end Network Appliance filer. Each filer uses a redundant array of inexpensive disks, level 4 (RAID-4), which allows the service group to tolerate a single back-end disk failure. If a single filer fails, however, all users in that group lose access to their data. Moreover, the scheme does not improve performance, because the filer stores only one copy of the data.
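The single-disk fault tolerance RAID-4 gives each filer comes from a dedicated parity disk holding the XOR of the data disks, so any one lost disk can be rebuilt from the survivors. The toy byte-level example below shows the reconstruction; it is an illustration of the RAID-4 principle, not Network Appliance's implementation.

```python
from functools import reduce

def parity(blocks: list) -> int:
    """XOR of all blocks, as stored on RAID-4's dedicated parity disk."""
    return reduce(lambda a, b: a ^ b, blocks)

data_disks = [0b1010, 0b0110, 0b1111]
p = parity(data_disks)  # written to the parity disk

# Disk 1 fails; recover its block from the surviving disks plus parity.
recovered = parity([data_disks[0], data_disks[2], p])
assert recovered == data_disks[1]
```

Note what this does and does not buy: a single disk failure is masked, but the failure of the whole filer is not, which is exactly the exposure the paragraph above describes.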
Content partitions data among back-end storage nodes. Instead of using RAID, the service replicates each piece of data on a storage server at a twin data center. This scheme offers higher theoretical availability than the Online scheme because it masks single disk failure, single storage server failure, and single site failure (due to networking problems, environmental issues, and the like). Content’s back-end data redundancy scheme does not improve performance either, because it assigns only one default site to retrieve data to service user requests, and uses the backup twin site only when
SEPTEMBER • OCTOBER 2002