with alternate semantics for free(), such as region-based alloca- tors .
Streamflow uses segregated object allocation in thread-private heaps, as in several other thread-safe allocators including Hoard , Maged Michael’s lock-free memory allocator , Tcmalloc from Google’s performance tools , LKmalloc , ptmalloc , and Vee and Hsu’s allocator . In particular, Streamflow uses strictly thread-local object allocation, both thread-local and remote deallocation and mechanisms for recycling free page blocks to avoid false sharing and memory blowup [3, 18].
Streamflow differs from earlier multithreaded memory alloca- tors in several critical aspects: First, its design decouples local from remote object deallocation to allow local allocation and deal- location without any atomic instructions. Atomic instructions are used only sparingly for remote object deallocation and for recy- cling page blocks. Second, Streamflow eliminates object headers for small objects, thereby reducing cache pollution and improv- ing spatial locality. Tcmalloc is the only thread-safe allocator we are aware of that uses the same technique, although Tcmalloc uses locks whenever memory has to be allocated from or returned to a global free memory objects pool. Third, Streamflow uses further optimizations for temporal locality, cache-conscious page block layout and better TLB performance. Fourth, unlike many other high performance allocators, Streamflow allows returning memory to the OS when the footprint of the application shrinks.
To our knowledge, Streamflow is the first user-level memory allocator to control the layout of page blocks in physical memory, using superpages as the means to achieve contiguous allocation of each page block in physical memory. It should be noted that superpages are a generic optimization tool and their scope extends beyond just memory allocators [6, 19]. However, since superpages (the size of which is set by the operating system) may subsume multiple page blocks (the size of which is set by the memory allocator) a multiprocessor memory allocator using superpages to achieve cache-conscious of page blocks has certain design choices as to how it manages free memory inside each superpage and how it divides superpages between page blocks from different threads. Streamflow’s design includes some educated choices for effective management and utilization of superpages.
Several of the design goals of Streamflow, in particular its local- ity optimizations, can be achieved with allocators that utilize feed- back from program profiles. For example, earlier work has shown that object lifetime predictors and reference traces can be used to customize small object allocation and object segregation [2, 22, 23]. Streamflow assumes no knowledge of object allocation and access profiles, although its design does not prevent the addition of profile- guided optimization.
3. Design of Streamflow
Streamflow primarily optimizes dynamic allocation of small ob- jects, which is a common bottleneck in many sequential and mul- tithreaded applications, including desktop, server and scientific ap- plications. Streamflow optimizes dynamic allocation for low la- tency and scalability, as well as for temporal locality, spatial lo- cality and cache-conscious layout of data. These optimizations are accomplished via the use of a decoupled local heap architecture, the elimination of object headers, the careful layout of heaps in con- tiguously allocated physical memory and the exploitation of large pages (superpages). At the same time, Streamflow provides mech- anisms that facilitate both memory transfer between local heaps and returning memory to the system. As a result, it is not sensitive to pathologic memory usage patterns, such as producer-consumer ones, that could lead to high memory overhead and pressure.
Streamflow consists of two modules. Its front-end is a multi- threaded memory allocator, which minimizes the overhead of mem-
ory requests by eliminating inter-thread synchronization and all as- sociated atomic operations during common-case memory request patterns. Even in the infrequent cases when synchronization be- tween threads is necessary, it is performed with a single, non- blocking atomic operation2. The front end also includes optimiza- tions for spatial locality, temporal locality, and the avoidance of false-sharing.
The back-end of Streamflow is a locality-conscious page man- ager. This module manages contiguous page blocks, each of which is used by the front-end for the allocation of objects that belong to a given size class. The page manager allocates pages blocks within superpages to achieve contiguous layout of each page block in physical memory, thus reducing self-interference (within page blocks) and cross-interference (between page blocks) in the cache. The use of superpages can also improve the TLB performance and reduce page faults in applications with large memory footprints. Moreover, the Streamflow back-end facilitates the interchange of page blocks between threads, should the memory demand of each thread change during execution.
We describe the front-end multithreaded memory allocator in Section 3.1 and the back-end page manager in Section 3.2. The source code of Streamflow can be downloaded from http://www. cs.wm.edu/streamflow and can be used as a reference through- out this section.
Multithreaded Memory Allocator
Small Object Management
Objects in Streamflow are classified as small if their size does not exceed 2 KB (half a page in our experimental platform). The man- agement of objects larger than 2 KB is described in section 3.1.2. In the following paragraphs we describe Streamflow’s heap architec- ture, the techniques used to eliminate object headers, small object allocation and deallocation procedures and specialized support for recycling memory upon thread termination.
Local heaps: a local heap.
Each thread in Streamflow allocates memory from The heap data structure, shown in Figure 1(a), is
strictly private; only the owner thread can modify it. As a result, the vast majority of simultaneous memory management operations issued by multiple threads can be served simultaneously and inde- pendently, without synchronization. Synchronization is necessary only when the local heap does not have enough free memory avail- able to fulfill a request, or during deallocations, when an object is freed by a thread other than the owner of the heap it was allocated
Local heaps facilitate the reduction of allocator-induced false- sharing between threads, since memory allocation requests by dif- ferent threads are not interleaved in the same memory segment. This technique cannot, however, totally eliminate false-sharing in the presence of object migrations between threads .
Each thread-local heap consists of page blocks, shown in Fig- ure 1(b). Page blocks are contiguous virtual memory areas. Each page block is used for the allocation of objects with sizes that fall into a specific range, which we call an object class. In Streamflow, each object class differs from the previous one by 4 bytes. This de- sign provides for fine-grain object segregation and tends to improve spatial locality in codes that make heavy use of very small objects . One or more page blocks, organized as a doubly linked list, can
2 We use cmp&swap(ptr, old val, new val), which atomically checks that the value in memory address ptr is old val and changes it to new val. If the value is not equal to old val the operation fails. The op- eration may be replayed more than once if it fails. All modern processors offer cmp&swap for 32-bit and 64-bit operands, either as an instruction or as a high level primitive built from simpler instructions, such as load linked- store conditional.