tempt to integrate scalable concurrent allocation mechanisms with cache- and page-conscious object allocation mechanisms in a unified design. Addressing these problems is a central contribution of this paper. We show that both memory management overhead and locality exploitation in thread-safe memory allocators can be improved, compared to what is currently offered by state-of-the-art multiprocessor allocators. These design improvements and the associated performance benefits are also a key contribution of this paper.
We present Streamflow, a thread-safe allocator designed for both scalability and locality. Streamflow’s design is a direct result of eliminating synchronization operations in the common case, while at the same time avoiding the memory blowup when strictly thread-local heaps are used in codes with producer-consumer allocation-freeing patterns. Local operations in Streamflow are synchronization-free. Not only do these operations proceed without thread contention due to locking shared data, but they also proceed without the latency imposed by uncontested locks and atomic instructions. The synchronization-free design of local heaps enables Streamflow to exploit established sequential allocation optimizations which are critical for locality, such as eliminating object headers for small objects and using bump-pointer allocation in page blocks comprising thread-local heaps.
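The headerless, bump-pointer local allocation path described above can be sketched as follows. The structure layout, block size, and names here are illustrative assumptions for this sketch, not Streamflow’s actual implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical thread-local page block. All objects in one block share
 * a single size class, so no per-object header is needed. */
#define BLOCK_SIZE 4096

typedef struct {
    unsigned char data[BLOCK_SIZE];
    size_t        bump;      /* offset of the next free byte */
    size_t        obj_size;  /* size class served by this block */
} page_block;

/* Bump-pointer allocation: one bounds check and one increment.
 * No lock, no atomic instruction, no object header. */
static void *block_alloc(page_block *b) {
    if (b->bump + b->obj_size > BLOCK_SIZE)
        return NULL;         /* block exhausted: caller fetches a new block */
    void *obj = &b->data[b->bump];
    b->bump += b->obj_size;
    return obj;
}
```

Because the block belongs to exactly one thread, this fast path needs no synchronization at all, and consecutive allocations fall on adjacent bytes, which is exactly the spatial-locality benefit the text attributes to bump-pointer allocation.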
Streamflow also includes an innovative remote object deallocation mechanism. Remote deallocations – namely, deallocations of objects by threads different from the ones that initially allocated them – are decoupled from local allocations and deallocations by forwarding remotely freed objects to per-thread, non-blocking, lock-free lists. Streamflow’s remote deallocation mechanism enables lazy object reclamation by the owning thread. As a result, most allocation and deallocation operations proceed without the cost of atomic instructions, and the infrequent operations that do require atomic instructions are non-blocking, lock-free and provably fast under various producer-consumer object allocation-freeing patterns.
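A minimal sketch of such a decoupled remote-free path, assuming a Treiber-style lock-free LIFO list (the names and layout are hypothetical, not Streamflow’s code):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* The freed memory itself holds the link, so the list costs no space. */
typedef struct free_node {
    struct free_node *next;
} free_node;

typedef struct {
    _Atomic(free_node *) remote_head; /* only field touched by remote frees */
} remote_list;

/* Called by a thread that frees an object it did not allocate:
 * a lock-free push onto the owner's remote list. */
static void remote_free(remote_list *l, void *obj) {
    free_node *n = (free_node *)obj;
    free_node *old = atomic_load_explicit(&l->remote_head,
                                          memory_order_relaxed);
    do {
        n->next = old; /* CAS failure reloads 'old', so relink each retry */
    } while (!atomic_compare_exchange_weak_explicit(
                 &l->remote_head, &old, n,
                 memory_order_release, memory_order_relaxed));
}

/* Called lazily by the owning thread: detach the whole list at once
 * and recycle the objects into the local heap without further atomics. */
static free_node *adopt_remote_frees(remote_list *l) {
    return atomic_exchange_explicit(&l->remote_head, NULL,
                                    memory_order_acquire);
}
```

Because remote threads only push and the owner detaches the entire list with a single exchange, there is no pop of individual nodes and hence no ABA hazard; the common local case pays no atomic instructions at all, matching the cost profile described above.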
Streamflow’s design favors locality at multiple levels. Beyond reducing memory management overhead and latency, decoupling local and remote operations promotes temporal locality by allowing threads to favor locally recycled objects in their private heaps. The use of thread-local heaps reduces allocator-induced false sharing. Removing object headers improves spatial locality within cache lines and page blocks. The integration with a lower-level custom page manager which utilizes superpages [19, 20] avoids allocator-induced cache conflicts via contiguous allocation of page blocks in physical memory, and reduces the activity of the OS page manager, the number of page faults and the rate of TLB misses. Combining these techniques produces a memory allocator that consistently outperforms other multithreaded allocators in experiments with up to 8 threads on a 4-processor system with Hyperthreaded Xeon processors. Streamflow, by design, also adapts well to sequential codes and performs competitively with optimized sequential and application-specific allocators.
This paper makes the following contributions:
We present a new thread-safe dynamic memory manager which bridges the design space between allocators focused on locality and allocators focused on scalability. To our knowledge, this is the first time a memory allocator efficiently unifies locality considerations with multiprocessor scalability.
We present a new method for eliminating (in the common case) and minimizing (in the uncommon case) synchronization overhead in multiprocessor memory allocators. Our method decouples remote and local free lists and uses a new non-blocking remote object deallocation mechanism. This technique preserves the desirable properties of a multiprocessor memory allocator, namely blowup avoidance and false sharing avoidance, without sacrificing the locality and low latency benefits of bump-pointer allocation.
We present memory allocation and deallocation schemes that take into account cache-conscious layout of heaps, page- and TLB-locality. To our knowledge, Streamflow is the first multiprocessor allocator designed with multilevel and multigrain locality considerations.
We demonstrate the performance advantages of our design using realistic sequential and multithreaded applications as well as synthesized benchmarks. Streamflow outperforms four widely used, state-of-the-art multiprocessor allocators in allocation-intensive parallel applications. It also performs comparably to optimized sequential allocators in allocation-intensive sequential applications. Streamflow exhibits solid performance improvements both in codes with mostly local object allocation-freeing patterns and codes with producer-consumer object allocation-freeing patterns. We have experimented with an SMP with four two-way SMT processors1. Such SMPs are popular as commercial server platforms, affordable high-performance computing platforms for scientific problems, and building blocks for large-scale supercomputers. Since Streamflow eliminates (in the common case) or significantly reduces (in the uncommon case) synchronization, the key scalability-limiting factor of multithreaded memory managers, we expect it to be scalable and efficient on larger shared-memory multiprocessors as well.
The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 presents the major design principles, mechanisms and policies of Streamflow. Section 4 presents our experimental evaluation of Streamflow alongside other multiprocessor allocators and some optimized sequential allocators. In Section 5 we discuss some implications of the design of Streamflow and potential future improvements. Section 6 summarizes the paper.
2. Related Work
Streamflow includes elements adopted from efficient sequential memory allocators proposed in the past. Streamflow’s segregated heap storage and BIBOP (big bag of pages)-style allocation derive from an allocation scheme originally proposed by Guy Steele in  and from the concept of independently managed memory zones, which dates back to 1967 . Segregated heap storage has since been used in numerous allocators, including the standard GNU C allocator in Linux , an older GNU allocator , Vmalloc , and more recent allocators such as Reaps  and Vam . Modern allocators tend to adopt segregated heaps because they enable very fast allocation. Deallocation in segregated heap allocators is more intricate, because in order to comply with the semantics of free(), the allocator needs to be able to discover internally the size of each deallocated object, using the object pointer as its only input. Deallocation is simple if each object has a header pointing to the base of the heap segment from which the object was allocated. This technique is used, for example, in the GNU C allocator and in Reaps [4, 16]. However, object headers introduce fragmentation, pollute caches, and eventually penalize codes with many small object allocations. Therefore, locality-conscious allocators such as PHKmalloc  and Vam  eliminate object headers entirely for small objects and use tables of free lists to manage released space in segregated heaps. Elimination of headers is common practice in custom memory allocators , as well as semi-custom allocators
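The headerless size discovery that BIBOP-style allocation enables can be illustrated as follows, assuming power-of-two-sized page blocks aligned to their own size with a small header at the block base; this layout is an assumption for the sketch, and the allocators cited above differ in their details:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative block geometry: 16 KB blocks, aligned to 16 KB. */
#define BLOCK_SIZE  ((size_t)1 << 14)
#define BLOCK_MASK  (~(uintptr_t)(BLOCK_SIZE - 1))

/* One header per block, shared by every object in it. */
typedef struct {
    size_t obj_size;  /* size class of all objects in this block */
} block_header;

/* free(ptr) recovers the object's size with one mask and one load:
 * round the pointer down to the block boundary and read the header.
 * No per-object header is needed. */
static size_t lookup_size(void *obj) {
    block_header *h = (block_header *)((uintptr_t)obj & BLOCK_MASK);
    return h->obj_size;
}
```

The mask works because the block's alignment equals its size, so every object address inside the block rounds down to the same base; this is what lets such allocators implement free() from the object pointer alone while still omitting small-object headers.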
1 This is the largest shared-memory system we have direct access to.