Table 2. Maximum virtual memory footprint of the benchmarks (in MB) when executed with different multithreaded allocators. Multithreaded benchmarks are executed with 8 threads. In the case of Larson we also report the memory required just for thread stacks in each case. 197.parser crashes when executed with Hoard, so no footprint value is reported.
sured by querying the /proc filesystem⁵ for each process' total virtual memory consumption every tenth of a second for the lifetime of the application. The value reported is the maximum observed for each application/memory allocator pair.
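The polling described above can be sketched as follows. This is a hypothetical reconstruction of the measurement harness, not the paper's actual tooling; the helper names (`parse_vmsize_kb`, `max_footprint_kb`) are ours, and the `VmSize` field is the standard entry in `/proc/<pid>/status` on Linux.

```python
import re

def parse_vmsize_kb(status_text):
    """Extract the VmSize field (in KB) from a /proc/<pid>/status dump.

    Returns None if the field is absent.
    """
    m = re.search(r"^VmSize:\s*(\d+)\s*kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else None

def max_footprint_kb(samples):
    """The value reported in Table 2: the maximum over all 0.1 s samples."""
    sizes = [parse_vmsize_kb(s) for s in samples]
    return max(s for s in sizes if s is not None)
```

A real harness would read `/proc/<pid>/status` in a loop every tenth of a second and feed the snapshots to `max_footprint_kb`.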
With the exception of Larson, Streamflow achieves virtual memory footprints smaller than glibc and comparable to those of the other allocators. Larson continuously generates threads that perform a constant number of allocation and deallocation operations, spawn new threads, and then terminate without ever being joined. Since Larson runs for a fixed time period, the number of threads spawned by the application is proportional to the achieved rate of allocation and deallocation operations. Tracing the system calls performed by the application revealed that before each thread generation, 513 memory pages (2052 KB) are allocated for the thread's stack. The system call trace also revealed that, as expected, since threads are never joined, their stacks are never freed and reused. As a result, the virtual memory footprint of the application is dominated by thread stacks. In fact, the virtual memory footprint grows monotonically during the execution of the application, at a rate linearly related to the throughput of memory management operations achieved by each multithreaded memory manager. Table 2 reports, for Larson, both the total maximum virtual memory footprint of the application and the maximum virtual memory footprint of thread stacks. The latter is calculated by multiplying the total number of threads generated by the application by 2052 KB (the stack size).
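The stack-footprint arithmetic above is simple enough to state directly. The constants come from the system-call trace (513 pages per stack, 4 KB base pages); the function name is our own shorthand for the Table 2 calculation.

```python
# Back-of-the-envelope arithmetic for Larson's thread-stack footprint:
# each spawned thread receives 513 pages (2052 KB with 4 KB pages) of
# stack that is never freed, because threads are never joined.
PAGE_KB = 4
STACK_PAGES = 513
STACK_KB = STACK_PAGES * PAGE_KB   # 2052 KB per thread

def stack_footprint_mb(threads_spawned):
    """Maximum virtual-memory footprint of thread stacks, in MB."""
    return threads_spawned * STACK_KB / 1024.0
```

Because the thread count is proportional to allocator throughput, a faster allocator makes Larson spawn more threads and therefore shows a *larger* footprint, which is why the thread-stack column is reported separately.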
It is worth noting that Streamflow performs well even with Consume, which is specifically designed to stress multithreaded allocators that use thread-local heaps. Allocators that use strictly thread-local heaps are vulnerable to memory blowup under producer-consumer memory usage patterns.
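The blowup mechanism can be made concrete with a toy model (our construction, not code from Streamflow): a producer allocates blocks that a consumer frees. With strictly thread-local heaps the freed blocks land on the consumer's heap, where the producer cannot reuse them, so the producer must keep requesting fresh memory; if remote frees are eventually returned to the owning heap, as in Streamflow, the working set stays constant.

```python
def simulate(rounds, remote_frees_returned):
    """Toy producer/consumer model with per-thread heaps.

    Each round the producer allocates one block and the consumer frees it.
    Returns the number of blocks ever requested from the OS.
    """
    producer_free = []          # blocks the producer can reuse
    total_mapped = 0            # blocks ever requested from the OS
    for _ in range(rounds):
        if producer_free:
            block = producer_free.pop()
        else:
            total_mapped += 1   # no reusable block: map a fresh one
            block = object()
        # the consumer now frees the block:
        if remote_frees_returned:
            producer_free.append(block)   # returned to the owning heap
        # else: stranded on the consumer's heap, unusable by the producer
    return total_mapped
```

In this model the strictly-local policy maps one block per round (memory grows linearly with work done), while returning remote frees keeps the total at a single block.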
5. Discussion and Future Directions
Streamflow uses superpages as a tool to avoid cache conflicts through the allocation of page blocks directly in physical memory. The use of superpages also provides the necessary infrastructure to investigate cache-color-aware placement of page blocks and demonstrate the potential of multilevel locality optimizations within a scalable memory allocator. However, imposing the use of superpages in all programs has certain disadvantages. Some of the most important ones are severe fragmentation for small programs and unjustified memory pressure, which may occur in a multiprogrammed system in which some of the programs make extensive use of superpages but utilize little space within each page. One way to address these problems is to leverage operating system support for dynamic superpage management.

⁵ /proc is a virtual filesystem available on most UNIX-like operating systems that exposes information from the OS kernel to user level at runtime.
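The key property superpages give the allocator is that base pages within a superpage are physically contiguous, so their cache colors are known at user level. A sketch of the color arithmetic (our own illustration; the cache parameters are assumed, not Streamflow's):

```python
# Cache-color arithmetic for a physically indexed cache.
# Assumed parameters: 4 KB base pages, `cache_kb` KB of cache, `assoc` ways.
def num_colors(cache_kb, assoc, page_kb=4):
    """Number of distinct page colors: cache size per way / page size."""
    return (cache_kb // assoc) // page_kb

def page_color(frame_number, cache_kb, assoc, page_kb=4):
    """Color of a physical page frame."""
    return frame_number % num_colors(cache_kb, assoc, page_kb)

def superpage_colors(base_frame, pages, cache_kb, assoc, page_kb=4):
    """Colors of each base page inside one physically contiguous superpage.

    Because frame i is base_frame + i, the colors are fully determined,
    which is what lets the allocator place page blocks conflict-free.
    """
    return [page_color(base_frame + i, cache_kb, assoc, page_kb)
            for i in range(pages)]
```

With ordinary 4 KB pages the physical frame behind a virtual page is unknown to user level, so no such placement is possible without kernel help.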
Although Streamflow provides support for relinquishing page blocks back to the operating system, it does not do so adaptively, as a reaction to memory pressure. Extending Streamflow with mechanisms and policies to detect memory pressure and proactively release memory to prevent thrashing is left as future work.
Streamflow was designed under the assumption that dynamic feedback such as actual object sizes and lifetimes is not available to the allocator [11, 22]. Such profiles enable customizations, such as reap-style deallocation of short-lived objects, or object segregation based on access frequency and object lifetime. In general, profiling information has not been used so far in multiprocessor memory allocators, and it is a path we would like to explore in the near future. Profiling may prove useful for customizing Streamflow's allocation and deallocation policies to more aggressively exploit specific types of locality, such as locality in streams of accesses to objects from different classes.
As multicore and simultaneous multithreading processors become commonplace, it is important to consider their implications for multithreaded memory allocation. Some of the related considerations have been discussed in prior work. The main challenge for a locality-conscious allocator on chip multiprocessors is making good use of a large shared on-chip cache. The fact that threads can share data through a cache requires the allocator to customize its page block management policies so that page blocks belonging to different threads running on the same processor are allocated contiguously and conflict-free, if possible. Streamflow's design enables this optimization, pending the addition of feedback from the operating system so that the allocator becomes aware of the placement of threads on execution cores at runtime.
6. Conclusions

Multiprocessor memory allocators have so far capitalized on scalability. Optimized sequential allocators, on the other hand, place the emphasis on locality. In this paper we have presented Streamflow, a high-performance, low-overhead, thread-safe memory allocator that is also designed to favor locality at several levels of the memory hierarchy.
Streamflow's design decouples local and remote operations in order to eliminate synchronization for most memory allocation operations, while still avoiding the memory blowup from which strictly thread-local heaps suffer. To further reduce latency, all synchronization operations are non-blocking and lock-free. This scalable and locality-conscious design enables Streamflow to perform comparably to optimized sequential allocators, yet usually significantly faster than other multiprocessor allocators. These properties are consolidated in a unified segregated heap design. Streamflow also improves cache-, TLB-, and page-level locality via careful layout of heaps in memory, careful reuse of freed objects, and the exploitation of superpages. Put together, these properties make Streamflow an attractive unified framework for sequential and parallel memory allocation and a useful tool for taming the ever-increasing memory latencies in codes that rely heavily on dynamic memory allocation.
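The lock-free lists mentioned above are LIFO lists updated with compare-and-swap. A single-threaded Python model of the retry loop (CAS is simulated here, since Python exposes no user-level CAS primitive; all class and function names are ours, and ABA protection is omitted for brevity):

```python
class Cell:
    """Holds a list head; cas() models a hardware compare-and-swap word."""
    def __init__(self):
        self.value = None
    def cas(self, expected, new):
        if self.value is expected:
            self.value = new
            return True
        return False

class Node:
    def __init__(self, payload):
        self.payload = payload
        self.next = None

def push(head, node):
    """Lock-free LIFO push: link to the observed head, retry CAS on failure."""
    while True:
        old = head.value
        node.next = old
        if head.cas(old, node):
            return

def pop_all(head):
    """Detach the entire list in one CAS, as an owner adopting remote frees would."""
    while True:
        old = head.value
        if head.cas(old, None):
            return old
```

Remotely freed objects are pushed by any thread with `push`; the owning thread later reclaims the whole batch with a single `pop_all`, so the common-case local allocation path never takes a lock.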
The design space for locality-conscious multiprocessor memory allocators is vast. Streamflow represents a realistic point in this design space and a step in the direction of composing adaptive memory allocators with sufficient self-customization capabilities for multiple design goals, such as locality and parallelism.