Table 1. Benchmarks used to evaluate Streamflow.
[Table layout lost in extraction. Recoverable entries: 197.parser (SPECINT2000 English parser, sequential); Recycle (synthetic benchmark with variable object recycling frequency); Larson (multithreaded server simulator); Consume (producer-consumer benchmark); Knary (Hood implementation of a parallel tree-building benchmark); Barnes (Hood implementation of the Barnes-Hut algorithm: 131072 bodies, d=0.025, 10 iterations); MPCDM (multithreaded mesh generation application). Other recoverable parameter fragments: "107 objs./thread", "8b objects, rate=1000", "5-seconds run", "6000 objs/block", "5000 iterations". Columns included avg. object size (< 2KB) and remote frees (%).]
Table 1 summarizes the benchmarks we used to evaluate Streamflow. The table includes the total number of small objects (smaller than 2KB) allocated in each benchmark, the number of large objects, the average size of small objects, and the percentage of remotely freed objects in the multithreaded benchmarks. It is evident that the vast majority of objects are deallocated by the same thread that allocated them, even in codes in which objects are accessed by multiple threads. The only exceptions are the Consume and Larson benchmarks, which are explicitly designed to stress the allocators under extreme remote memory deallocation and reclamation conditions. This property puts Streamflow at an advantage against other allocators, thanks to its design that removes the synchronization overhead from almost all allocations and deallocations.
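The owner-local free path that makes this possible can be sketched as follows. This is an illustrative C fragment, not Streamflow's actual code; the names (`page_block`, `block_free`, the two free lists) are hypothetical. A free() issued by the owning thread takes a synchronization-free local path, while a free() from any other thread pushes the object onto a CAS-based remote list:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <pthread.h>

/* Illustrative sketch only; names and layout are hypothetical. */
struct obj { struct obj *next; };

struct page_block {
    pthread_t owner;                   /* thread that allocates from this block */
    struct obj *local_free;            /* touched only by the owner: no atomics */
    _Atomic(struct obj *) remote_free; /* touched by other threads via CAS */
};

static void block_free(struct page_block *b, struct obj *o) {
    if (pthread_equal(pthread_self(), b->owner)) {
        /* Common case: same-thread free, no atomic instructions. */
        o->next = b->local_free;
        b->local_free = o;
    } else {
        /* Remote free: lock-free push onto the block's remote list. */
        struct obj *head = atomic_load(&b->remote_free);
        do {
            o->next = head;
        } while (!atomic_compare_exchange_weak(&b->remote_free, &head, o));
    }
}
```

Since Table 1 shows remote frees are rare, nearly all deallocations take the first branch.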
The results of our experiments are summarized in Figures 2 and 3. We report execution times for 197.parser, Recycle, Consume, Knary, Barnes and MPCDM. Larson runs for a fixed time interval, so we report its throughput, measured in memory management operations per microsecond. For the sequential 197.parser we report the execution time of both the optimized non-thread-safe and the thread-safe glibc allocator implementation. For 197.parser, we also report execution times for the custom, hard-coded allocator of the benchmark and Vam. Note that we bypass the custom allocator of the benchmark when we measure the performance of Streamflow and the other general-purpose allocators. This means that all allocations and deallocations are directed to the malloc() and free() calls of the general-purpose allocators rather than to the xalloc() and xfree() calls of the custom allocator of the benchmark. For all benchmarks we also report the performance of the four thread-safe allocators (Streamflow, Michael’s, Hoard and Tcmalloc). Specifically for Streamflow, we provide three data points: one for a base implementation which performs decoupling of local allocation from remote deallocation but no further locality optimizations (labeled “Streamflow headers” in the charts); a second from an improved implementation which eliminates object headers, in addition to decoupling, to improve spatial locality and cache usage (labeled “Streamflow wo headers” in the charts); and a third from a complete implementation which includes the superpage manager and page block layout optimizations (labeled “Streamflow superpages” in the charts).
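Bypassing a benchmark's custom allocator can be done by forwarding its entry points to the general-purpose allocator under test. The following sketch assumes the custom entry points are named xalloc() and xfree() with malloc-like signatures (the actual signatures in 197.parser may differ):

```c
#include <stdlib.h>

/* Hypothetical redirection shims: every call to the benchmark's custom
 * allocator entry points is forwarded to the general-purpose allocator
 * being measured (whichever library provides malloc/free at link time). */
void *xalloc(size_t size) {
    return malloc(size);
}

void xfree(void *ptr) {
    free(ptr);
}
```

Linking the benchmark against these shims (instead of its own allocator) ensures that every allocation and deallocation exercises the allocator under evaluation.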
Results and Analysis
197.parser is an English language parser benchmark from SPECINT2000 which includes a custom memory allocator. This custom allocator works well for objects allocated and deallocated in a stack-like fashion. 197.parser spends more than 40% of its execution time in memory allocation and stresses the efficiency of object placement in memory as well as the ability of thread-safe allocators to provide fast sequential allocation. Most thread-safe allocators suffer in this aspect because they impose synchronization overhead, due to unnecessary execution of atomic instructions, even though there is no contention between threads in a sequential program. 197.parser is also representative of many applications that use custom allocators for higher performance [4, 7].
[Figure 2: bar chart of execution time (sec.) for Custom sequential, Vam, Glibc, Streamflow headers, Streamflow wo headers, Streamflow superpages, Tcmalloc, Michael, and Glibc (MT).]
Figure 2. Execution time (lower is better) attained by different allocators for 197.parser.
The performance of Streamflow with all locality optimizations is within 8% of the performance of the custom allocator in 197.parser and within 5% of the performance of Vam. Both the custom allocator and Vam apply bump-pointer memory allocation. Moreover, the custom allocator changes the semantics of free() to provide the memory object size to the deallocation function. Streamflow performs 7% faster than the optimized, sequential glibc allocator.
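Bump-pointer allocation and size-aware deallocation can be illustrated with a minimal sketch. The names and the LIFO-only free are ours, not taken from 197.parser's allocator or Vam; the point is that allocation reduces to a pointer increment, and passing the size to the free function removes the need for a per-object header:

```c
#include <stddef.h>

/* Minimal bump-pointer arena (illustrative; names are hypothetical). */
#define ARENA_SIZE (1 << 20)

static unsigned char arena[ARENA_SIZE];
static size_t bump = 0;

static void *bump_alloc(size_t size) {
    size = (size + 7) & ~(size_t)7;   /* round up to 8-byte alignment */
    if (bump + size > ARENA_SIZE)
        return NULL;                   /* arena exhausted */
    void *p = &arena[bump];
    bump += size;                      /* allocation is one increment */
    return p;
}

/* Size-aware free in the spirit of the benchmark's modified free():
 * the caller supplies the size, so no header is stored per object.
 * In this simplified sketch only the most recent object (stack-like
 * LIFO order) can be released. */
static void bump_free_last(size_t size) {
    size = (size + 7) & ~(size_t)7;
    bump -= size;
}
```

This pattern matches 197.parser's stack-like allocation behavior noted above, which is why the custom allocator and Vam perform so well on it.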
Streamflow outperforms the thread-safe glibc allocator by 36% and Michael’s allocator by 40%, because the latter two suffer the unnecessarily high (although contention-free) overhead of atomic instructions. 197.parser crashes when executed with Hoard, and as a result a comparison with Hoard is not possible.
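The overhead at issue can be shown in miniature. In the sketch below (illustrative C, not code from any of the allocators), a fast path built on an atomic read-modify-write executes that instruction on every call, even when a single thread runs with no contention at all, whereas a thread-local fast path needs no atomics:

```c
#include <stdatomic.h>

/* Illustrative fast paths; both produce the same result, but the first
 * pays for an atomic instruction on every call even when uncontended. */
static atomic_long shared_counter = 0;   /* e.g. a shared free-list cursor */
static long local_counter = 0;           /* e.g. a thread-local cursor */

static void atomic_fastpath(void) {
    atomic_fetch_add(&shared_counter, 1);  /* atomic RMW on every call */
}

static void local_fastpath(void) {
    local_counter++;                       /* plain load/add/store */
}
```

In a sequential run like 197.parser, both paths are correct, but only the second avoids the fixed per-operation cost of the atomic instruction.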
The elimination of object headers results in a 15% performance improvement over the base implementation of Streamflow. Headerless objects contribute to better spatial locality at both the cache- and page-level. Minor page faults, for example, drop from 3.8M to 2.6M. Streamflow without headers is 8.5% faster than Tcmalloc, which also uses headerless objects. The cost for a common-case, uncontested memory allocation is 200 cycles for Streamflow, compared with 214 cycles for Tcmalloc.
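The space effect of headers can be sketched as follows (sizes are illustrative, not Streamflow's actual layout). With an 8-byte header attached to each 8-byte object, half of every 64-byte cache line holds metadata; when the uniform object size is recorded once in the page block's metadata instead, objects pack twice as densely:

```c
#include <stdint.h>
#include <stddef.h>

/* Header-carrying object: each allocation stores its own metadata. */
struct with_header {
    uint64_t header;            /* e.g. object size or owner info */
    unsigned char payload[8];   /* the data the program actually uses */
};

/* Headerless object: the page block records the object size once for
 * all objects in the block, so each object is just its payload. */
typedef unsigned char headerless_obj[8];
```

For 8-byte objects, a 64-byte cache line holds four header-carrying objects but eight headerless ones, which is the spatial-locality gain behind the page-fault and cycle-count reductions reported above.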
The use of superpages yields an additional 5% execution time improvement over the base Streamflow implementation. It also almost completely eliminates minor page faults (just 713, down