X hits on this document

PDF document

Scalable Locality-Conscious Multithreaded Memory Allocation - page 8 / 11





8 / 11

from 2.6M) and reduces TLB misses by 2%. The larger page size allows the coverage of the address space with significantly fewer pages and, thus, reduces the pressure to the OS page manager and the TLB.

4.2.2 Multithreaded Benchmarks


This is a custom synthetic microbenchmark that stresses

the ability of multithreaded allocators to efficiently perform simul- taneous memory management operations by multiple threads, for objects that are created and destroyed locally by each thread. Each thread allocates a total of 107 objects, the size of which is selected randomly from a given range. The benchmark can simulate differ- ent memory reuse patterns. Each thread deallocates all the objects it has allocated after every rate allocations, rate being a user-provided parameter. Recycle is not expected to scale with more processors (in terms of execution time reduction), since its workload is also scaled

with the number of threads.

Streamflow outperforms glibc by 41%, Hoard by 59%, Michael’s lock-free allocator by 48% and Tcmalloc by 14% in the sequen- tial execution of Recycle. In multithreaded executions Stream- flow performs significantly better than allocators that employ syn- chronization, such as glibc (8%–39%, avg. 25%), Hoard (11%– 54%, avg. 29%) and Michael’s allocator (10%–43%, avg. 22%). Its performance is similar to Tcmalloc, which uses thread-local, synchronization-free memory caches.

Allocators that put synchronization on the critical path of ev- ery operation—even Michael’s allocator, which uses lock-free, non-blocking synchronization—suffer from synchronization la- tency even during thread-local management operations, although these operations could be performed independently by each thread. Streamflow performs thread-local memory allocation and deallo- cation without atomic instructions. Recycle is only sensitive to allocation latency and scalability. Since allocated objects are not accessed in the code, and the memory footprint of the benchmark is rather small, it is insensitive to locality optimizations. As a result, the performance of the three versions of Streamflow is practically indistinguishable.

Larson: This is a benchmark which simulates a multithreaded server [15]. Objects allocated from a given thread are released by another thread. The thread that performs the deallocation is usually spawned after the termination of the thread that performed the al- location. We experimentally found that this is the case in 96.8% of all deallocations. Larson, thus, stresses the page block adop- tion functionality of Streamflow. The allocating thread is still alive in 3.2% of the deallocations, which activates the remote memory object deallocation and reclamation modules of Streamflow. The code runs for a fixed time interval and reports the attained through- put in terms of memory management operations. The number of threads spawned is proportional to the speed of memory allocation and deallocation4. Larson is sensitive to multithreaded allocation latency and scalability, as well as to the performance of the mem- ory recycling mechanisms of the allocator upon thread termination.

Streamflow, with all locality optimizations enabled, outper- forms Hoard by 56% on one thread and by more than 5.5 times on 8 threads (2.5 times on avgerage). The improvements over glibc range from 27% to 7.73 times (4.53 times on avg.). Streamflow achieves 1.9 to 2.6 times higher throughput than Tcmalloc (67% on avg.). Tcmalloc outperforms Streamflow only in one case, when Larson is executed sequentially. Finally, Streamflow proves 78% to 5.6 times (3.5 times on avg.) more efficient than Michael’s lock- free allocator. Note that Streamflow scales almost linearly with

4 We set Larson execution time to a relatively small value in order to limit the number of spawned threads below the threshold which triggers the fork- bomb protection mechanism in the operating system.

the number of threads in Larson, a particularly desirable property for real-world multithreaded server applications. The main reason for Streamflow performance is the efficient page block adoption strategy upon thread termination. A page block is adopted—with a non-blocking atomic operation—by the first thread that deallo- cates an object originating from it. Further deallocations to the page block from that thread are treated as local ones. Thus, they do not suffer even the minimal overhead of lock-free enqueueing to the remotely freed queue of the parent page block.

Objects in Larson are merely allocated and deallocated, being accessed only once, immediately after their allocation. However, the elimination of object headers enhances the spatial locality, especially at the page-level. Minor page faults are reduced (for the 8 threads execution) from 51K to 3.2K, resulting in a throughput improvement of 15%.

Consume: This is a synthetic microbenchmark from the Hoard distribution. It simulates produced-consumer applications, in which memory objects are allocated from one thread and are used and deallocated by other threads. The producers and consumers live simultaneously in the system. A single producer thread allocates n blocks of memory, each of which is then deallocated by one of the n different consumer threads. Memory allocations for a block can be performed simultaneously with deallocation of memory objects from other blocks. The number of threads, the size of the blocks and the number of allocation/deallocation rounds are specified by the user. Consume stresses the efficiency of remotely freeing memory through non-blocking atomic operations and the efficiency of lazy memory reclamation in Streamflow. The single producer thread is the main performance bottleneck of the application. As a result, the execution time of Consume is expected to increase almost linearly with the number of consumer threads. Moreover, since memory objects are simply allocated and deallocated, locality optimizations cannot be expected to have any effect.

Streamflow performs 25% to 1.3 times (avg. 77%) better than glibc and 60% to 8.7 times (avg. 5.1 times) better than Hoard. It also outperforms Michael’s lock-free allocator by 74% to 2.7 times (avg. 1.8 times) and Tcmalloc by 4% to 3.7 times (avg. 2 times).

Multithreaded allocators based on locks, such as glibc and Hoard must acquire and release at least one lock per deallocation operation. Tcmalloc and Michael’s allocator minimize the effects of producer-consumer memory usage patterns by using thread-local caches and atomic, lock-free operations respectively. In the case of Streamflow, each remote memory object deallocation is performed at the cost of a single, non-blocking, atomic operation. Moreover, the lazy memory reclamation strategy amortizes the cost of re- claiming the freed memory to that of a single atomic operation for all the objects in the remotely freed queue of the page block.

Knary: Hood benchmark which builds trees of arbitrary depth and fan-out degree and associates a user-defined amount of work per tree node [1]. Knary is sensitive to allocation latency and scal- ability, but not to locality, because the work performed per gener- ated tree node is typically small. It stresses the ability of allocators to serve simultaneous, mostly thread-local memory allocation and deallocation requests by multiple threads.

Streamflow outperforms glibc by 70% to 1.5 times (avg. 88%). It also provides significant performance improvements over Hoard, Tcmalloc and Michael’s allocator (64%, 73% and 76% on average respectively). Tcmalloc performs similarly with Streamflow only when Knary is executed sequentially.

The performance improvements can be attributed directly to the design of Streamflow, which minimizes—and in most cases eliminates— synchronization between threads performing simul- taneous memory management operations.

Document info
Document views21
Page views21
Page last viewedFri Oct 21 23:58:47 UTC 2016