(v), the synchronization is performed using a single non-blocking atomic compare-and-swap (cmp&swap) operation.
3.1.2 Large Object Management
The management of large objects is significantly simpler than that of small objects. Large object requests are forwarded directly to the operating system. After memory is allocated from the system, the BIBOP table is updated to indicate that the corresponding virtual pages accommodate a large object. Finally, the object is prefixed with a header that contains the object size and the object is returned to the application.
Similarly, if the BIBOP table lookup during a deallocation identifies an object as large, its header is recovered from the 8 bytes immediately preceding the object’s base address. As soon as the object size is determined, the memory occupied by the object is returned to the operating system.
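The header-based scheme above can be sketched in a few lines of C. This is a minimal illustration only: `large_alloc` and `large_free` are hypothetical names, `malloc()`/`free()` stand in for direct operating-system requests, and the BIBOP table update is omitted.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch of large-object handling: each object is prefixed with an
 * 8-byte header holding its size, so deallocation can recover the
 * size from the 8 bytes right before the object's base address. */

static void *large_alloc(size_t size) {
    /* Reserve room for the 8-byte size header before the object. */
    uint64_t *mem = malloc(size + sizeof(uint64_t));
    if (!mem) return NULL;
    mem[0] = (uint64_t)size;      /* store the object size in the header */
    return (void *)(mem + 1);     /* return the address past the header */
}

static void large_free(void *obj) {
    uint64_t *header = (uint64_t *)obj - 1;  /* 8 bytes before the base */
    size_t size = (size_t)header[0];
    (void)size;  /* a real allocator would pass this size to munmap() */
    free(header);
}
```

In the actual allocator the memory would come from and return to the operating system directly, with the BIBOP table marking the pages as holding a large object.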
3.2 Page Manager
The page manager implements page block allocations and deallocations as needed by the multithreaded memory allocator in the front-end of Streamflow. Its functionality is threefold: i) it allocates and deallocates physical memory from/to the operating system, using superpages as the unit of allocation; ii) it allocates page blocks and manages space within superpages to achieve contiguous allocation of each page block in physical memory; iii) it optimizes the placement of multiple page blocks within superpages to avoid cache conflicts within and between page blocks residing in the same superpage.
Most modern processors provide support for multiple page sizes. For example, Intel’s Itanium 2 provides eleven page sizes between 4 KB and 4 GB, Alpha processors provide four page sizes between 8 KB and 4 MB, while the IBM Power4/Power5, Intel Xeon and UltraSPARC processors provide two page sizes: a small page size of 4 or 8 KB and a large page size of 4 MB. We use the term superpages to refer to pages of size larger than the smallest page size on a given architecture.
Superpages enable the coverage of large regions of the virtual address space with a small number of pages and TLB entries. Therefore, they can improve performance by reducing paging and TLB misses. The use of superpages can be particularly beneficial on simultaneous multithreading (SMT) processors, where more than one thread shares a common TLB and the typically few TLB entries become a contested resource. More importantly, superpages enable contiguous allocation of large regions of the virtual address space in physical memory. Contiguous allocation of large blocks of virtual memory often improves cache performance on processors with large, physically indexed L2 caches, by eliminating or reducing interference in the cache within and between page blocks.
Streamflow’s page manager is implemented on top of Linux 2.6, which provides support for superpages via a virtual filesystem. Streamflow allocates superpages by creating files in the virtual filesystem and mapping these files, in whole or in part, to virtual memory. An allocated superpage is uniquely identified by the virtual file which backs the page and its disposition within this file.
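The file-backed superpage allocation can be sketched as follows. The mount point `/mnt/huge`, the file name, and the 4 MB superpage size are assumptions for illustration; if no hugetlbfs mount is available, the sketch falls back to an ordinary anonymous mapping so it still runs.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define SUPERPAGE_SIZE (4UL * 1024 * 1024)  /* e.g. 4 MB on x86 */

/* Sketch: obtain a superpage by creating/opening a file in the
 * hugetlbfs virtual filesystem and mapping part of it; the offset
 * ("disposition") identifies the superpage within the backing file. */
static void *alloc_superpage(off_t disposition) {
    int fd = open("/mnt/huge/streamflow_sp", O_CREAT | O_RDWR, 0600);
    if (fd >= 0) {
        void *p = mmap(NULL, SUPERPAGE_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, disposition);
        close(fd);
        if (p != MAP_FAILED) return p;
    }
    /* Fallback for systems without a hugetlbfs mount: plain pages. */
    return mmap(NULL, SUPERPAGE_SIZE, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```

The pair (backing file, disposition) is exactly what uniquely identifies an allocated superpage in the scheme described above.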
The page manager associates a “header” data structure with each superpage. The collection of superpage headers holds all necessary information for the management of different superpages, as well as for the management of space inside each superpage. Superpage headers reside in page blocks which are dynamically allocated from the operating system. The management of page blocks that store superpage headers is almost identical to the management of page blocks used for small objects (described in Section 3.1.1). The main difference is that, since a page block with superpage headers is global and accesses to it are protected by a global page manager lock, functionality related to remotely freed objects, orphaned page blocks and page block adoption is not necessary. Moreover, page blocks with superpage headers do not need to be freed, since the space they occupy is negligible. As a result, the data structures used for the management of page blocks with superpage headers do not need to be replicated with each page block.
Each superpage header includes the disposition of the superpage in the virtual file which backs it. The page manager returns superpages to the operating system when all page blocks within a superpage are freed. Whenever a superpage is returned to the operating system (via munmap()), its header is recycled to the freed LIFO list of superpage headers; however, the disposition of the superpage is preserved in the header. As a result, the page manager can easily identify the dispositions of unmapped superpages inside a file, just by reusing recycled headers. The superpage headers also include prev and next pointers for linking superpages in lists, a pointer to the base virtual address of the superpage (sp_base), the size (as a power of 2) of the largest contiguous free memory block inside the superpage (largest_free_order), as well as some bitmaps necessary for managing space inside the superpage.
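A possible layout of such a header, following the fields named above, might look like the struct below. The exact types and the bitmap width are assumptions for illustration, not Streamflow’s actual definition.

```c
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical superpage header layout, mirroring the fields the
 * text describes: list linkage, base address, backing-file offset,
 * largest free block order, and space-management bitmaps. */
typedef struct superpage_header {
    struct superpage_header *prev, *next;  /* linkage for superpage lists */
    void     *sp_base;             /* base virtual address of the superpage */
    off_t     disposition;         /* offset of the superpage in its backing file */
    unsigned  largest_free_order;  /* log2 of the largest contiguous free block */
    uint64_t  buddy_bitmap[4];     /* free/used state of blocks in the superpage */
} superpage_header_t;
```

Because the header survives (on the freed LIFO list) after the superpage itself is unmapped, the `disposition` field is what lets the page manager later reuse the same region of the backing file.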
Allocated superpages are organized in a hash table, indexed with the size (as a power of 2) of the requested page block. Using this hash table, the page manager can easily search for “best-fit” superpages, namely superpages where the largest contiguous free block is as close as possible to the size of the requested block.
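The best-fit search over this hash table reduces to scanning buckets upward from the requested order until a non-empty one is found. A minimal sketch, with assumed names and bucket count:

```c
#include <stddef.h>

#define MAX_ORDER 23  /* assumed: buckets for blocks up to 2^22 bytes */

/* heads[i] points to a list of superpages whose largest contiguous
 * free block has order i (i.e. size 2^i pages). A request of order
 * `order` scans upward from bucket `order` and takes the first
 * non-empty bucket: the tightest fit that can still satisfy it. */
static void *find_best_fit(void *heads[MAX_ORDER], unsigned order) {
    for (unsigned i = order; i < MAX_ORDER; i++)
        if (heads[i])
            return heads[i];
    return NULL;  /* no existing superpage can satisfy the request */
}
```

If the scan returns NULL, the page manager would fall back to allocating a fresh superpage from the operating system.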
Streamflow’s page manager allocates memory within each superpage using a buddy allocator [13, 14]. The buddy allocator tends to reduce memory fragmentation inside each superpage, while at the same time being faster than first-, next-, and best-fit allocators.
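The core of any buddy allocator is the address arithmetic that pairs blocks for splitting and merging; the sketch below shows only that principle, not Streamflow’s full implementation.

```c
#include <stdint.h>

/* In a buddy allocator, a free block of order k at offset `off`
 * inside the superpage has its "buddy" at off XOR 2^k. Two free
 * buddies can be coalesced into one block of order k+1 starting at
 * the lower of the two offsets. */
static uintptr_t buddy_of(uintptr_t offset, unsigned order) {
    return offset ^ ((uintptr_t)1 << order);
}

static uintptr_t merged_offset(uintptr_t a, uintptr_t b) {
    return a < b ? a : b;  /* merged block starts at the lower offset */
}
```

Allocation splits a larger block in half repeatedly (each half becoming the other’s buddy) until a block of the requested order is produced; deallocation merges buddies back as long as both are free.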
We evaluated Streamflow on a 4-processor Dell PowerEdge 6650 server, with Hyperthreaded Intel Xeon processors clocked at 2.0 GHz. Hyperthreaded Intel processors can execute up to 2 threads simultaneously. Each processor has a 4-way associative 8 KB L1 data cache, a 12 KB instruction trace cache, a 512 KB 8-way associative L2 cache and an external 1 MB L3 cache. The system has 2 GB of RAM and runs SuSE Linux 9.1 with a 2.6 kernel and glibc 2.3.3.
To compare the performance of Streamflow against other multithreaded memory allocators, we evaluated the performance of Hoard (version 3.3.0), Tcmalloc from Google’s performance tools (version 0.4), our 32-bit implementation of Maged Michael’s lock-free allocator and the thread-safe allocator of glibc in Linux, which is based on Doug Lea’s memory allocator with extensions for thread safety implemented by Wolfram Gloger. Hoard, Tcmalloc and glibc use local heaps with multiple object size classes. Hoard, Tcmalloc and Michael’s allocator use a minimum object granularity of 8 bytes. The glibc allocator uses a minimum object granularity of 16 bytes. Tcmalloc is the only multithreaded allocator besides Streamflow that does not use object headers for small objects. We also compared the performance of Streamflow for a sequential application with that of glibc and Vam. Interestingly enough, the glibc allocator uses a more efficient, non-thread-safe implementation of malloc() and free() if it detects that the code is not multithreaded. Streamflow, on the other hand, adapts to sequential codes as a consequence of its design, which completely offloads synchronization from the critical path of sequential allocation. Vam is an optimized, strictly sequential memory allocator that targets the improvement of application locality at both the cache and the page level. Vam uses fine-grain object size classes, headerless objects and reap-style allocation.