Scalable Locality-Conscious Multithreaded Memory Allocation

[Figure 3 plots. Panels: Recycle, Larson, Consume, Knary, Barnes, MPCDM. X-axis: Threads (1 to 8). Y-axis: Execution Time (sec.) for all panels except Larson, which reports Throughput (Mops/sec). Series: Streamflow headers, Streamflow w/o headers, Streamflow superpages, Hoard, Tcmalloc, Michael, Glibc.]

Figure 3. Execution time (lower is better) or throughput (higher is better) attained by different allocators.

Barnes: Hood implementation of the N-body Barnes-Hut force calculation algorithm [1]. Barnes has only limited sensitivity to allocation latency, particularly during the first iteration (time step) of the code, in which the main application data structures are created and initialized. The benchmark provides limited opportunities for exploiting spatial and temporal locality.

Streamflow improves the execution time of Barnes by 4.9% on average over glibc, 3.6% over Tcmalloc, and 4.3% over Hoard. Barnes, however, is the only application in which Michael’s lock-free allocator outperforms Streamflow, by 2.6% to 4.6% (3.6% on average).

The low intensity of memory management operations limits the performance improvements that can be attained by using different memory allocators. It should be noted, though, that the use of superpages by Streamflow yields a 2% performance improvement and a 13% reduction in minor page faults (from 9.1K to 7.9K).
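The text does not say how the minor page fault counts were collected. As a minimal sketch of one way such counts can be observed on a Unix system, the snippet below reads the process's `ru_minflt` counter through Python's standard `resource` module before and after touching freshly mapped pages. The helper names `minor_faults` and `touch_pages` are hypothetical and are not part of Streamflow or of the paper's methodology.

```python
import mmap
import resource


def minor_faults():
    """Return the minor page faults charged to this process so far (Unix only)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt


def touch_pages(n_pages, page_size=mmap.PAGESIZE):
    """Map n_pages of anonymous memory and write one byte to each page.

    Hypothetical helper: the first write to each fresh page is what
    typically triggers a minor fault.
    """
    buf = mmap.mmap(-1, n_pages * page_size)
    for offset in range(0, n_pages * page_size, page_size):
        buf[offset] = 1
    return buf


if __name__ == "__main__":
    before = minor_faults()
    buf = touch_pages(256)
    after = minor_faults()
    print("minor faults while touching 256 pages:", after - before)
    buf.close()
```

The exact delta depends on the operating system, the page size, and whether large pages back the mapping, so the sketch prints the measured value rather than asserting a particular count.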

MPCDM: This is a guaranteed-quality multithreaded mesh generator based on the Delaunay method [5]. For realistic problem sizes, it allocates hundreds of millions of small objects (35 bytes on average) which represent triangles and points in a mesh. The algorithm deletes triangles that do not satisfy quality criteria set by the user, as well as some of their neighbors. The resulting empty area is then re-triangulated. After re-triangulation, the area contains at least as many triangles as were deleted. As a result, the memory footprint of the application increases monotonically. The application offers opportunities for temporal and spatial locality optimizations, stresses allocator memory reuse, and is sensitive to memory operation latency and allocator scalability. MPCDM’s scalability is limited by the frequent synchronization between its threads. It can serve as a case study of the extent of benefits that can be attained by efficient memory allocators in the presence of other bottlenecks unrelated to memory management.

The fully optimized Streamflow implementation outperforms glibc by 18% to 45% (32% on avg.). The improvements against Hoard and Michael’s lock-free allocator range between 12% and 50% (22% and 42% on average respectively). Streamflow also performs up to 88% better than Tcmalloc (36% on average). It is clear that Streamflow can benefit complex scientific applications with intense memory management requirements, such as MPCDM. Especially in the presence of frequent, application-induced synchronization operations, Streamflow’s mostly synchronization-free design practically eliminates additional, allocator-induced contention between threads.

The elimination of headers allows more small objects to be placed inside a single memory page. It thus favors spatial locality at the page level, reducing minor page faults by 49% (from 247K to 127K) in the 8-thread execution. This is reflected in a 4% performance improvement over the base Streamflow implementation. The use of superpages has similar effects. Minor page faults are limited to just 888 and performance improves by 6% compared with the base implementation.
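To make the header argument concrete, here is a back-of-the-envelope sketch of how many objects fit in one page with and without a per-object header. The 35-byte object size is the MPCDM average quoted earlier in the text. The 4096-byte page and the 8-byte header are assumptions chosen only for illustration, and the calculation ignores alignment and size-class rounding, so it does not describe Streamflow's actual layout.

```python
def objects_per_page(page_size, object_size, header_size=0):
    """Number of whole objects that fit in one page.

    Each object occupies object_size + header_size bytes; alignment and
    size-class rounding are deliberately ignored in this sketch.
    """
    return page_size // (object_size + header_size)


PAGE_SIZE = 4096   # assumed page size in bytes
OBJECT_SIZE = 35   # average MPCDM object size from the text
HEADER_SIZE = 8    # hypothetical per-object header size

with_headers = objects_per_page(PAGE_SIZE, OBJECT_SIZE, HEADER_SIZE)
without_headers = objects_per_page(PAGE_SIZE, OBJECT_SIZE)
print(with_headers, without_headers)  # prints "95 117"
```

Under these assumptions a page holds 117 headerless objects instead of 95, so the same working set touches fewer pages, which is consistent with the direction of the reported drop in minor page faults.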

4.2.3 Memory overhead

An important metric for the quality of a multithreaded memory allocator is the memory overhead it introduces, quantified by the amount of virtual memory reserved by the allocator for a given stream of memory requests by the application. Table 2 shows the maximum virtual memory footprint of the seven benchmarks when executed with all allocators. The memory usage of each application was mea-
