Tradeoff between Data-, Instruction-, and Thread-Level Parallelism in Stream Processors


[Figure 5: Relative increase of area per ALU as the number of clusters and the capacity of storage structures are varied. Area is normalized to the baseline 64-bit configuration. Both panels plot area increase relative to baseline against the number of clusters (0 to 128). (a) Sensitivity to instruction store capacity, with curves for capacities of 512, 1024, 4096, and 8192. (b) Sensitivity to SRF capacity, with curves for capacities of 1024, 4096, 8192, and 16384.]

as a multiplicative factor relative to the baseline configuration shown in Fig. 3(a). In the case of instruction storage, there is a minor correlation between the area overhead and the degree of ILP due to VLIW word length changes. This correlation is not shown in the figure and is less than 1% for the near-optimal 2- to 4-ILP configurations. The absolute area added by increasing the instruction store is constant because instruction storage grows only when sequencer groups are added along the TLP dimension. Therefore, for a larger-than-baseline instruction store, the relative overhead decreases sharply as the number of clusters increases and quickly approaches the baseline. Similarly, when the instruction store is smaller than the baseline, the area advantage diminishes as the number of clusters increases. The trends are different when changing the amount of data storage in the SRF, as the SRF capacity scales with the number of ALUs and, hence, the number of clusters. As a result, the area overhead for increasing the SRF size is simply the ratio of the added storage area to the per-ALU area of each configuration. The area per ALU has an optimal point near 8 to 16 clusters, which is where the SRF area overhead is maximal.
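To make the scaling argument concrete, the two overheads can be written per ALU. The notation below is introduced here purely for illustration and does not appear in the paper: A_cfg is the per-ALU area of a given configuration, ΔA_I is the added instruction store area per sequencer group, C is the number of clusters in a sequencer group, N is the number of ALUs per cluster, and ΔA_SRF is the added SRF area attributed to each ALU.

\[
\text{instruction-store overhead} \approx \frac{\Delta A_I}{C \, N \, A_{cfg}},
\qquad
\text{SRF overhead} \approx \frac{\Delta A_{SRF}}{A_{cfg}}
\]

The first ratio shrinks as C grows because a single instruction store is shared by all clusters of a sequencer group, which is why the curves in Fig. 5(a) converge toward the baseline. In the second ratio, ΔA_SRF is roughly fixed per ALU while A_cfg reaches its minimum near 8 to 16 clusters, so the relative SRF overhead peaks there, consistent with Fig. 5(b).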

6. EVALUATION

In this section we evaluate the performance tradeoffs of utilizing ILP, DLP, and TLP mechanisms using regular and irregular control applications that are all throughput-oriented and have parallelism on par with the dataset size. The applications are summarized in Tab. 3.

6.1 Experimental Setup

We modified the Merrimac cycle-accurate simulator [7] to support multiple sequencer groups on a single chip. The memory system contains a single wide address generator to transfer data between the SRF and off-chip memory. We assume a 1GHz core clock frequency and memory timing and parameters conforming to XDR DRAM [9]. The memory system also includes a 256KB stream cache for bandwidth amplification, and scatter-add units, which are used for fast parallel reductions [1]. Our baseline configuration uses a 1MB SRF and 64 multiply-add floating point ALUs arranged in a (1-TLP, 16-DLP, 4-ILP) configuration along with 32 supporting iterative units, and allows for inter-cluster communication within a sequencer group at a rate of 1 word per cluster on each cycle.
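As a concrete illustration of how a (TLP, DLP, ILP) triple determines the ALU count, the snippet below encodes the baseline configuration described above; the struct and its field names are invented here for illustration and are not part of the simulator.

    /* Illustrative only: a minimal encoding of an ALU organization as the
     * (TLP, DLP, ILP) triple used in the text.  All names are hypothetical. */
    #include <assert.h>
    #include <stdio.h>

    struct alu_config {
        int tlp;  /* number of sequencer groups (threads)       */
        int dlp;  /* clusters per sequencer group (SIMD width)  */
        int ilp;  /* multiply-add ALUs per cluster (VLIW width) */
    };

    int total_alus(struct alu_config c) {
        return c.tlp * c.dlp * c.ilp;
    }

    int main(void) {
        /* Baseline from the text: (1-TLP, 16-DLP, 4-ILP) gives 64 ALUs. */
        struct alu_config baseline = { .tlp = 1, .dlp = 16, .ilp = 4 };
        assert(total_alus(baseline) == 64);
        printf("baseline ALUs: %d\n", total_alus(baseline));
        return 0;
    }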

The applications of Tab. 3 are written using a two-level programming model: kernels, which perform the computation referring to data resident in the SRF, are written in KernelC and are compiled by an aggressive VLIW kernel scheduler [24], and stream-level programs, which control kernel invocation and bulk stream memory transfers, are coded using an API that is executed by the scalar control processor. The scalar core dispatches the stream instructions to the memory system and sequencer groups.
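The shape of this two-level model can be sketched in plain C. This is not KernelC and not the actual stream API; every identifier below (scale_add_kernel, stream_program, the srf_* arrays) is hypothetical and merely stands in for the real constructs.

    #include <stdio.h>

    #define N 1024

    /* "Kernel" level: computation referring only to SRF-resident data.
     * In KernelC this loop would be software-pipelined by the VLIW kernel
     * scheduler; here it is plain C for illustration. */
    void scale_add_kernel(const float *a, const float *b, float *out,
                          int n, float alpha) {
        for (int i = 0; i < n; ++i)
            out[i] = alpha * a[i] + b[i];
    }

    /* "Stream" level: the scalar control processor orchestrates bulk
     * memory<->SRF transfers and kernel invocations. */
    void stream_program(const float *mem_a, const float *mem_b, float *mem_out) {
        static float srf_a[N], srf_b[N], srf_out[N];   /* stand-ins for the SRF */

        for (int i = 0; i < N; ++i) {                  /* bulk stream loads */
            srf_a[i] = mem_a[i];
            srf_b[i] = mem_b[i];
        }
        scale_add_kernel(srf_a, srf_b, srf_out, N, 2.0f);  /* kernel dispatch */
        for (int i = 0; i < N; ++i)                    /* bulk stream store */
            mem_out[i] = srf_out[i];
    }

    int main(void) {
        static float a[N], b[N], out[N];
        for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 1.0f; }
        stream_program(a, b, out);
        printf("out[10] = %.1f\n", out[10]);   /* expect 2*10 + 1 = 21.0 */
        return 0;
    }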

6.2 Performance Overview

The performance characteristics of the applications on the baseline 1-TLP, 16-DLP, 4-ILP (1, 16, 4) configuration of Merrimac are reported in Tab. 4. Both CONV2D and MATMUL are regular control applications that are compute bound and achieve a high fraction of the 128 GFLOPS peak performance, at 85.8 and 117.3 GFLOPS respectively. FFT3D is also characterized by regular control and regular access patterns but places higher demands on the memory system. During the third FFT stage of computing the Z dimension of the data, the algorithm generates long strides (2^14 words) that limit the throughput of DRAM and reduce the overall application performance. StreamFEM has regular control within a kernel but includes irregular constructs in the stream-level program. This irregularity limits the degree to which memory transfer latency can be tolerated with software control. StreamFEM also presents irregular data access patterns while processing the finite element mesh structure. This irregular access limits the performance of the memory system and leads to bursty throughput, but the application has high arithmetic intensity and achieves over half of peak performance. StreamMD has highly irregular control and data access as well as a numerically complex kernel with many divide and square root operations. It achieves a significant fraction of peak performance at 50.2 GFLOPS. Finally, the performance of StreamCDP is severely limited by the memory system. StreamCDP issues irregular access patterns to memory with little spatial locality, and also performs only a small number of arithmetic operations for every word transferred from memory (it has low arithmetic intensity). In the remainder of this section we report run time results relative to this baseline performance.
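For reference, arithmetic intensity is used here in its standard sense; the formula below is the common definition and is not spelled out in this excerpt:

\[
\text{arithmetic intensity} = \frac{\text{arithmetic operations performed}}{\text{words of off-chip memory traffic}}
\]

An application with low arithmetic intensity, such as StreamCDP, does too little computation per word fetched to hide its memory demands behind ALU work, so memory throughput becomes the limiter.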

Fig. 6 shows the run time results for our applications on 6 different ALU organizations. The run time is broken down into four categories: all sequencers are busy executing a kernel, corresponding to the compute-bound portions of the application; at least one sequencer is busy while the memory system is also busy, indicating load imbalance between the threads with performance bound by memory throughput; all sequencers are idle while the memory system is busy, during the memory-bound portions of the execution; and at least one sequencer is busy while the memory system is idle, due to load imbalance between the execution control threads.
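A compact way to read these four categories is as a per-cycle classification of sequencer and memory-system activity. The sketch below is a hypothetical illustration of that bookkeeping, not code from the Merrimac simulator; all names are invented.

    /* Hypothetical per-cycle classification mirroring the four run-time
     * categories described above; not taken from the simulator. */
    enum cycle_class {
        ALL_SEQ_BUSY,      /* compute bound: every sequencer runs a kernel     */
        SEQ_AND_MEM_BUSY,  /* load imbalance, bound by memory throughput       */
        MEM_ONLY_BUSY,     /* memory bound: all sequencers idle, memory busy   */
        SEQ_ONLY_BUSY      /* load imbalance between execution control threads */
    };

    enum cycle_class classify(int busy_sequencers, int total_sequencers,
                              int memory_busy) {
        if (busy_sequencers == total_sequencers)
            return ALL_SEQ_BUSY;
        if (busy_sequencers > 0)
            return memory_busy ? SEQ_AND_MEM_BUSY : SEQ_ONLY_BUSY;
        /* All sequencers idle; the breakdown in the text assumes the
         * memory system is busy in this case. */
        return MEM_ONLY_BUSY;
    }

Accumulating cycle counts per class over a run would yield a breakdown like the one plotted in Fig. 6.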
