
Application   Type          Description
CONV2D        regular       2-dimensional 5 × 5 convolution on a 2048 × 2048 dataset [14].
MATMUL        regular       blocked 512 × 512 matrix-matrix multiply.
FFT3D         regular       128 × 128 × 128 3D complex FFT performed as three 1D FFTs.
StreamFEM     (ir)regular   streaming finite-element-method code for magnetohydrodynamics flow equations operating on a 9,664 element mesh [10].
StreamMD      irregular     molecular dynamics simulation of a 4,114 molecule water system [11].
StreamCDP     irregular     finite volume large eddy flow simulation of a 29,096 element mesh [10].

Table 3: Applications used for performance evaluation.

Figure 6: Relative run time normalized to the (1, 16, 4) configuration of the 6 applications (CONV2D, MATMUL, FFT3D, StreamFEM, StreamMD, StreamCDP) on 6 ALU organizations ((1, 16, 4), (2, 8, 4), (16, 1, 4), (1, 32, 2), (2, 16, 2), (4, 8, 2)). Run-time breakdown categories: all_SEQ_busy, some_SEQ_busy_MEM_busy, no_SEQ_busy_MEM_busy, some_SEQ_busy_MEM_idle.

Application    Computation     Memory BW
CONV2D          85.8 GFLOPS    31.1 GB/s
MATMUL         117.3 GFLOPS    15.6 GB/s
FFT3D           48.2 GFLOPS    44.2 GB/s
StreamFEM       65.5 GFLOPS    20.7 GB/s
StreamMD        50.2 GFLOPS    36.3 GB/s
StreamCDP       10.5 GFLOPS    39.1 GB/s
Peak           128   GFLOPS    64   GB/s

Table 4: Summary of application performance on the baseline (1-TLP, 16-DLP, 4-ILP) configuration.
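To put these rates in context, the following minimal sketch (the values are copied from Table 4; the percent-of-peak metric is our own illustration rather than a figure reported in the paper) computes each application's fraction of the machine's peak compute rate and memory bandwidth:

```python
# Achieved compute rate (GFLOPS) and memory bandwidth (GB/s) per application,
# copied from Table 4 (baseline 1-TLP, 16-DLP, 4-ILP configuration).
apps = {
    "CONV2D":    (85.8, 31.1),
    "MATMUL":    (117.3, 15.6),
    "FFT3D":     (48.2, 44.2),
    "StreamFEM": (65.5, 20.7),
    "StreamMD":  (50.2, 36.3),
    "StreamCDP": (10.5, 39.1),
}
PEAK_GFLOPS, PEAK_BW = 128.0, 64.0  # machine peaks from Table 4

for name, (gflops, bw) in apps.items():
    print(f"{name:>10}: {gflops / PEAK_GFLOPS:6.1%} of peak compute, "
          f"{bw / PEAK_BW:6.1%} of peak memory bandwidth")
```

For example, MATMUL reaches roughly 92% of peak compute, while FFT3D uses about 69% of peak memory bandwidth.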

The six configurations were chosen to let us understand and compare the effects of controlling the ALUs with different parallelism dimensions. The total number of ALUs was kept constant at 64 to allow direct performance comparison. The degree of ILP (ALUs per cluster) was chosen as either 2 or 4, as indicated by the analysis in Sec. 5, and the amount of TLP (number of sequencer groups) was varied from 1 to 16. The DLP degree was then chosen to bring the total number of ALUs to 64.
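As a quick illustration (our own sketch; the tuple notation (TLP, DLP, ILP) follows the labels of Fig. 6 and Tab. 4), every evaluated organization keeps the ALU budget at TLP × DLP × ILP = 64:

```python
# Each configuration is written (TLP, DLP, ILP): number of sequencer groups,
# clusters per sequencer group, and ALUs per cluster, as labeled in Fig. 6.
configs = [(1, 16, 4), (2, 8, 4), (16, 1, 4),
           (1, 32, 2), (2, 16, 2), (4, 8, 2)]

for tlp, dlp, ilp in configs:
    total_alus = tlp * dlp * ilp          # held constant at 64 ALUs
    assert total_alus == 64
    print(f"TLP={tlp:2d}  DLP={dlp:2d}  ILP={ilp}  ->  {total_alus} ALUs")
```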

Overall, we see that with the baseline configuration, the amount of ILP utilized is more critical to performance than the number of threads used. In FFT3D, choosing a 2-ILP configuration improves performance by 15%, while the gain due to multiple threads peaks at 7.4% for StreamFEM.

We also note that in most cases, the differences in performance between the configurations are under 5%. After careful evaluation of the simulation results and timing, we conclude that much of that difference is due to the sensitivity of the memory system to the presented access patterns [2]. Below we give a detailed analysis of the clear performance trends that are independent of the memory system fluctuations. We also evaluate the performance sensitivity to the SRF size and the availability of inter-cluster communication.

6.3 ILP Scaling

Changing the degree of ILP (the number of ALUs per cluster) affects performance in two ways. First, the SRF capacity per cluster grows with the degree of ILP, and second, the VLIW kernel scheduler performs better when the number of ALUs it manages is smaller.

Because the total number of ALUs and the total SRF capacity are constant, a 2-ILP configuration has twice as many clusters as a 4-ILP configuration and therefore half the SRF capacity per cluster. The SRF size per cluster limits the degree of locality that can be exploited and can therefore reduce performance. We see evidence of this effect in the run time of CONV2D. CONV2D processes the 2048 × 2048 input set in 2D blocks, and the size of the blocks affects performance: each 2D block has a boundary that must be processed separately from the body of the block, and the ratio of boundary elements to body elements scales roughly as 1/N (where N is the row length of the block). The block size is determined by the SRF capacity in a cluster and the number of clusters. With a 128KB SRF, and accounting for double buffering and book-keeping data, the best performance we obtained for a 2-ILP configuration uses 16 × 16 blocks. With an ILP of 4, the blocks can be as large as 32 × 32, leading to an 11.5% performance advantage.
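A rough back-of-the-envelope sketch (our own illustration; the boundary accounting is simplified to a one-element ring, whereas the actual 5 × 5 convolution uses a wider halo) shows how the boundary-to-body ratio falls with block size:

```python
# For an N x N block with a one-element boundary ring, the boundary holds
# 4N - 4 elements and the body (N - 2)^2, so the ratio falls roughly as 1/N.
def boundary_to_body(n):
    boundary = 4 * n - 4
    body = (n - 2) ** 2
    return boundary / body

for n in (16, 32):
    print(f"{n:2d} x {n:2d} block: boundary/body ~ {boundary_to_body(n):.2f}")
# 16 x 16 -> ~0.31 (2-ILP block size); 32 x 32 -> ~0.14 (4-ILP block size)
```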

Looking at Fig. 7, we see that increasing the SRF size to 8MB reduces the performance difference to 1%, since the proportion of boundary elements decreases; the remaining gap is largely due to imperfect overlap between kernels and memory transfers at the beginning and end of the simulation.

The effectiveness of the VLIW kernel scheduler also plays an important role in choosing the degree of ILP. Our register organization follows the stream organization presented in [30], and uses a distributed VLIW register file with an LRF attached directly to each ALU port. As a result, the register file fragments when the number of ALUs is increased, reducing the effectiveness of the scheduler. Tab. 5 lists the software pipelined loop
