Tradeoff between Data-, Instruction-, and Thread-Level Parallelism in Stream Processors







Figure 7: Relative run time of the 6 applications, normalized to the (1, 16, 4) configuration, comparing a 1MB and an 8MB SRF.

initiation interval (II) achieved for 3 critical kernels. The scheduling advantage of the 2-ILP configurations is clear: their II is significantly lower than twice the II achieved with 4 ALUs, so the doubled cluster count more than compensates for the narrower VLIW schedule. This leads to the 15% and 8.3% performance improvements observed for FFT3D and StreamMD, respectively.
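To make the arithmetic concrete, the following sketch shows how a sub-2x II turns into a net win once the freed ALUs become additional clusters. The II values, iteration count, and total ALU count are invented for illustration, not the measured figures:

```python
def kernel_run_time(ii, total_iterations, alus_per_cluster, total_alus=32):
    """Cycle estimate for a software-pipelined kernel: each cluster
    executes its share of the iterations at one iteration per II cycles."""
    clusters = total_alus // alus_per_cluster
    return ii * (total_iterations // clusters)

iters = 1024
t_4alu = kernel_run_time(ii=6, total_iterations=iters, alus_per_cluster=4)
t_2alu = kernel_run_time(ii=10, total_iterations=iters, alus_per_cluster=2)

# II of 10 is less than 2 * 6, so doubling the cluster count more than
# pays for the higher per-iteration cost: 640 cycles vs. 768.
```

Had the 2-ALU schedule achieved an II of 12 or more, the two configurations would tie or the 4-ALU one would win, which is exactly the threshold the text describes.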


TLP Scaling

The addition of TLP to the Stream Processor has both beneficial and detrimental effects on performance. The benefits arise from performing irregular control for the applications that require it, and from the ability to mask transient memory system stalls with useful work. The second effect plays an important role in improving the performance of the irregular stream-program control of StreamFEM. The detrimental effects are due to the partitioning of the inter-cluster switch into sequencer groups, and the synchronization overhead resulting from load imbalance.
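A toy utilization model (our illustration, not a model from the paper) shows why decorrelated stalls help. If a single lockstep thread leaves all clusters idle on exposed memory stalls a fraction f of the time, and two independent threads stall at uncorrelated times, the whole machine idles only when both stall at once. The stall fraction below is assumed:

```python
# Hypothetical stall-masking model with an invented exposed-stall fraction.
f = 0.15                          # fraction of time a thread is stalled (assumed)

util_one_thread = 1.0 - f         # lockstep: every stall idles the machine
util_two_threads = 1.0 - f * f    # both threads must stall simultaneously

speedup = util_two_threads / util_one_thread
# (1 - f**2) / (1 - f) simplifies to exactly 1 + f.
```

Under these assumptions the benefit grows with the exposed-stall fraction, which matches the intuition that threading helps most when memory time cannot be hidden by double buffering alone.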

First, we note that the addition of TLP has no impact on CONV2D beyond memory system sensitivity. We expected synchronization overhead to reduce performance slightly, but because the application is regular and compute bound, the threads are well balanced and proceed at the same rate of execution. Therefore, the synchronization cost due to waiting on threads to reach the barrier is minimal and the effect is masked by memory system performance fluctuation.
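The barrier cost here is just each thread's slack relative to the slowest thread; for a regular, compute-bound kernel the per-thread times are nearly equal and the slack vanishes. A minimal sketch, with invented thread times:

```python
def barrier_slack(thread_times):
    """Cycles each thread spends waiting at the barrier for the slowest."""
    slowest = max(thread_times)
    return [slowest - t for t in thread_times]

# Balanced threads (regular, compute-bound work): no one waits.
print(barrier_slack([1000, 1000, 1000, 1000]))   # [0, 0, 0, 0]

# Imbalanced threads: everyone waits on the straggler.
print(barrier_slack([1000, 900, 1100, 950]))     # [100, 200, 0, 150]
```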

The performance of MATMUL strongly depends on the size of the sub-matrices processed. For each sub-matrix, the computation requires O(Nsub^3) floating point operations and O(Nsub^2) words, where each sub-matrix is Nsub × Nsub elements. Therefore, the larger the blocks, the more efficient the algorithm. With the baseline configuration, which supports direct inter-cluster communication, the sub-matrices can be partitioned across the entire SRF within each sequencer group. When the TLP degree is 1 or 2, the SRF can support 64 × 64 blocks, but only 32 × 32 blocks for a larger number of threads. In all our configurations, however, the memory system throughput was sufficient and the performance differences are due to a more subtle reason. The kernel computation of MATMUL requires a large number of SRF accesses in the VLIW schedule. When the sub-matrices are partitioned across the SRF, some references are serviced by the inter-cluster switch instead of the SRF. This can most clearly be seen in the (4,8,2) configuration, in which the blocks are smaller and therefore data is reused less, requiring more accesses to the SRF. The same problem does not occur in the (16,1,4) configuration because it provides higher SRF

bandwidth in order to support the memory system throughput (please refer to the discussion of stream buffers in Sec. 3). The left side of Fig. 8 presents the run time results of MATMUL when the inter-cluster switch is removed, such that direct communication between clusters is not permitted, for both a 1MB and an 8MB SRF. Removing the switch prevents sharing of the SRF and forces 16 × 16 sub-matrices with the smaller SRF size, with the exception of (16,1,4), which can still use a 32 × 32 block. As a result, performance degrades significantly. Even when SRF capacity is increased, performance does not match that of (16,1,4) because of the SRF bandwidth issue mentioned above. The group of bars with Gsb = 2 increases the SRF bandwidth by a factor of two for all configurations, and coupled with the larger SRF equalizes the run time of the applications if memory system sensitivity is ignored.
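The block-size tradeoff follows from the standard blocked-GEMM operation and traffic counts (a textbook model, not measurements from the paper): arithmetic intensity grows linearly with Nsub, so halving the block halves the reuse of each word brought into the SRF.

```python
def flops_per_word(n_sub):
    """Arithmetic intensity of one Nsub x Nsub sub-matrix multiply:
    2 * Nsub^3 multiply-add operations against roughly 3 * Nsub^2 words
    of traffic for the A and B sub-blocks and the C result."""
    flops = 2 * n_sub ** 3
    words = 3 * n_sub ** 2
    return flops / words           # grows linearly with Nsub

for n in (64, 32, 16):
    print(n, flops_per_word(n))    # each halving of the block halves reuse
```

This is why the (4,8,2) configuration, forced to 32 × 32 blocks, issues proportionally more SRF and switch references per flop than configurations that fit 64 × 64 blocks.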

We see two effects of TLP scaling on the performance of FFT3D. First, FFT3D demonstrates the detrimental synchronization and load imbalance effect of adding threading support. While both the (1,16,4) and (1,32,2) configurations, which operate all clusters in lockstep, are memory bound, at least one sequencer group in all TLP-enabled configurations is busy at all times.

The stronger trend in FFT3D is that increasing the number of threads reduces the performance of the 2-ILP configurations. This is a result of the induced memory access pattern, which includes a very large stride when processing the Z dimension of the 3D FFT. Introducing more threads changes the blocking of the application and limits the number of consecutive words that are fetched from memory for each large stride. This effect has little to do with the control aspects of the ALUs and is specific to FFT3D.
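The stride effect can be seen from the addressing of a Z pencil in an x-fastest 3D array. The dimensions and the per-thread blocking below are assumptions chosen for illustration, not the configuration used in the experiments:

```python
def z_pencil_addresses(nx, ny, nz, x0, y0):
    """Word addresses of one Z pencil in an x-fastest Nx x Ny x Nz array."""
    stride = nx * ny               # distance between consecutive Z elements
    return [x0 + y0 * nx + z * stride for z in range(nz)]

addrs = z_pencil_addresses(nx=128, ny=128, nz=128, x0=0, y0=0)
# Each Z step jumps nx * ny = 16384 words through memory.

def consecutive_words_per_stride(nx, threads):
    """Assumed blocking: each thread takes nx // threads adjacent x values,
    so only that many consecutive words are fetched before the next jump."""
    return nx // threads
```

Under this assumed blocking, going from 1 thread to 4 shrinks each contiguous burst from 128 words to 32, which is the mechanism by which more threads degrade DRAM efficiency for the large-stride phase.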

StreamFEM demonstrates both the effectiveness of dynamically masking unexpectedly long memory latencies and the detrimental effect of load imbalance. The structure of the StreamFEM application limits the degree of double buffering that can be performed in order to tolerate memory latencies. As a result, computation and memory transfers are not perfectly overlapped, leading to performance degradation. We can see this most clearly in the (1, 32, 2) configuration of StreamFEM in Fig. 6, where there is a significant fraction of time when all clusters are idle waiting on the memory system (the white portion of the bar). When multiple threads are available, this idle time in one thread is masked by execution in another thread, reducing the time to solution and improving performance by 7.4% for the (2, 16, 2) configuration. Increasing the number of threads further reduces performance by introducing load imbalance between the
