Tradeoff between Data-, Instruction-, and Thread-Level Parallelism in Stream Processors (page 4 of 12)
[Figure 2: diagram residue. The three panels show ALUs, local and global switches, instruction paths, and instruction sequencers arranged as (a) SIMD, (b) SIMD+VLIW, and (c) SIMD+VLIW+MIMD organizations.]

Figure 2: ALU organization along the DLP, ILP, and TLP axes with SIMD, VLIW, and MIMD mechanisms, respectively.

scope of this paper. It is possible to design an interconnect across multiple sequencer groups and partition it in software to allow a reconfigurable TLP option, and the hardware cost tradeoff of a full vs. a partitioned switch is discussed in Sec. 5. Another option is to communicate between sequencer groups using coarser-grained messages that amortize synchronization costs. For example, the Cell processor uses DMA commands to transfer data between its 8 sequencer groups (SPEs in Cell terminology) [28].
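The amortization argument for coarser-grained messages can be made concrete with a back-of-the-envelope model. All numbers below are illustrative assumptions on our part, not measurements from the paper: a fixed per-message synchronization cost is paid before data streams at the peak link rate, so larger transfers approach peak bandwidth.

```python
def effective_bandwidth(transfer_bytes, peak_bw_gb_s, sync_overhead_ns):
    """Effective bandwidth (GB/s) of a transfer that pays a fixed
    synchronization cost before streaming at the peak rate.
    1 GB/s = 1 byte/ns, so the units cancel conveniently."""
    transfer_ns = transfer_bytes / peak_bw_gb_s
    return transfer_bytes / (sync_overhead_ns + transfer_ns)

# Hypothetical numbers: a 25.6 GB/s link and 200 ns of synchronization.
for size in (128, 1024, 16 * 1024):
    print(f"{size:6d} B -> {effective_bandwidth(size, 25.6, 200.0):5.2f} GB/s")
```

Under these assumed numbers, a 128-byte message achieves well under 1 GB/s while a 16 KB DMA transfer approaches the peak rate, which is why batching communication into large DMA commands amortizes synchronization costs.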

We do not evaluate extending a Stream Processor in the TLP dimension by adding virtual contexts that share the existing hardware, as suggested for other architectures in [21, 36]. Unlike those architectures, a Stream Processor relies heavily on explicitly expressing locality and latency hiding in software and exposes a deep bandwidth/locality hierarchy with hundreds of registers and megabytes of software-controlled memory. Replicating such a large context is infeasible, and partitioning the private LRFs reduces software's ability to express locality.

4. IMPACT OF APPLICATIONS ON ALU CONTROL

In order to understand the relationship between application properties and the mechanisms for ALU control and communication described in Sec. 3, we characterize numerically-intensive applications based on the following three criteria. First, whether the application is throughput-oriented or presents real-time constraints. Second, whether the parallelism in the application scales with the dataset or is fixed by the numerical algorithm. Third, whether the application requires regular or irregular control flow within the numerical algorithm. Regular control corresponds to loops with statically determined bounds, while irregular control implies that the work performed is data dependent and can only be determined at runtime.
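The regular/irregular distinction can be illustrated with two loop skeletons (a minimal sketch; the kernels and names are ours, chosen only to contrast the two control patterns):

```python
def regular_kernel(data):
    """Regular control: the trip count (len(data)) is known before the
    loop starts, so a static schedule can cover the whole loop."""
    acc = 0.0
    for x in data:                # statically determined bounds
        acc += x * x
    return acc

def irregular_kernel(data, threshold):
    """Irregular control: the inner trip count depends on the values in
    data and is only discovered at runtime."""
    acc = 0.0
    for x in data:
        while x > threshold:      # data-dependent iteration count
            x *= 0.5
            acc += x
    return acc
```

A compiler can software-pipeline and statically schedule the first kernel; the second forces runtime branching, which favors per-sequencer autonomy over lockstep SIMD control.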

4.1 Throughput vs. Real-Time

In throughput-oriented applications, minimizing the time to solution is the most important criterion. For example, when multiplying large matrices, simulating a complex physical system as part of a scientific experiment, or performing offline signal and image processing, the algorithms and software system can employ a batch-processing style. In applications with real-time constraints, on the other hand, the usage model restricts the latency allowed for each sub-computation. In a gaming environment, for instance, the system must continuously respond to user input while performing video, audio, physics, and AI tasks.

The Imagine and Merrimac Stream Processors, as well as the similarly architected CSX-600, are designed for throughput-oriented applications. The ALUs in these processors are controlled with SIMD-VLIW instructions along the ILP and DLP dimensions only. The entire on-chip state of the processor, including the SRF (1MB/576KB in Merrimac/CSX-600) and LRF (64KB/12KB), is explicitly managed by software and controlled with a single thread of execution. The software system exploits locality and parallelism in the application to utilize all processor resources. This organization works well for a large number of applications and scales to hundreds of ALUs [7, 18]. However, supporting applications with real-time constraints can be challenging. As with a conventional single-threaded processor, the software system may either preempt a running task when an event must be processed, or partition tasks into smaller subtasks that can be interleaved to ensure real-time goals are achieved.

Performing a preemptive context switch requires that the SRF be allocated to support the working sets of both tasks. In addition, the register state must be saved and restored, potentially requiring over 5% of SRF capacity and consuming hundreds of cycles due to the large number of registers. It is possible to partition the registers between multiple tasks to avoid this overhead, at the expense of a smaller register space available to exploit locality.
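The "over 5%" figure is consistent with the Merrimac capacities quoted earlier. A quick check, treating the full LRF contents as the state to save (an assumption on our part):

```python
# Capacities quoted above for Merrimac.
srf_bytes = 1024 * 1024   # 1 MB Stream Register File
lrf_bytes = 64 * 1024     # 64 KB of Local Register Files

# Saving the register state into the SRF on a preemptive context
# switch consumes this fraction of SRF capacity:
fraction = lrf_bytes / srf_bytes
print(f"{fraction:.2%}")  # 6.25%, i.e. "over 5%" of SRF capacity
```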

Subdividing tasks exposes another weakness of throughput-oriented designs. The aggressive static scheduling required to control the Stream Processor's ALUs often results in high task-startup costs when software-pipelined loops must be primed and drained. If the amount of work a task performs is small, for example because it was subdivided to meet interactive constraints, these overheads adversely affect performance.
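The prime/drain overhead can be modeled with a simple efficiency formula (an illustrative sketch; the prologue and epilogue costs below are assumed values, not figures from the paper):

```python
def pipelined_efficiency(iterations, prologue, epilogue):
    """Fraction of cycles spent in the steady-state kernel when a
    software-pipelined loop must be primed (prologue) and drained
    (epilogue). All costs are measured in iteration-equivalents."""
    return iterations / (iterations + prologue + epilogue)

# Hypothetical prologue/epilogue costs of 8 iterations each:
for n in (16, 256, 4096):
    print(f"{n:5d} iterations -> {pipelined_efficiency(n, 8, 8):.1%} efficient")
```

The fixed startup cost is negligible for long-running tasks but dominates when a task is cut into short subtasks, which is exactly the real-time scenario described above.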

Introducing MIMD to support multiple concurrent threads of control can alleviate these problems by allocating tasks spatially across the sequencer groups. In this way, resources are dedicated to specific tasks, leading to predictable timing. The Cell processor, which is used in the Sony PlayStation 3 gaming console, takes this approach. In our terminology, a Cell is organized as 8 sequencer groups with one cluster each, and one non-ALU functional unit and one FPU per cluster. The functional units operate on 128-bit wide words representing short
