Figure 1: Canonical Stream Processor architecture
rise as distance grows.
The bandwidth hierarchy consists of the DRAM interfaces that process off-chip communication, the SRF that serves as a staging area for bulk transfers to and from DRAM and utilizes high-bandwidth on-chip structures, and the LRFs that are highly partitioned and tightly connected to the ALUs in order to support their bandwidth demands. The same organization supports a hierarchy of locality. The LRFs exploit short-term producer-consumer locality within kernels, while the SRF targets producer-consumer locality between kernels and is able to capture a working set of the application. To facilitate effective utilization of the bandwidth/locality hierarchy, each level has a separate name space. The LRFs are explicitly addressed as registers by ALU instructions. The SRF is arbitrarily addressable on-chip memory that is banked to support high bandwidth and may provide several addressing modes. The SRF is designed with a wide single-ported SRAM for efficiency, and a set of stream buffers (SBs) is used to time-multiplex the SRF port. The memory system maintains the global DRAM address space.
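The role of the stream buffers can be illustrated with a small behavioral model. The sketch below is not the arbitration scheme of any particular Stream Processor: the round-robin policy, the refill-only-when-empty rule, and the names `simulate_stream_buffers`, `streams`, and `port_width` are all assumptions made for illustration. It only shows how one wide port access per cycle can feed several clients that each consume one word per cycle.

```python
from collections import deque

def simulate_stream_buffers(streams, port_width):
    """Hypothetical model of SBs time-multiplexing a single wide SRF port.

    Each cycle:
      1. the single port refills at most one empty stream buffer with up to
         `port_width` words (round-robin choice, an assumed policy);
      2. every non-empty stream buffer hands one word to its client.
    Returns (words delivered per stream, number of cycles).
    """
    n = len(streams)
    cursors = [0] * n                      # next SRF word to fetch, per stream
    buffers = [deque() for _ in range(n)]  # one stream buffer per stream
    delivered = [[] for _ in range(n)]
    next_sb = 0                            # round-robin pointer
    cycles = 0
    while any(len(delivered[i]) < len(streams[i]) for i in range(n)):
        # Single-ported SRF: at most one wide access per cycle.
        for k in range(n):
            i = (next_sb + k) % n
            if not buffers[i] and cursors[i] < len(streams[i]):
                block = streams[i][cursors[i]:cursors[i] + port_width]
                buffers[i].extend(block)
                cursors[i] += len(block)
                next_sb = (i + 1) % n
                break
        # Client side: one word per stream buffer per cycle.
        for i in range(n):
            if buffers[i]:
                delivered[i].append(buffers[i].popleft())
        cycles += 1
    return delivered, cycles

# Two streams of eight words each share one port that is four words wide.
streams = [list(range(8)), list(range(100, 108))]
delivered, cycles = simulate_stream_buffers(streams, port_width=4)
```

In this model every stream arrives in order even though only one buffer touches the SRF in any cycle, because a single wide access supplies several cycles of narrow accesses.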
The general purpose core executes scalar instructions for overall program control and dispatches coarse-grained stream instructions to the memory system and instruction sequencer. Stream Processors deal with more coarsely grained instructions in their control flow than both superscalar and vector processors, significantly alleviating the von Neumann bottleneck. The DMA engines in the memory system and the ALU instruction sequencers operate at a finer granularity and rely on decoupling for efficiency and high throughput. In a Stream Processor the ALUs can only access their private LRF and the SRF, and rely on the higher level stream program to transfer data between memory and the SRF. Therefore, the SRF decouples the ALU pipeline from the unpredictable latencies of DRAM accesses, enabling aggressive and effective static scheduling and a design with lower hardware cost compared to out-of-order architectures. Similarly, the memory system only handles DMA requests and can be optimized for throughput rather than latency.
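The division of labor described above can be sketched as a toy stream program. The class `StreamProcessorModel` and its methods `stream_load`, `run_kernel`, and `stream_store` are hypothetical names introduced here, not an actual stream instruction set; the sketch assumes each coarse-grained stream instruction moves or processes a whole stream at once, and that kernels read and write only SRF-resident streams.

```python
from dataclasses import dataclass, field

@dataclass
class StreamProcessorModel:
    """Hypothetical functional model with separate DRAM and SRF name spaces."""
    dram: dict = field(default_factory=dict)  # global DRAM address space
    srf: dict = field(default_factory=dict)   # on-chip SRF name space

    def stream_load(self, dram_name, srf_name):
        # Bulk DMA transfer from DRAM into the SRF.
        self.srf[srf_name] = list(self.dram[dram_name])

    def run_kernel(self, kernel, in_names, out_name):
        # Kernels touch only the SRF (and, implicitly, LRFs), never DRAM.
        inputs = [self.srf[name] for name in in_names]
        self.srf[out_name] = [kernel(*elems) for elems in zip(*inputs)]

    def stream_store(self, srf_name, dram_name):
        # Bulk DMA transfer from the SRF back to DRAM.
        self.dram[dram_name] = list(self.srf[srf_name])

# The scalar core issues only a handful of coarse-grained stream instructions.
sp = StreamProcessorModel(dram={"a": [1, 2, 3], "b": [10, 20, 30]})
sp.stream_load("a", "sa")
sp.stream_load("b", "sb")
sp.run_kernel(lambda x, y: x + y, ["sa", "sb"], "sc")  # first kernel
sp.run_kernel(lambda x: 2 * x, ["sc"], "sd")           # consumes "sc" from the SRF
sp.stream_store("sd", "d")
```

The intermediate stream `sc` is produced by one kernel and consumed by the next without ever being written to DRAM, which is the inter-kernel producer-consumer locality the SRF is meant to capture.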
In addition to the ALUs, a Stream Processor provides non-ALU functional units to support specialized arithmetic operations and to handle inter-PE communication. We assume that at least one ITER unit is available in each PE to accelerate iterative divide and square root operations, and that one or more COMM functional units are provided for inter-PE communication when applicable. The focus of this paper is on the organization and scaling of the ALUs along the ILP, DLP, and TLP axes of control as described below.
We now describe how control of the ALUs can be structured along the different dimensions of parallelism and discuss the implications for synchronization and communication mechanisms.
A DLP organization of ALUs takes the form of a single instruction sequencer issuing SIMD instructions to a collection of ALUs. The ALUs within the group execute the same instructions in lockstep on different data. Unlike with vectors or wide-word arithmetic, each ALU can potentially access different SRF addresses. In this organization an optional switch can be introduced to connect the ALUs. Fig. 2(a) shows how all ALUs receive the same instruction (the “white” instruction) from a single instruction issue path and communicate on a global switch shared between the group of ALUs. Because the ALUs within the group operate in lockstep, simple control structures can utilize the switch for direct exchange of words between the ALUs, to implement the conditional streams mechanism for efficient data-dependent control operations, and to dynamically access SRF locations across several SRF banks.
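A minimal functional sketch of this DLP organization follows. It assumes one SRF bank per ALU and models the switch as a simple rotation of words between ALUs; the helper names `simd_step` and `switch_rotate` are hypothetical and chosen only for illustration.

```python
def simd_step(op, srf_banks, addresses):
    """One SIMD instruction: every ALU applies the same `op`,
    each to the word at its own address in its own SRF bank (assumed layout)."""
    return [op(bank[addr]) for bank, addr in zip(srf_banks, addresses)]

def switch_rotate(values, shift):
    """Lockstep exchange over the switch: ALU i receives the word
    held by ALU (i - shift) mod N."""
    n = len(values)
    return [values[(i - shift) % n] for i in range(n)]

# Four ALUs, four banks; each ALU reads a different address in its bank.
banks = [[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]
addresses = [3, 0, 2, 1]
result = simd_step(lambda x: x + 1, banks, addresses)
rotated = switch_rotate(result, 1)
```

Because all ALUs execute the same instruction at the same time, the exchange pattern in `switch_rotate` needs no handshaking: every ALU sends and receives in the same step.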
Another possible implementation of DLP in a Stream Processor is to use short-vector ALUs that operate on wide words as with SSE, 3DNOW!, and Altivec. This approach was taken in Imagine for 8-bit and 16-bit arithmetic within 32-bit words, on top of the flexible-addressing SIMD, to provide two levels of DLP. We chose not to evaluate this option in this paper because it significantly complicates programming and compilation, and we explore the DLP dimension using clus-
To introduce ILP into the DLP configuration, the ALUs are partitioned into clusters (in this organization clusters correspond to PEs). Within a cluster each ALU receives a different operation from a VLIW instruction. DLP is used across clusters by using a single sequencer to issue SIMD instructions as before, but with each of these instructions being VLIW. Fig. 2(b) shows an organization with four clusters of four ALUs each. In each cluster a VLIW provides a “black”, a “dark gray”, a “light gray”, and a “white” instruction separately to each of the four ALUs. The same set of four instructions feeds the group of ALUs in all clusters (illustrated with shading in Fig. 2(b)).
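The combined SIMD/VLIW issue can be sketched functionally as follows. The function `issue_vliw` is a hypothetical name, and the model assumes each VLIW slot is a unary operation applied to one operand per ALU; real VLIW slots would name LRF registers and functional units.

```python
def issue_vliw(vliw, cluster_operands):
    """Broadcast one VLIW instruction to every cluster (SIMD across clusters).

    vliw: one operation per ALU slot (ILP within a cluster).
    cluster_operands: for each cluster, one operand per ALU.
    """
    return [[op(x) for op, x in zip(vliw, operands)]
            for operands in cluster_operands]

# Four slots, standing in for the four differently shaded instructions.
vliw = [
    lambda x: x + 1,   # slot 0
    lambda x: 2 * x,   # slot 1
    lambda x: -x,      # slot 2
    lambda x: x * x,   # slot 3
]

# Four clusters of four ALUs; cluster c operates on the value c + 1.
operands = [[c + 1] * 4 for c in range(4)]
results = issue_vliw(vliw, operands)
```

Within a row of `results` the four entries come from four different operations, while down a column every cluster applied the same operation to its own data.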
In this SIMD/VLIW clustered organization the global switch described above becomes hierarchical. An intra-cluster switch connects the ALUs and LRFs within a cluster, and is statically scheduled by the VLIW compiler. The clusters are connected with an inter-cluster switch that is controlled in the same manner as the global DLP switch. This hierarchical organization provides an area-efficient and high-bandwidth interconnect.
To address TLP we provide hardware MIMD support by adding multiple instruction sequencers and partitioning the inter-cluster switch. As shown in Fig. 2(c), each sequencer controls a sequencer group of clusters (in this example four clusters of two ALUs each). Within each cluster the ALUs share an intra-cluster switch, and within each sequencer group the clusters can have an optional inter-cluster switch. The inter-cluster switch can be extended across multiple sequencers; however, doing so requires costly synchronization mechanisms to ensure that a transfer across the switch is possible and that the data is coherent. A discussion of such mechanisms is beyond the