vectors of four 32-bit operands each. As mentioned in Sec. 3, we do not directly evaluate vector-style SIMD in this paper and focus on an architecture where SIMD clusters can access independent SRF addresses.
Scaling of Parallelism
Many numerical algorithms and applications have parallelism that scales with the dataset size. Examples include n-body sys- tem simulation where interactions are calculated for each par- ticle, processing of pixels in an image, images within a video stream, and points in a sampled signal. In such applications the parallelism can typically be cast into all three parallelism dimensions. Multiple threads can process separate partitions of the dataset, multiple clusters can work simultaneously on different elements, and loop unrolling and software pipelining transform data parallelism into ILP. The resulting performance depends on multiple factors such as the regularity of the con- trol, load balancing issues, and inherent ILP as will be discussed in Sec. 6.
On the other hand, some numerical algorithms place a limit on the amount of parallelism, reducing scalability. For exam- ple, the deblocking filter algorithm of H.264 is limited in paral- lelism to computations within a 4 × 4 block because of a data dependency between blocks . This leads to fine-grained se- rialization points of control, and the application may be more amenable to a space-multiplexed TLP mapping. The mapping of this type of limited-parallelism algorithm to highly parallel systems is a topic of active research and its evaluation is beyond the scope of this paper.
All control decisions within a regular control portion of an al- gorithm can be determined statically. A typical case of regular control is a loop nest with statically known bounds. Regu- lar control applications are very common and examples include convolution, FFT, dense matrix multiplication, and fixed de- gree meshes in scientific codes. Such algorithms can be easily mapped onto the DLP dimension and executed under SIMD- type control, and can also be cast into ILP and TLP. Convert- ing DLP into TLP incurs overheads related to synchroniza- tion of the threads due to load imbalance. Casting DLP as ILP increases pipeline scheduling overheads related to software pipelining and loop unrolling, that are required to effectively utilize many ALUs. As a result, we expect that a high SIMD degree is necessary to achieve best performance on regular con- trol applications.
In code with irregular control, decisions on control flow are made based on runtime information and cannot be determined statically. For example, when processing an element mesh the number of neighbors each element has may vary, leading to irregular control. Mapping irregular algorithms to the TLP dimension is simple as each sequencer follows its own control path. Mapping along the DLP dimension onto SIMD hardware is more challenging, and is discussed in detail in prior work [10, 15]. In  results based on an analytic performance model of a threaded stream architecture suggest that the potential of utilizing TLP can be as high as 30% over a DLP-ILP only configuration.
We evaluate the benefits and overheads of utilizing paral- lelism along the three axes for both regular and irregular ap- plications in Sec. 6.
VLSI AREA MODELS
Our hardware cost model is based on an estimate of the area of each architectural component for the different organizations normalized to a single ALU. We focus on area rather than en- ergy and delay for two reasons. First, as shown in , the energy of a stream processor is highly correlated to area. Sec- ond, a Stream Processor is designed to efficiently tolerate laten- cies by relying on software and the locality hierarchy, thereby reducing the sensitivity of performance to pipeline delays. Fur- thermore, as we will show below, near optimal configurations fall within a narrow range of parameters limiting the dispar- ity in delay between desirable configurations. In this paper we only analyze the area of the structures relating to the ALUs and their control. The scalar core and memory system per- formance must scale with the number and throughput of the ALUs and do not depend on the ALU organization. Based on the implementation of Imagine  and the design of Merri- mac we estimate that the memory system and the scalar core account for roughly 40% of a Stream Processor’s die area.
The area model follows the methodology developed in  adapted to the design of the Merrimac processor [7,14] with the additional MIMD mechanism introduced in this paper. Tab. 1 and Tab. 2 summarize the parameters and equations used in the area cost model. The physical sizes are given in technology independent track and grid units, and are based on an ASIC flow. A track corresponds to the minimal distance between two metal wires at the lowest metal layer. A grid is the area of a (1 × 1) track block. Because modern VLSI designs tend to be wire limited, areas and spans measured in grids and tracks scale with VLSI fabrication technology.
We only briefly present the parameters and equations of the area model and a more detailed description can be found in .
The first parameter in Tab. 1 is the data width of the archi- tecture, and we show results for both 64-bit and 32-bit arith- metic. The second set of parameters corresponds to the per-bit areas of storage elements, where the GSRF parameter accounts for the multiplexers and decoders necessary for the SRF struc- tures. The third group relates to the datapath of the Stream Processor and gives the widths and heights of the functional units and LRF.
Because the stream architecture is strongly partitioned, we only need to calculate the area of a given sequencer group to draw conclusions on scalability, as the total area is simply the product of the area of a sequencer group and the number of sequencers.
The main components of a sequencer group are the sequencer itself, the SRF, the clusters of functional units, LRF, and intra- cluster switch. A sequencer may also contain an inter-cluster switch to interconnect the clusters.
The sequencer includes the SRAM for storing the instruc- tions and a datapath with a simple and narrow ALU (similar in area to a non-ALU numerical functional unit) and registers. The instruction distribution network utilizes high metal layers and does not contribute to the area.
In our architecture, the SRF is banked into lanes such that each cluster contains a single lane of the SRF, as in Imagine and Merrimac. The capacity of the SRF per ALU is fixed (SSRF ), but the distribution of the SRF array depends on C and N. The SRF cost also includes the hardware of the stream buffers. The amount of buffering required depends on both the number of ALUs within a cluster, and the bandwidth supplied by the wide SRF port. The number of SBs is equal to the number of external ports in a cluster Pe. The number of ports