32, 37]. The study presented here is a fair exploration of the cost-performance tradeoff and scaling for the three parallelism dimensions using a single execution model and algorithm for all benchmarks.
The main hardware cost of TLP is in additional instruction sequencers and storage. Contrary to our intuition, the cost analysis indicates that the area overhead of supporting TLP is in some cases quite low. Even when all PEs are operated in MIMD, and each has a dedicated instruction sequencer, the area overhead is only 15% compared to our single-threaded baseline. When the architectural properties of the baseline are varied, the cost of TLP can be as high as 86%. Therefore, we advocate a mix of ILP, DLP, and TLP that keeps hardware overheads to less than 5% compared to a minimum-cost organization. The performance evaluation also yielded interesting results in that the benefit of the added flexibility of exploiting TLP is hindered by increased synchronization costs. For the irregular applications we study, performance improvement of the threaded Stream Processor over a non-threaded version is limited to 7.4%, far less than the 30% figure calculated in . We also observe that as the number of ALUs is scaled up beyond roughly 64 ALUs per chip, the cost of supporting communication between the ALUs grows sharply, suggesting a design point with VLIW in the range of 2–4 ALUs per PE, SIMD across PEs in groups of 8–16 PEs, and MIMD for scaling up to the desired number of ALUs.
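As a rough illustration of the suggested design point, the sketch below enumerates organizations whose ALU count factors into an ILP width of 2–4 ALUs per PE, a SIMD group of 8–16 PEs, and a whole number of MIMD groups. The function name `design_points` and the requirement that the three factors multiply exactly to the ALU total are assumptions made for this example; it performs arithmetic only and is not the cost model developed in this paper.

```python
def design_points(total_alus, ilp_range=(2, 4), dlp_range=(8, 16)):
    """Enumerate (ilp, dlp, tlp) triples with ilp * dlp * tlp == total_alus.

    ilp: ALUs per PE (VLIW width), assumed range 2-4.
    dlp: PEs per SIMD group, assumed range 8-16.
    tlp: number of independently sequenced (MIMD) groups.
    """
    points = []
    for ilp in range(ilp_range[0], ilp_range[1] + 1):
        for dlp in range(dlp_range[0], dlp_range[1] + 1):
            alus_per_group = ilp * dlp
            # Keep only organizations that use exactly total_alus ALUs.
            if total_alus % alus_per_group == 0:
                points.append((ilp, dlp, total_alus // alus_per_group))
    return points


if __name__ == "__main__":
    for ilp, dlp, tlp in design_points(128):
        print(f"ILP={ilp} ALUs/PE, DLP={dlp} PEs/group, TLP={tlp} groups")
```

For a hypothetical 128-ALU chip this yields four candidate organizations, each of which would then need to be compared using a cost and performance model.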
The main contributions of this paper are as follows:
We develop detailed processor-performance and hardware-cost models for a multi-threaded Stream Processor.
We study, for the first time, the hardware tradeoff of scaling the number of ALUs along the ILP, DLP, and TLP dimensions combined.
We evaluate the performance benefits of the additional flexibility allowed by multiple threads of control, and the performance overheads associated with the various hardware mechanisms that exploit parallelism.
In Sec. 2 we present prior work related to this paper. We describe our architecture and the ILP/DLP/TLP hardware tradeoff in Sec. 3. Sec. 4 discusses application properties and corresponding architectural choices. We develop the hardware cost model and analyze scaling in Sec. 5. Our experiments and performance evaluation appear in Sec. 6 and we conclude in Sec. 7. This work is described in more detail in Chapter 6 of .
In this section we describe prior work related to this paper. We focus on research that specifically explored scalability.
A large body of prior work addressed the scalability of general purpose architectures targeted at the sequential, single-threaded execution model. This includes work on ILP architectures, such as [13, 26, 27], and speculative or fine-grained threading as in [12, 31, 33]. Similar studies were conducted for cache-coherent chip multi-processors targeting a more fine-grained cooperative threads execution model (e.g., ). In contrast, we examine a different architectural space in which parallelism is explicit in the programming model, offering a widely different set of tradeoff options.
Recently, architectures that target several execution models have been developed. The RAW architecture  scales along the TLP dimension and provides hardware for software-controlled, low-overhead communication between PEs. Smart Memories  and the Vector-Thread architecture  support both SIMD and MIMD control, but the scaling of ALUs follows the TLP axis. TRIPS  can repartition the ALU control based on the parallelism axis utilized best by an application; however, the hardware is fixed and scalability is essentially on the ILP axis within a PE and via coarse-grained TLP across PEs. The Sony Cell Broadband Engine™ processor (Cell)  is a flexible Stream Processor that can only be scaled utilizing TLP. A feature common to the work mentioned above is that there is little evaluation of the hardware costs and tradeoffs of scaling using a combination of ILP, DLP, and TLP, which is the focus of this paper.
The hardware and performance costs of scaling the DLP dimension were investigated in the case of the VIRAM architecture for media applications . Media applications with regular control were also the focus of , which evaluated the scalability of Stream Processors along the DLP and ILP dimensions. We extend this work by incorporating TLP support into the architecture and evaluating the tradeoff options of ALU control organizations for both regular- and irregular-control applications.
In this section we give an overview of the stream execution model and stream architecture and focus on the mechanisms and tradeoffs of controlling many ALUs utilizing a combination of ILP, DLP, and TLP. A more thorough description of stream architectures and their merits appears in [7, 17].
3.1 Stream Architecture Overview
We first present the generalized stream execution model, and then a stream architecture that is designed to exploit it.
3.1.1 Stream Execution Model
The stream execution model targets compute-intensive numerical codes with large amounts of parallelism, structured control, and memory accesses that can be determined well in advance of data use. Stream programs may follow a restricted synchronous data flow representation [22, 35] or a generalized gather–compute–scatter form [5, 16]. We use the generalized stream model, which has been shown to be a good match for the media, signal processing, and physical modeling scientific computing domains [7, 17]. In the generalized form, coarse-grained, complex kernel operations are executed on collections of data elements, referred to as streams or blocks, that are transferred using asynchronous bulk operations. The generalized stream model is not restricted to sequentially processing all elements of the input data; instead, streaming is on a larger scale and refers to processing a sequence of blocks (streams). Relying on coarse-grained control and data transfer allows for less complex and more power- and area-efficient hardware, as delegating greater responsibility to software reduces the need to dynamically extract parallelism and locality.
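The gather–compute–scatter form described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the programming system used in this paper: the names `gather_compute_scatter` and `scale_kernel` are hypothetical, memory is modeled as a plain list, and the bulk transfers run synchronously here, whereas the model described above issues them asynchronously so that they overlap with kernel execution.

```python
def gather_compute_scatter(memory, indices, kernel, block_size):
    """Process `indices` one block (stream) at a time.

    memory:     list standing in for off-chip memory (assumption).
    indices:    element addresses, known ahead of the computation.
    kernel:     function mapping an input stream to an output stream
                of the same length.
    block_size: number of elements per block.
    """
    result = list(memory)  # copy so the input memory is left untouched
    for start in range(0, len(indices), block_size):
        block = indices[start:start + block_size]
        # Gather: bulk-load one block of (possibly non-contiguous) elements.
        stream_in = [memory[i] for i in block]
        # Compute: run the coarse-grained kernel on the whole block.
        stream_out = kernel(stream_in)
        # Scatter: bulk-store the results back to their addresses.
        for i, value in zip(block, stream_out):
            result[i] = value
    return result


def scale_kernel(stream):
    """Example kernel (hypothetical): double every element."""
    return [2 * x for x in stream]


if __name__ == "__main__":
    print(gather_compute_scatter([1, 2, 3, 4, 5], [4, 0, 2], scale_kernel, 2))
```

Because the addresses are known before the kernel runs, a real implementation could issue the gather for the next block while the current block is being computed; the sequential loop above omits that overlap for clarity.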
3.1.2 Stream Processor Architecture
Fig. 1 depicts a canonical Stream Processor that consists of a simple general purpose control core, a throughput-oriented streaming memory system, on-chip local storage in the form of the stream register file (SRF), a set of ALUs with their associated local register files (LRFs), and instruction sequencers. This organization forms a hierarchy of bandwidth, locality, and control mitigating the detrimental effects of distance in modern VLSI processes, where bandwidth drops and latency and power