Tradeoff between Data-, Instruction-, and Thread-Level Parallelism in Stream Processors
Jung Ho Ahn Hewlett-Packard Laboratories Palo Alto, California
Mattan Erez University of Texas at Austin Austin, Texas
William J. Dally Stanford University Stanford, California
This paper explores the scalability of the Stream Processor architecture along the instruction-, data-, and thread-level parallelism dimensions. We develop detailed VLSI-cost and processor-performance models for a multi-threaded Stream Processor and evaluate the tradeoffs, in both functionality and hardware costs, of mechanisms that exploit the different types of parallelism. We show that the hardware overhead of supporting coarse-grained independent threads of control is 15% to 86% depending on machine parameters. We also demonstrate that the performance gains provided are of a smaller magnitude for a set of numerical applications. We argue that for stream applications with scalable parallel algorithms the performance is not very sensitive to the control structures used within a large range of area-efficient architectural choices. We evaluate the specific effects on performance of scaling along the different parallelism dimensions and explain the limitations of the ILP, DLP, and TLP hardware mechanisms.
Categories and Subject Descriptors
C.1.3 [Processor Architectures]: Other Architecture Styles
Stream Processors, Aspect Ratio, DLP, ILP, TLP
The increasing importance of numerical applications and the properties of modern VLSI processes have led to a resurgence in the development of architectures with a large number of ALUs and extensive support for parallelism (e.g., [7, 17, 19, 21, 23, 28, 32, 37]). In particular, Stream Processors achieve area- and energy-efficient high performance by relying on the abundant parallelism, multiple levels of locality, and predictability of data accesses common to media, signal processing, and scientific application domains. The term Stream Processor refers to the architectural style exemplified by Imagine, the ClearSpeed CSX600, Merrimac, and the Cell Broadband Engine (Cell).
Stream Processors are optimized for the stream execution model and not for programs that feature only instruction-level parallelism or that rely on fine-grained interacting threads. Stream Processors achieve high efficiency and performance by providing a large number of ALUs partitioned into processing elements (PEs), minimizing the amount of hardware dedicated to data-dependent control, and exposing a deep storage hierarchy that is tuned for throughput. Software explicitly expresses multiple levels of parallelism and locality and is responsible for both concurrent execution and latency hiding.
In this paper we extend the stream architecture of Merrimac to support thread-level parallelism (TLP) on top of the mechanisms for data-level and instruction-level parallelism (DLP and ILP, respectively). Previous work on media applications has shown that Stream Processors can scale to a large number of ALUs without employing TLP. However, we show that exploiting the TLP dimension can reduce hardware costs when very large numbers of ALUs are provided and can lead to performance improvements when more complex and irregular algorithms are employed.
In our architecture, ILP is used to drive multiple functional units within each PE and to tolerate pipeline latencies. We use VLIW instructions that enable a highly area- and energy-efficient register file organization, but a scalar instruction set is also possible as in . DLP is used to achieve high utilization of the throughput-oriented memory system and to feed a large number of PEs operated in SIMD fashion. To enable concurrent threads, multiple instruction sequencers are introduced, each controlling a group of PEs in MIMD fashion. Thus, we support an execution model of coarse-grained independent threads of control. Each instruction sequencer supplies SIMD-VLIW instructions to its group of PEs, yielding a fully flexible MIMD-SIMD-VLIW chip architecture. More details are given in Sec. 3.
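The three-level hierarchy described above can be illustrated with a short sketch. It is not taken from the paper: the names `StreamConfig`, `sequencers`, `pes_per_sequencer`, `alus_per_pe`, and `design_points` are hypothetical, and the restriction to power-of-two sizes is an assumption made only to keep the enumeration small. The sketch treats a design point as a (TLP, DLP, ILP) triple whose product is the total ALU count, and lists the ways a fixed ALU budget could be split across the three dimensions.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class StreamConfig:
    """Hypothetical description of one MIMD-SIMD-VLIW design point."""

    sequencers: int         # TLP: independent instruction sequencers (MIMD)
    pes_per_sequencer: int  # DLP: PEs driven in lockstep by one sequencer (SIMD)
    alus_per_pe: int        # ILP: ALUs fed by one VLIW instruction within a PE

    @property
    def total_alus(self) -> int:
        # Every sequencer controls the same number of PEs, and every PE
        # holds the same number of ALUs, so the counts simply multiply.
        return self.sequencers * self.pes_per_sequencer * self.alus_per_pe


def design_points(total_alus: int) -> list[StreamConfig]:
    """Enumerate power-of-two (TLP, DLP, ILP) splits of a fixed ALU budget.

    Assumption: only power-of-two sizes are considered, so a budget that
    is not itself a power of two yields an empty list.
    """
    sizes = [1 << k for k in range(total_alus.bit_length())]
    return [
        StreamConfig(t, c, n)
        for t, c, n in product(sizes, repeat=3)
        if t * c * n == total_alus
    ]


if __name__ == "__main__":
    for cfg in design_points(64):
        print(cfg)
```

Under these assumptions, a configuration with a single sequencer is a pure SIMD-VLIW machine without TLP, while larger values of `sequencers` trade SIMD width for independent threads of control at the same ALU count.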
∗This work was supported, in part, by the Department of Energy ASC Alliances Program, Contract Lawrence Livermore National Laboratory B523583, with Stanford University.
(c) ACM, 2007. This is the author’s version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in the Proceedings of ICS’07 June 18-20, Seattle, WA, USA.
Instead of focusing on a specific design point, we explore the scaling of the architecture along the three axes of parallelism as the number of ALUs is increased, and we use detailed models to measure hardware cost and application performance. We then discuss and simulate the tradeoffs between the added flexibility of multiple control threads, the overhead of synchronization and load balancing between these threads, and features, such as fine-grained communication between PEs, that are only feasible with lockstep execution. Note that our stream architecture does not support multiple execution models as described in [21, 23,