Symbol | Parameter | Value
b | Data width of the architecture (bits) | 64/32
A_SRAM | Area of a single-ported SRAM bit used for SRF and instruction store (grids) | 16
A_sb | Area of a dual-ported stream-buffer bit (grids) | 128
G_SRF | Area overhead of SRF structures relative to SRAM bit | 0.18
w_ALU | Datapath width of a 64-bit ALU (tracks) | 1754
w_nonALU | Datapath width of a non-ALU supporting functional unit (FU) (tracks) | 350
w_LRF | Datapath width of a 64-bit LRF per functional unit (tracks) | 281
h | Datapath height for 64-bit functional units and LRF (tracks) | 2800
G_COMM | COMM units required per ALU | 0.25
G_ITER | ITER units required per ALU | 0.5
G_sb | Capacity of a half stream buffer per ALU (words) | 1
I_0 | Minimal width of VLIW instructions for control (bits) | 64
I_N | Additional width of VLIW instructions per ALU/FU (bits) | 64
L_C | Initial number of cluster SBs | 4
L_N | Additional SBs required per ALU | 1
L_AG | Bandwidth of on-chip memory system (words/cycle) | 8
S_SRF | SRF capacity per ALU (words) | 2048
S_SEQ | Instruction capacity per sequencer group (VLIW words) | 2048
C | Number of clusters | —
N | Number of ALUs per cluster | —
T | Number of sequencers | —

Table 1: Summary of VLSI area parameters (ASIC flow). Tracks are the distance between minimal-pitch metal tracks; grids are the area of a (1 × 1) track block.
has a minimal value L_C required to support transfers between the SRF and the memory system, and a term that scales with N. The capacity of each SB must be large enough to effectively time-multiplex the single SRF port, which must support enough bandwidth to both feed the ALUs (scales as G_sb N) and saturate the memory system (scales as 2L_AG). The SRF bandwidth required per ALU (G_sb) will play a role in the performance of the MATMUL application described in Sec. 6.
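The SB-count and SRF-port-bandwidth relations above can be written out directly with the Table 1 parameters. This is only a sketch of the stated scaling terms; the paper's full sizing equations are not reproduced in this excerpt.

```python
# Stream-buffer (SB) sizing relations from the text, using Table 1 values.
L_C = 4    # initial (minimal) number of cluster SBs
L_N = 1    # additional SBs required per ALU
G_sb = 1   # SRF bandwidth required per ALU (words/cycle)
L_AG = 8   # bandwidth of the on-chip memory system (words/cycle)

def num_cluster_sbs(N):
    """SB count: a minimal value L_C plus a term that scales with N."""
    return L_C + L_N * N

def srf_port_bandwidth(N):
    """Bandwidth the single SRF port must sustain: feeding the ALUs
    (G_sb * N) plus saturating the memory system (2 * L_AG)."""
    return G_sb * N + 2 * L_AG

for N in (1, 2, 4, 8):
    print(N, num_cluster_sbs(N), srf_port_bandwidth(N))
```

For example, a cluster with N = 4 ALUs needs L_C + L_N·4 = 8 SBs, and its SRF port must sustain 4 + 16 = 20 words/cycle under these parameters.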
The intra-cluster switch that connects the ALUs within a cluster and the inter-cluster switch that connects clusters use two-dimensional grid structures to minimize area, delay, and energy, as illustrated in Fig. 2. We assume fully connected switches, but sparser interconnects may also be employed. The intra-cluster switch area scales as N^2, but for small N the cluster area is dominated by the functional units and LRF, which scale with N but have a larger constant factor.
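The quadratic-versus-linear tradeoff can be illustrated numerically. The datapath term below uses the Table 1 widths and height; the switch constant k_sw is a made-up placeholder, since this excerpt gives only the N^2 scaling, not the switch's actual area coefficient.

```python
# Intra-cluster switch area (~ N^2) versus functional-unit + LRF datapath
# area (~ N, large constant), per the scaling argument in the text.
w_ALU, w_LRF, h = 1754, 281, 2800  # Table 1 datapath parameters (tracks)

def datapath_area(N):
    """Linear term: N functional units, each with an ALU and LRF slice (grids)."""
    return N * (w_ALU + w_LRF) * h

def switch_area(N, k_sw=50_000):
    """Quadratic term of a full N x N grid switch; k_sw is hypothetical."""
    return k_sw * N * N

for N in (2, 4, 8, 16):
    print(N, switch_area(N) / datapath_area(N))
```

Because the ratio grows linearly with N, the switch is a small fraction of cluster area at N = 2-4 but becomes a significant overhead as N grows, matching the observation that large-ILP configurations are not competitive.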
The inter-cluster switch can be used for direct communication between the clusters, and also allows the clusters within a sequencer group to share their SRF space using the cross-lane indexed SRF mechanism. We use a full switch to interconnect the clusters within a sequencer group, and its grid organization is dominated by the C^2 scaling term. However, the dependence on the actual area of a cluster, the SRF lane, the SBs, and the number of COMM units required (N_COMM, which scales linearly with N) plays an important role when C is smaller.
As the analysis presented above describes, the total area cost scales linearly with the degree of TLP and is equal to the number of sequencer groups (T) multiplied by the area of a single sequencer group. Therefore, to understand the tradeoffs of ALU organization along the different parallelism axes, we choose a specific number of threads and then evaluate the DLP and ILP dimensions using a heatmap that represents the area of different (C, N) design-space points relative to the area-optimal point. Fig. 3 presents the tradeoff heatmap for the baseline configuration specified in Tab. 1 for both a 32-bit and a 64-bit datapath Stream Processor. The horizontal axis corresponds to the number of clusters in a sequencer group (C), the vertical axis to the number of ALUs within each cluster (N), and the shading represents the relative cost normalized to a single ALU of the area-optimal configuration. In the 64-bit case, the area-optimal point is (8-DLP, 4-ILP), requiring 1.48 × 10^7 grids per ALU accounting for all stream execution elements. This corresponds to 94.1 mm^2 in a 90 nm ASIC process with 64 ALUs in a (2-TLP, 8-DLP, 4-ILP) configuration. For a 32-bit datapath, the optimal point is at (16-DLP, 4-ILP).

Fig. 3 indicates that the area dependence on the ILP dimension (number of ALUs per cluster) is much stronger than on DLP scaling. Configurations with an ILP of 2-4 are roughly equivalent in terms of hardware cost, but further scaling along the ILP axis is not competitive because of the N^2 term of the intra-cluster switch and the increased instruction store capacity required to support wider VLIW instructions. Increasing the number of ALUs by exploiting DLP leads to better scaling. With a 64-bit datapath (Fig. 3(a)), configurations with 2-32 clusters and 4 ALUs are within about 5% of the optimal area. When scaling DLP beyond 32 clusters, the inter-cluster switch area significantly increases the area per ALU. A surprising observation is that even with no DLP, the area overhead of adding a sequencer for every cluster is only about 15% above optimal.
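The heatmap construction just described can be sketched as follows. The per-ALU area model here is a stand-in with made-up constants (k_fu, k_intra, k_inter, k_seq); the paper's detailed area model is not reproduced in this excerpt. Since total area is T times the sequencer-group area, T factors out of the normalized heatmap.

```python
# Sketch: normalize area-per-ALU over a (C, N) grid, as in the Fig. 3 heatmap.
def area_per_alu(C, N, k_fu=1.0, k_intra=0.02, k_inter=0.002, k_seq=2.0):
    """Hypothetical sequencer-group area divided by its C*N ALUs."""
    cluster = k_fu * N + k_intra * N * N            # FUs/LRF linear, intra switch ~ N^2
    group = C * cluster + k_inter * C * C + k_seq   # inter-cluster switch ~ C^2, one sequencer
    return group / (C * N)

points = {(C, N): area_per_alu(C, N)
          for C in (1, 2, 4, 8, 16, 32, 64)
          for N in (1, 2, 4, 8, 16)}
best = min(points, key=points.get)                  # area-optimal (C, N) point
heatmap = {p: points[p] / points[best] for p in points}  # 1.0 at the optimum
```

Each heatmap cell then holds the relative per-ALU cost of that (C, N) configuration, with 1.0 at the area-optimal point, mirroring the shading convention of Fig. 3.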
Both trends change when looking at a 32-bit datapath (Fig. 3(b)). The cost of a 32-bit ALU is significantly lower, increasing the relative cost of the switches and sequencer. As a result, only configurations within the 2-4 ILP and 4-32 DLP range are competitive. Providing a sequencer to each cluster in a 32-bit architecture requires 62% more area than the optimal configuration.
Scaling along the TLP dimension limits direct cluster-to-cluster communication to within a sequencer group, reducing the cost of the inter-cluster switch, which scales as C^2. The inter-cluster switch, however, is used for performance optimizations and is not necessary for the architecture because communica-