Tradeoff between Data-, Instruction-, and Thread-Level Parallelism in Stream Processors

Component                     Symbol
COMMs per cluster             N_COMM
ITERs per cluster             N_ITER
FUs per cluster               N_FU
External cluster ports        P_e
COMM bit width                b_COMM
Sequencer area                A_SEQ
SRF area per cluster          A_SRF
Intra-cluster switch area     A_SW
Cluster area                  A_CL
Inter-cluster switch area     A_COMM
Total area                    A_TOT

[Equation column: illegible in this copy]

Table 2: Summary of VLSI area cost models

[Figure 3: heatmaps of relative area per ALU. Panels: (a) 64-bit datapath, (b) 32-bit datapath. Axes: number of clusters (1 to 128) and number of ALUs per cluster.]

Figure 3: Relative area per ALU normalized to optimal ALU organization with the baseline configuration: (8-DLP, 4-ILP) and (16-DLP, 4-ILP) for 64-bit and 32-bit datapaths respectively.

[Figure 4: heatmaps of relative area per ALU. Panels: (a) 64-bit datapath, (b) 32-bit datapath. Axes: number of clusters (1 to 128) and number of ALUs per cluster.]

Figure 4: Relative area per ALU normalized to optimal ALU organization for configurations with no inter-cluster switch: (128-DLP, 4-ILP).

tion can always be performed through memory. Fig. 4 shows the area overhead heatmaps for configurations with no inter-cluster communication. The area per ALU without a switch improves with the amount of DLP utilized, but all configurations with more than 8 clusters fall within a narrow 5% area overhead range. The relative cost of adding sequencers is larger when no inter-cluster switch is provided because partitioning the control does not reduce the inter-cluster switch area. Thus, the single-cluster sequencer group configurations have an overhead of 27% and 86% for a 64-bit and a 32-bit datapath respectively.

Note that the area-optimal configurations with no inter-cluster switch are 9% and 13% smaller than the area-optimal baseline configurations for 64-bit and 32-bit datapaths respectively.

We evaluate two extreme configurations: no inter-cluster communication and a fully connected switch. Intermediate tradeoff points include lower-bandwidth and sparse interconnect structures, which will result in overhead profiles that lie between the heatmaps shown in Fig. 3 and Fig. 4.
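The normalization behind the heatmaps in Fig. 3 and Fig. 4 can be sketched in a few lines: compute area per ALU for every (clusters, ALUs per cluster) pair, then divide by the smallest value so that the optimal organization reads 1.0. The sketch below is illustrative only. The function `toy_area` and all of its coefficients are hypothetical stand-ins and are not the cost model of Table 2; only the grid values are taken from the figure axes.

```python
from itertools import product

# Grid values read from the figure axes.
CLUSTERS = [1, 2, 4, 8, 16, 32, 64, 128]
ALUS = [2, 4, 8, 16, 32]


def toy_area(clusters, alus_per_cluster, switch=True):
    """Hypothetical total-area function; NOT the paper's Table 2 model.

    All coefficients are made up so that the example has a non-trivial optimum.
    """
    n_alu = clusters * alus_per_cluster
    alu_area = 1.0 * n_alu
    # Assumed: intra-cluster switch cost grows quadratically with cluster width.
    intra_switch = 0.02 * clusters * alus_per_cluster ** 2
    # Assumed: inter-cluster switch cost grows super-linearly with cluster count.
    inter_switch = 0.0
    if switch and clusters > 1:
        inter_switch = 0.005 * clusters ** 1.5 * alus_per_cluster
    # Assumed: a single sequencer of fixed size shared by all clusters.
    sequencer = 20.0
    return alu_area + intra_switch + inter_switch + sequencer


def relative_area_per_alu(area_fn, cluster_counts, alu_counts):
    """Return {(clusters, alus): area per ALU / best area per ALU}."""
    per_alu = {
        (c, n): area_fn(c, n) / (c * n)
        for c, n in product(cluster_counts, alu_counts)
    }
    best = min(per_alu.values())
    return {cfg: value / best for cfg, value in per_alu.items()}


if __name__ == "__main__":
    with_switch = relative_area_per_alu(toy_area, CLUSTERS, ALUS)
    no_switch = relative_area_per_alu(
        lambda c, n: toy_area(c, n, switch=False), CLUSTERS, ALUS
    )
    for name, heat in (("with switch", with_switch), ("no switch", no_switch)):
        best_cfg = min(heat, key=heat.get)
        print(name, "optimal (clusters, ALUs per cluster):", best_cfg)
```

Each map is normalized to its own optimum, matching the per-figure normalization stated in the captions. Comparing absolute areas across the two cases, as in the 9% and 13% figures quoted above, would instead use the unnormalized per-ALU values.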

Fig. 5 shows the sensitivity of the results presented above to the amount of on-chip storage in the sequencer instruction store (Fig. 5(a)) and SRF (Fig. 5(b)). The results are presented
