APU Floating-Point Unit v3.1

There are three ways to discover independent instructions: in hardware, by the compiler, or by the programmer.

Hardware discovery usually requires support for out-of-order execution, which the current implementation of the FPU does not provide. Discovery by the compiler requires an optimizing compiler. The current implementation of gcc can perform some basic instruction scheduling, but does not perform advanced loop transformations. Therefore, to obtain the highest performance, the programmer must sometimes be prepared to rephrase the algorithm.

# Example: Parallelism in a FIR Filter

Consider the following simple piece of code to implement a FIR filter:

    for (j = 0; j < nsamples; j++) {
        x = input++;
        accum = 0.0;
        for (i = 0; i < ntaps; i++)
            accum += coeffs[i] * *x--;
        *output++ = accum;
    }

The value of the accumulator resulting from iteration N of the inner loop will be required again as an input to iteration N+1. This means that execution of the multiply–add instructions cannot be allowed to overlap. To improve matters, it is possible to use two accumulators: one for odd taps and one for even taps. The code will then look similar to this:

    for (j = 0; j < nsamples; j++) {
        x = input++;
        accum1 = accum2 = 0.0;
        for (i = 0; i < ntaps; i += 2) {
            accum1 += coeffs[i] * *x--;
            accum2 += coeffs[i+1] * *x--;
        }
        *output++ = accum1 + accum2;
    }

The two multiply–add operations within the inner loop can now be performed in parallel, and the FPU allows them to proceed through the pipeline together. Performance will be better, but still below the maximum available. Note that the two versions of the code are equivalent only if the number of taps is even: with an odd number of taps, the second version reads one coefficient and one sample beyond the end of the filter. The two versions may also give subtly different results because they perform the additions in a different order.
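The following is a minimal, self-contained sketch of both variants, written so they can be compiled and compared. The function names `fir_single` and `fir_dual`, the use of `float`, and the array-index form of the sample access are assumptions made for illustration; they are not part of the original listing. The sketch also assumes that `input` points into a buffer with at least `ntaps - 1` valid samples before it. Unlike the fragment above, `fir_dual` handles an odd tap count by folding the leftover tap into the first accumulator.

```c
#include <stddef.h>

/* Single accumulator: every multiply-add depends on the previous one,
 * so the operations cannot overlap in the pipeline. */
void fir_single(const float *input, float *output, const float *coeffs,
                size_t nsamples, size_t ntaps)
{
    for (size_t j = 0; j < nsamples; j++) {
        const float *x = input + j;      /* newest sample for output j */
        float accum = 0.0f;
        for (size_t i = 0; i < ntaps; i++)
            accum += coeffs[i] * x[-(ptrdiff_t)i];
        output[j] = accum;
    }
}

/* Two accumulators: even taps go to accum1, odd taps to accum2.
 * The two multiply-adds in the loop body are independent of each other. */
void fir_dual(const float *input, float *output, const float *coeffs,
              size_t nsamples, size_t ntaps)
{
    for (size_t j = 0; j < nsamples; j++) {
        const float *x = input + j;
        float accum1 = 0.0f, accum2 = 0.0f;
        size_t i = 0;
        for (; i + 1 < ntaps; i += 2) {
            accum1 += coeffs[i]     * x[-(ptrdiff_t)i];
            accum2 += coeffs[i + 1] * x[-(ptrdiff_t)(i + 1)];
        }
        if (i < ntaps)                   /* odd tap count: one tap left over */
            accum1 += coeffs[i] * x[-(ptrdiff_t)i];
        output[j] = accum1 + accum2;
    }
}
```

With data whose products and sums are exactly representable, the two functions return identical outputs. With general floating-point data the results can differ in the last bits, because the additions are grouped differently.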

# Optimization of data movement

The parallel-accumulator approach described above can be extended to use a number of accumulators to provide enough independent operations to fill the pipeline. However, there is another problem with this method. Each multiply-add instruction requires two new pieces of data (a coefficient and a sample) to be loaded into floating-point registers, further limiting the number of parallel operations. With two loads for every multiply-add, only one in three of the issued instructions is doing useful work.
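As a sketch of that extension, the inner loop for one output sample can be written with four accumulators. The function name `dot_multi_acc` and the choice of four accumulators are assumptions for illustration only; the count that actually fills the pipeline depends on the FPU configuration and is not derived here. As before, `x_newest` is assumed to point at the newest sample, with at least `ntaps - 1` valid samples before it.

```c
#include <stddef.h>

/* Inner FIR loop for one output sample, using four independent accumulators.
 * Note that each multiply-add still needs two fresh operands from memory
 * (one coefficient, one sample), so the load traffic is unchanged. */
float dot_multi_acc(const float *coeffs, const float *x_newest, size_t ntaps)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;

    /* Main loop: four independent multiply-adds per iteration. */
    for (; i + 4 <= ntaps; i += 4) {
        acc0 += coeffs[i]     * x_newest[-(ptrdiff_t)i];
        acc1 += coeffs[i + 1] * x_newest[-(ptrdiff_t)(i + 1)];
        acc2 += coeffs[i + 2] * x_newest[-(ptrdiff_t)(i + 2)];
        acc3 += coeffs[i + 3] * x_newest[-(ptrdiff_t)(i + 3)];
    }

    /* Remainder: up to three leftover taps go into the first accumulator. */
    for (; i < ntaps; i++)
        acc0 += coeffs[i] * x_newest[-(ptrdiff_t)i];

    /* Combine the partial sums pairwise. */
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Adding accumulators removes the dependency between successive multiply-adds, but it does nothing about the two loads each one requires, which is the limitation described above.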

To remedy this, look at the outer loop as well as the inner loop. If multiple samples can be processed in parallel, then each input sample can be reused once it has been read into the floating-point register file. Consider the equations below, which give the calculations required for 8 samples of an 8-tap FIR filter:

www.xilinx.com

March 11, 2008 Product Specification