APU Floating-Point Unit v3.1
00 = hx
1 -1 + hx
- ... 77 + +hx
01 = hx
10 + hx
- ... 76 + +hx
02 = hx
11 + hx
20 + hx
- ... 75 + +hx
+ . . . + h 7 x 0
+ h 2 x 5
+ h 1 x
= h 0 x 7
The terms hilighted in bold all use the first input sample (index 0), and can all be performed in parallel as they contribute the different accumulators. By allowing computation to proceed diagonal by diago- nal, the reuse of input (and coefficient) values can be greatly improved. Of course, any incomplete diag- onals require special handling. Usually the number of taps will be larger than the number of samples to be computed in parallel, so a three-phase computation strategy (spin up, compute, spin down) can be employed. All of these optimizations can be expressed using standard ANSI C, without recourse to non-portable assembly language.
Understanding the parallelism and data locality inherent in an algorithm is the key to constructing a high-performance implementation. However, keep in mind that optimizations such as loop unrolling and software pipelining will always increase code size. Clearly, the code optimization process is not always a simple one. For a comprehensive treatment of the subject of instruction-level parallelism, refer to Computer Architecture: A Quantitative Approach (reference 3).
The APU FPU is supplied with a demonstration design that contains several applications that are heavily dependent on floating-point. All the applications are graphical, displaying their output on a standard ANSI terminal using RS232. A facility is also provided for profiling the various routines. Full C source code is included.
The applications currently included in the demonstration suite are:
Mandelbrot and Feigenbaum fractal generation These fractals arise from simple iterative conformal mappings of complex and real numbers respectively. The inner loops consist of multiplication, addition and comparison instructions. Some divisions and integer conversions are also performed.
Gaussian random number tester This application generates normally-distributed random numbers by the Box-Muller transformation method, and compares the resulting distribution with a perfectly computed Gaussian bell curve. This involves several add, square root and divide operations, as well as numerous calls to the math library for exponentials and logarithms.
Solution of the traveling salesman problem by simulated annealing The Metropolis algorithm searches for a good (but not provably optimal) solution to the NP-complete Travelling Salesman problem. The path metric routines rely heavily on addition, multiplication and square root to find the Euclidean distance between two points in the itinerary. The cooling schedule requires the calculation of exponentials.
Fast Fourier transform. This application contains a straightforward routine to calculate the Fourier transform of a data series. The inner FFT loop is of the Cooley-Tukey (decimation-in-time) variety. The application generates a variety of real-valued waveforms (noise, square waves, sine waves), computes their Fourier transforms, then plots both the signals and their spectra.
March 11, 2008 Product Specification