APU Floating-Point Unit v3.1
x = sqrtf(x); /* uses single precision */

Array Accesses and Pointer Ambiguity
It is difficult for the compiler to determine whether two memory references (such as array element accesses) refer to the same location. In the absence of better information, the compiler must treat almost all array and pointer accesses as if they conflict. For example, the following code forms the inner loop of a simple Cooley-Tukey FFT algorithm implementation:
tr = ar0*Real[k] - ai0*Imag[k];
ti = ar0*Imag[k] + ai0*Real[k];
Real[k] = Real[j] - tr;  /* A */
Imag[k] = Imag[j] - ti;
Real[j] += tr;           /* B */
Imag[j] += ti;
Because the compiler does not know that Real[k] and Real[j] are never the same element, the addition in statement B cannot start until statement A has finished. This spurious dependency limits the amount of parallelism and slows down the computation. One possible solution is to introduce some temporary variables, and separate the memory accesses from the mathematics, like this:
r_k = Real[k]; i_k = Imag[k];
r_j = Real[j]; i_j = Imag[j];
tr = ar0*r_k - ai0*i_k;
ti = ar0*i_k + ai0*r_k;
r_k = r_j - tr; i_k = i_j - ti;
r_j += tr;      i_j += ti;
Real[j] = r_j; Real[k] = r_k;
Imag[j] = i_j; Imag[k] = i_k;
While this code is less concise, it gives much better results.
Also remember that arrays and pointers can often limit the compiler's ability to allocate variables to registers. If you have small arrays of floating-point values, better performance may be possible if you declare a small number of individual variables instead (i.e. float a0, a1, a2 instead of float a[3]), and unroll any loops that index into them.
Algorithm Optimization Example - FIR Filter
The theoretical peak performance is determined by the maximum issue rate. With the CPU running at 233 MHz, 58 million floating-point instructions can be issued per second. Because the FPU supports fused multiply-add instructions (which perform two floating-point operations per issue), the peak performance figure is 116 MFLOPS. In practice, this figure will not be attainable due to the overhead of load and store instructions, loop control, and stalls caused by data hazards.
It is clear from the data that floating-point operations performed by the FPU have considerably higher latencies than integer operations carried out within the PowerPC core. To make best use of the FPU resources, the pipelined operators must be kept supplied with useful work. For example, consider an algorithm whose inner loop consists solely of single-precision multiply-add instructions. As noted in Table 3, this operation takes 10 CPU clock cycles to execute. With an issue rate of 1/4, three instructions could be issued before the first one has completed. For this to be achievable, there must not be any dependencies between these three instructions. If one instruction depends on the result of another that is still executing, the later instruction must be stalled until the earlier one completes. Such stalls reduce performance.
March 11, 2008 Product Specification