APU Floating-Point Unit v3.1

and square-root operators (if implemented) are not pipelined, so only one divide and one square-root operation can be ongoing at any time. The clock cycle figures shown include the time required to read and write the FP register file.

Table 3: FPU Operator Latencies

| Instruction | FPU Clock Cycles Required |
|---|---|
| Add, Subtract | 6 |
| Multiply | 5 |
| Divide | 17 |
| Square Root | 17 |
| Convert | 6 |
| Fused Multiply-Add/Sub | 10 |
| Move, Abs, Neg, etc. | 2 |
| Compare | 3 |
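As a rough illustration, the Table 3 latencies can be summed to estimate the arithmetic latency of a chain of dependent operations. The sketch below is not part of the core deliverables; the names `fp_op`, `fp_latency`, and `chain_latency` are hypothetical, and the estimate ignores instruction issue and load/store time.

```c
#include <stddef.h>

/* Operator classes in the order they appear in Table 3 (names are illustrative). */
typedef enum {
    FP_ADD_SUB, FP_MUL, FP_DIV, FP_SQRT,
    FP_CONVERT, FP_FMA, FP_MOVE, FP_COMPARE,
    FP_OP_COUNT
} fp_op;

/* Latencies in FPU clock cycles, copied from Table 3. */
static const unsigned fp_latency[FP_OP_COUNT] = {
    6,   /* add, subtract           */
    5,   /* multiply                */
    17,  /* divide                  */
    17,  /* square root             */
    6,   /* convert                 */
    10,  /* fused multiply-add/sub  */
    2,   /* move, abs, neg, etc.    */
    3    /* compare                 */
};

/* Sum of latencies for n operations where each one waits for the previous result. */
unsigned chain_latency(const fp_op *ops, size_t n)
{
    unsigned total = 0;
    size_t i;
    for (i = 0; i < n; ++i)
        total += fp_latency[ops[i]];
    return total;
}
```

For instance, evaluating `sqrt(a*a + b*b)` as a multiply, then a fused multiply-add, then a square root gives an estimate of 5 + 10 + 17 = 32 FPU clock cycles under this simple model.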

The FPU clock runs at half the speed of the CPU clock. In the current implementation, two FPU clock cycles are required to issue a floating-point instruction. Thus, an instruction can be issued every four CPU clock cycles.
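The relationship between the two clock domains can be written down as a small sketch. The function names are hypothetical, and the 400 MHz figure in the usage note is only an assumed example clock, not a specified operating frequency.

```c
enum {
    CPU_CYCLES_PER_FPU_CYCLE = 2, /* FPU clock runs at half the CPU clock       */
    FPU_CYCLES_PER_ISSUE     = 2  /* issue cost in the current implementation   */
};

/* Convert a duration in FPU clock cycles to CPU clock cycles. */
unsigned long fpu_to_cpu_cycles(unsigned long fpu_cycles)
{
    return fpu_cycles * CPU_CYCLES_PER_FPU_CYCLE;
}

/* Upper bound on floating-point instructions issued per second
   for a given CPU clock frequency in Hz. */
unsigned long peak_issue_rate(unsigned long cpu_hz)
{
    return cpu_hz / (CPU_CYCLES_PER_FPU_CYCLE * FPU_CYCLES_PER_ISSUE);
}
```

With an assumed 400 MHz CPU clock, `peak_issue_rate(400000000UL)` gives an upper bound of 100 million floating-point instructions per second; dependencies and memory traffic will reduce the rate achieved in practice.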

The PowerPC architecture does not specify instructions for moving data between CPU registers (GPRs) and floating-point registers (FPRs). All FPU data transfers are therefore between the FPRs and main memory (or data cache, if used). A data load from cache takes at least three FPU clock cycles. Note that the current APU controller cannot process more than one outstanding load instruction, so this latency occurs on each load. Floating-point store operations take four FPU clock cycles (assuming that there is no data dependency on a previous instruction whose result is still outstanding).
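One practical consequence is that any C code which inspects the bit pattern of a floating-point value from integer code implies a round trip through memory. The following sketch assumes a 32-bit IEEE 754 `float`; the helper name `float_bits` is hypothetical, and the exact instructions generated depend on the compiler.

```c
#include <stdint.h>
#include <string.h>

/* Return the raw bit pattern of a single-precision value.
   With no direct FPR-to-GPR move available, a compiler would typically
   implement this as a floating-point store followed by an integer load
   (an assumption about code generation, not a guarantee). */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* well-defined way to reinterpret the bytes */
    return u;
}
```

Because each such transfer pays the store and load costs described above, it may be worth keeping this kind of operation out of inner loops.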

When performing a sequence of multiply-accumulate operations using fused multiply-add instructions, note that the initiation interval is only 6 cycles (time taken for an addition) rather than the full 10 cycles. The FPU achieves this by deferring the addend read in multiply-add instructions until the latest possible point, thus increasing the performance of key DSP algorithms.
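A dot product is a typical example of such a multiply-accumulate chain. The sketch below pairs one with a simple cycle estimate; the function names are hypothetical, the estimate covers arithmetic only (loads and instruction issue are ignored), and whether each loop step becomes a fused multiply-add instruction depends on the compiler and its options.

```c
#include <stddef.h>

enum {
    FMA_LATENCY  = 10, /* Table 3: fused multiply-add/sub                 */
    FMA_INTERVAL = 6   /* initiation interval for chained accumulations   */
};

/* Dot product expressed as a chain of multiply-accumulates into one accumulator. */
float dot(const float *a, const float *b, size_t n)
{
    float acc = 0.0F;
    size_t i;
    for (i = 0; i < n; ++i)
        acc += a[i] * b[i];
    return acc;
}

/* Estimated FPU clock cycles for n dependent multiply-accumulates:
   the first takes the full latency, each later one adds one interval. */
unsigned long mac_chain_cycles(unsigned long n)
{
    return n == 0 ? 0 : FMA_LATENCY + (n - 1) * FMA_INTERVAL;
}
```

Under this model, four chained multiply-accumulates take about 10 + 3 × 6 = 28 FPU clock cycles, compared with 40 if each had to wait for the full 10-cycle latency of its predecessor.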

# C Language Programming

To gain maximum benefit from the FPU without low-level assembly-language programming, it is important to consider how the C compiler will interpret your source code. Very often the same algorithm can be expressed in many different ways, and some are more efficient than others.

# Immediate Constants

Floating-point constants in C are double-precision by default. When using a single-precision FPU, careless coding may result in double-precision software emulation routines being used instead of the native single-precision instructions. To avoid this, explicitly specify (by cast or suffix) that immediate constants in your arithmetic expressions are single-precision values.

For example:

    float x=0.0;
    …
    x += (float)1.0;  /* float addition */
    x += 1.0F;        /* alternative to above */
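The effect of an unsuffixed constant can be checked directly, because the C usual arithmetic conversions give `float * double` the type `double`. The following sketch uses hypothetical function names; both scaling functions return the same value here, but only the second keeps the arithmetic in single precision.

```c
#include <stddef.h>

/* x is promoted to double, multiplied in double precision, then narrowed.
   On a single-precision FPU this may call software emulation routines. */
float scale_promoted(float x)
{
    return x * 0.5;
}

/* The F suffix keeps the whole expression in single precision. */
float scale_single(float x)
{
    return x * 0.5F;
}

/* Size of the type the compiler assigns to each form of the expression. */
size_t size_of_promoted_expr(void)
{
    float x = 1.0F;
    return sizeof(x * 0.5);    /* type is double */
}

size_t size_of_single_expr(void)
{
    float x = 1.0F;
    return sizeof(x * 0.5F);   /* type is float */
}
```

Some compilers can flag these implicit promotions; recent GCC releases, for example, provide the `-Wdouble-promotion` warning, although availability depends on the toolchain version in use.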

March 11, 2008 Product Specification

www.xilinx.com
