Agner wrote:

* There are two execution units for floating point multiplication and for fused multiply-and-add, but only one execution unit for floating point addition. This design appears to be suboptimal since floating point code typically contains more additions than multiplications.

McCalpin's Comments:
0. Definitely agree on the (typical) excess of FP Add over FP multiply operations. The ratios are mostly between 1:1 and 2:1, depending on the application area. I usually assume 1.5:1 in architectural analyses (while keeping in mind that this is a "fuzzy" estimate).
1. Of course one can always expand an FP Add into an FMA to run in the other pipeline. You need a YMM register to hold the dummy "1.0" multiplier values, but in principle it would not be difficult to teach a compiler this trick, along with suitable cost metrics to decide when to employ it.
2. Given the 3-cycle latency of FP Add and the 5-cycle latency of both FP Multiply and FP Fused-Multiply-Add, it seems reasonable to speculate that Intel only wanted to add the extra complexity of an "early out" mechanism on one execution port (Port 1). With no need to change the latency, it is trivial to support an isolated Multiply on either Multiply-Add pipeline. Also note that the other FP execution port (Port 0) is already burdened with the logic for FP divides, which is fairly extensive.

On a related note, the latency and throughput numbers for FP divide on various Intel processors suggest that 128-bit FP divide operations perform two 64-bit divides in parallel (presumably taking the same number of iterative steps on both values, even if one of them could have an "early out"). For AVX on Sandy Bridge, Ivy Bridge, and Haswell, the reciprocal throughput of the 256-bit FP divide instructions is twice that of the 128-bit FP divide instructions. This suggests that only one 128-bit "lane" of the FP unit on Port 0 actually supports FP division, and that 256-bit FP divides are performed internally as a sequence of two 128-bit (2-way parallel) FP divide operations. You show this in the instruction tables as 1 uop on Port 0 for 128-bit FP divide and 2 uops on Port 0 for 256-bit FP divide, but I had not seen anyone comment specifically on the absence of FP divide throughput speedup with AVX before, so I thought I would bring it up.

Considering this lack of speedup with 256-bit AVX makes one wonder whether the 512-bit FP divide instruction in AVX-512 will support higher throughput, or whether Intel will leave the HW implementation where it is and emphasize the SW-pipelined approach (currently used by Xeon Phi, for example). By the time you get to 8-element vectors, the SW approach is almost certainly faster if you don't have to reach full 0.5 ulp precision.