Agner's CPU blog


FP pipelines on Intel's Haswell core
Author: John D. McCalpin  Date: 2014-10-17 09:19
Agner wrote:

* There are two execution units for floating point multiplication and for fused multiply-and-add, but only one execution unit for floating point addition. This design appears to be suboptimal since floating point code typically contains more additions than multiplications.

McCalpin's Comments:
0. I definitely agree on the (typical) excess of FP Add over FP Multiply operations. The Add:Multiply ratios are mostly between 1:1 and 2:1, depending on the application area. I usually assume 1.5:1 in architectural analyses (while keeping in mind that this is a "fuzzy" estimate).
1. Of course one can always expand an FP Add into an FMA with a multiplier of 1.0 so that it can run in the other pipeline. You need a YMM register to hold the dummy "1.0" multiplier values, but in principle it would not be difficult to teach a compiler this trick, along with suitable cost metrics to decide when to employ it.
2. Given the 3-cycle latency of FP Add and the 5-cycle latency of both FP Multiply and FP Fused-Multiply-Add, it seems reasonable to speculate that Intel only wanted to add the extra complexity of an "early out" mechanism on one execution port (Port 1). With no need to change the latency, it is trivial to support an isolated Multiply on either Multiply-Add pipeline. Also note that the other FP execution port (Port 0) is already burdened with the logic for FP divides, which is fairly extensive.
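To make points 1 and 2 a bit more concrete, here is a rough Python sketch of the port model described above. It assumes one uop per port per cycle, FP Add only on Port 1, and Multiply/FMA on Ports 0 and 1, and it uses the 3-cycle and 5-cycle latencies quoted above. The function names are made up for illustration, and the model ignores everything else that competes for these ports.

```python
import math

ADD_LATENCY = 3  # cycles for FP Add, as quoted in the post
FMA_LATENCY = 5  # cycles for FP Multiply and FMA, as quoted in the post


def add_as_fma(a, b):
    """Numeric model of the trick: a + b computed as fma(a, 1.0, b).

    Multiplying by 1.0 is exact, so the result matches a plain add.
    """
    return a * 1.0 + b


def cycles_adds_on_port1_only(adds, muls):
    """Throughput bound when every add must issue on Port 1.

    Multiplies can go to either port, so they fill Port 0 first.
    """
    return max(adds, math.ceil((adds + muls) / 2))


def cycles_adds_as_fma(adds, muls):
    """Throughput bound when adds may also issue as FMAs on Port 0."""
    return math.ceil((adds + muls) / 2)


def dependent_chain_cycles(length, latency):
    """Latency bound for a chain of `length` dependent operations."""
    return length * latency


if __name__ == "__main__":
    # 1.5:1 Add:Multiply mix, e.g. 6 adds and 4 multiplies per iteration.
    print(cycles_adds_on_port1_only(6, 4), cycles_adds_as_fma(6, 4))
    # A 10-long dependent add chain, as plain adds and as FMAs.
    print(dependent_chain_cycles(10, ADD_LATENCY),
          dependent_chain_cycles(10, FMA_LATENCY))
```

In this simplified model the 6-add, 4-multiply mix drops from 6 cycles to 5 when adds may use both ports, while a chain of dependent adds gets slower (5 instead of 3 cycles per step). That is the sort of trade-off a compiler cost metric would have to weigh.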

On a related note, the latency and throughput numbers for FP divide on various Intel processors suggest that a 128-bit FP divide performs two 64-bit divides in parallel (presumably taking the same number of iterative steps on both values, even if one of them could take an "early out"). For AVX on Sandy Bridge, Ivy Bridge, and Haswell, the reciprocal throughput of the 256-bit FP divide instructions is twice that of the 128-bit FP divide instructions. This suggests that only one 128-bit "lane" of the FP unit on Port 0 actually supports FP division, and that a 256-bit FP divide is performed internally as a sequence of two 128-bit (2-way parallel) FP divide operations. You show this in the instruction tables as 1 uop on Port 0 for the 128-bit FP divide and 2 uops on Port 0 for the 256-bit divide, but I had not seen anyone comment specifically on the absence of an FP divide throughput speedup with AVX before, so I thought I would bring it up.
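The arithmetic behind this observation is simple enough to sketch in Python. The reciprocal throughput passed in below is a placeholder, not a measured number; the only assumption taken from the text is that a 256-bit divide costs two 128-bit divide uops on Port 0. The function name and parameters are hypothetical.

```python
def divides_per_cycle(vector_bits, recip_throughput_128, element_bits=64):
    """Element divides completed per cycle for a given vector width.

    Assumes the divider handles one 128-bit lane at a time, so a wider
    vector is split into vector_bits / 128 back-to-back uops.
    """
    uops = max(1, vector_bits // 128)
    recip_throughput = recip_throughput_128 * uops  # cycles per instruction
    elements = vector_bits // element_bits
    return elements / recip_throughput


if __name__ == "__main__":
    placeholder = 10.0  # hypothetical cycles per 128-bit divide, not measured
    for bits in (128, 256):
        print(bits, divides_per_cycle(bits, placeholder))
```

Whatever value is plugged in for the 128-bit reciprocal throughput, the per-element rate comes out the same for 128-bit and 256-bit vectors under this assumption, which is just another way of stating the missing speedup.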

This lack of speedup with 256-bit AVX makes one wonder whether the 512-bit FP divide instruction in AVX-512 will support higher throughput, or whether Intel will leave the HW implementation where it is and emphasize the SW-pipelined approach (currently used by Xeon Phi, for example). By the time you get to 8-element vectors, the SW approach is almost certainly faster if you don't have to reach full 0.5 ulp precision.
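For readers who have not seen a software divide, the sketch below shows one common form of it: a low-precision reciprocal estimate refined by Newton-Raphson steps that use only multiplies and multiply-adds. This is an illustration of the general idea, not the actual Xeon Phi code sequence; the 12-bit starting estimate, the helper names, and the iteration counts are all assumptions.

```python
import math


def approx_reciprocal(d, bits=12):
    """Model a low-precision hardware reciprocal estimate.

    Rounds the mantissa of 1/d to `bits` bits (hypothetical accuracy).
    """
    mantissa, exponent = math.frexp(1.0 / d)
    scale = 1 << bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)


def nr_divide(a, d, iterations=2):
    """Approximate a / d without a divide in the refinement loop.

    Each step x = x * (2 - d * x) roughly squares the relative error,
    and maps onto one multiply-add plus one multiply.
    """
    x = approx_reciprocal(d)
    for _ in range(iterations):
        x = x * (2.0 - d * x)
    return a * x


def relative_error(approx, exact):
    return abs(approx - exact) / abs(exact)


if __name__ == "__main__":
    for n in range(4):
        print(n, relative_error(nr_divide(1.0, 3.0, n), 1.0 / 3.0))
```

Because every step is an independent multiply or multiply-add per element, the sequence pipelines well across a wide vector. The result is close to, but not guaranteed to be, correctly rounded, which is the precision caveat mentioned above.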

 