Agner's CPU blog


 
 
Test results for Broadwell and Skylake
Author: Agner Date: 2015-12-26 08:27
The optimization manuals at www.agner.org/optimize/#manuals have been updated. I have now tested the Intel Broadwell and Skylake processors. I have not tested the AMD Excavator and Puma because I cannot find suitable motherboards for testing them.

The test results show that the pipeline and execution units in Broadwell are very similar to those of its predecessor, Haswell, while the Skylake has been reorganized a little.

The Skylake has a somewhat improved cache throughput and supports the new DDR4 RAM. This is important since RAM access is the bottleneck in many applications. On the other hand, the Skylake has reduced the level-2 cache associativity from 8 to 4.

Floating point division has been improved a little in Broadwell and integer division has been improved a little in Skylake. Gather instructions, which are used for collecting non-contiguous data from memory and joining them into a vector register, are improved somewhat in Broadwell, and a little more in Skylake. This makes it more efficient to collect data into vector registers.

Ever since the first Intel processor with out-of-order execution was released in 1995, there has been a limitation that no micro-operation could have more than two input dependencies. This meant that instructions with more than two input dependencies were split into two or more micro-operations. The introduction of fused multiply-and-add (FMA) instructions in Haswell made it necessary to overcome this limitation. Thus, the FMA instructions were the first instructions in an Intel processor to be implemented as micro-operations with three input dependencies. Once this limitation had been broken, the new capability could also be applied to other instructions. The Broadwell has extended the capability for three-input micro-operations to add-with-carry, subtract-with-borrow and conditional move instructions. The Skylake has extended it further to a blend instruction. AMD processors have never had this limitation of two input dependencies. Perhaps this is the reason why AMD came before Intel with FMA instructions.
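To illustrate, here is a minimal compilable sketch (my own example, using the standard Intel intrinsics) of an FMA instruction, which is executed as a single micro-operation with three register inputs:

/* fma_demo.c - compile with: gcc -O2 -mfma -c fma_demo.c */
#include <immintrin.h>

__m256 fma_demo(__m256 a, __m256 b, __m256 c)
{
    /* one vfmadd micro-operation reading all three registers */
    return _mm256_fmadd_ps(a, b, c);   /* returns a*b + c */
}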

The Haswell and Broadwell have two execution units for floating point multiplication and FMA, but only one for addition. This is odd since most floating point code has more additions than multiplications. To get the maximum floating point throughput on these processors, one might have to replace some additions with FMA instructions with a multiplier of 1. Fortunately, the Skylake has fixed this imbalance and has two floating point arithmetic units, both of which can handle addition, multiplication and FMA. This gives a maximum throughput of two floating point vector operations per clock cycle.
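A hedged sketch of this workaround (my own illustration, not code from the manuals): a vector sum on Haswell/Broadwell can be written with FMA and a multiplier of 1, so that the additions can go to either of the two FMA units on ports 0 and 1 instead of the single FP add unit:

/* sum_fma.c - compile with: gcc -O2 -mfma -c sum_fma.c */
#include <immintrin.h>
#include <stddef.h>

__m256 sum_fma(const float *p, size_t n)   /* n assumed to be a multiple of 16 */
{
    const __m256 ones = _mm256_set1_ps(1.0f);
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    for (size_t i = 0; i + 16 <= n; i += 16) {
        /* x*1.0 + acc can execute on either FMA unit */
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(p + i),     ones, acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(p + i + 8), ones, acc1);
    }
    return _mm256_add_ps(acc0, acc1);
}

Note that this trades the 3-cycle addition latency for the 5-cycle FMA latency on Haswell, so the trick only pays off when there are enough independent accumulators to hide the latency.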

The Skylake has increased the number of execution units for integer vector arithmetic from two to three. In general, the Skylake now has multiple execution units for almost all common operations (except memory write and data permutations). This means that an instruction or micro-operation rarely has to wait for a vacant execution unit. A throughput of four instructions per clock cycle is now a realistic goal for CPU-intensive code, unless the software contains long dependency chains. All arithmetic and logic units support vectors of up to 256 bits. The anticipated support for 512-bit vectors with the AVX-512 instruction set has been postponed to 2016 or 2017.

Intel's design has traditionally tried to standardize operation latencies, i.e. the number of clock cycles that a micro-operation takes. Operations with the same latency were organized under the same execution port in order to avoid the clash that occurs when operations starting at different times finish at the same time and need the result bus simultaneously. The Skylake microarchitecture has been improved to allow operations with several different latencies under the same execution port. There is still some standardization of latencies left, though. All floating point additions, multiplications and FMA operations have a latency of 4 clock cycles on Skylake, while these latencies were 3 and 5 on previous processors.
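These latencies are easy to verify with a dependency chain. A minimal sketch (my own test idea, not my actual test program):

/* fma_latency.c - compile with: gcc -O2 -mfma fma_latency.c */
#include <immintrin.h>
#include <x86intrin.h>   /* __rdtsc */
#include <stdio.h>

int main(void)
{
    enum { N = 100000000 };
    __m256 x = _mm256_set1_ps(1.0f);
    const __m256 a = _mm256_set1_ps(0.999f), b = _mm256_set1_ps(0.001f);
    unsigned long long t0 = __rdtsc();
    for (long i = 0; i < N; i++)
        x = _mm256_fmadd_ps(x, a, b);   /* each FMA depends on the previous */
    unsigned long long t1 = __rdtsc();
    volatile float keep = _mm256_cvtss_f32(x);   /* keep the chain alive */
    (void)keep;
    /* __rdtsc counts reference cycles; fix the clock frequency or use
       performance counters for exact core clock cycles */
    printf("%.2f cycles per FMA\n", (double)(t1 - t0) / N);
    return 0;
}

With a serial dependency through x, the loop should run at one FMA latency per iteration: about 4 cycles on Skylake and 5 on Haswell.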

Store forwarding, that is, reading from a memory address immediately after writing to the same address, is one clock cycle faster on Skylake than on previous processors.
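A hedged sketch of how this can be measured (my illustration): a loop that stores a value and immediately reloads it forms a dependency chain whose time per iteration is dominated by the store-forwarding latency.

/* store_forward.c - compile with: gcc -O2 store_forward.c */
#include <x86intrin.h>   /* __rdtsc */
#include <stdio.h>

int main(void)
{
    enum { N = 100000000 };
    volatile long mem = 0;   /* volatile forces a real store and reload */
    long x = 0;
    unsigned long long t0 = __rdtsc();
    for (long i = 0; i < N; i++) {
        mem = x;         /* store */
        x = mem + 1;     /* immediate reload: store-to-load forwarding */
    }
    unsigned long long t1 = __rdtsc();
    printf("%.2f cycles per store+reload+add\n", (double)(t1 - t0) / N);
    return 0;
}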

Previous Intel processors have different states for code that uses the AVX instruction set with 256-bit vectors versus legacy code with 128-bit vectors and no VEX prefixes. The Sandy Bridge, Ivy Bridge, Haswell and Broadwell processors all have these states and a serious penalty of 70 clock cycles for the state switch when a piece of code accidentally mixes VEX and non-VEX instructions. This annoying state shift and penalty has been eliminated on the Skylake. Apparently, the implementation of the 256-bit registers has become more streamlined.
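For code that must still run on the older processors, the usual remedy applies. A minimal sketch (the callee name is hypothetical):

/* vex_transition.c - compile with: gcc -O2 -mavx -c vex_transition.c */
#include <immintrin.h>

extern void legacy_sse_function(float *p);   /* hypothetical non-VEX code */

void process(float *p)
{
    __m256 v = _mm256_loadu_ps(p);
    _mm256_storeu_ps(p, _mm256_add_ps(v, v));
    _mm256_zeroupper();        /* VZEROUPPER leaves the dirty-upper state */
    legacy_sse_function(p);    /* non-VEX 128-bit code, now penalty-free */
}

On Sandy Bridge through Broadwell the VZEROUPPER avoids the 70 clock cycle state switch; on Skylake it is apparently no longer needed for that purpose.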

I observed an interesting phenomenon when executing 256-bit vector instructions on the Skylake. There is a warm-up period of approximately 14 µs before it can execute 256-bit vector instructions at full speed. Apparently, the upper 128-bit half of the execution units and data buses is turned off in order to save power when it is not used. As soon as the processor sees a 256-bit instruction, it starts to power up the upper half. It can still execute 256-bit instructions during the warm-up period, but it does so by using the lower 128-bit units twice for every 256-bit vector. The result is that the throughput for 256-bit vectors is 4-5 times lower during this warm-up period. If you know in advance that you will need 256-bit instructions soon, you can start the warm-up process by placing a dummy 256-bit instruction at a strategic place in the code. My measurements showed that the upper half of the units is shut down again after 675 µs of inactivity.
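A hedged sketch of such a warm-up trigger (my illustration; the setup function is hypothetical):

/* warmup.c - compile with: gcc -O2 -mavx -c warmup.c */
#include <immintrin.h>
#include <stddef.h>

extern void scalar_setup_work(void);   /* hypothetical: takes >= 14 us */

void compute(float *data, size_t n)
{
    /* dummy 256-bit operation: starts powering up the upper 128-bit
       lanes; volatile keeps the compiler from removing it */
    volatile __m256 warm = _mm256_add_ps(_mm256_set1_ps(1.0f),
                                         _mm256_set1_ps(1.0f));
    (void)warm;

    scalar_setup_work();   /* the warm-up completes while this runs */

    for (size_t i = 0; i + 8 <= n; i += 8) {   /* hot 256-bit loop */
        __m256 v = _mm256_loadu_ps(data + i);
        _mm256_storeu_ps(data + i, _mm256_add_ps(v, v));
    }
}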

This warm-up phenomenon has reportedly been observed in previous processors as well (see agner.org/optimize/blog/read.php?i=378#378), but I have not observed it before in any of the processors that I have tested. Perhaps some high-end versions of Intel processors have this ability to shut down the upper 128-bit lane in order to save power, while other variants of the same processors have no such feature. This is something that needs further investigation.

   
Sustained 64B loads per cycle on Haswell & Sky
Author: Nathan Kurz Date: 2015-12-26 18:03
Hi Agner --

Great to see the updates for Skylake! Thanks for putting all the effort into making these. Your guides are tremendous resources.

You mention in your guides that, for Haswell and Skylake, "There are two identical memory read ports (port 2 and 3) and one write port (port 4). These ports all have the full 256 bits width. This makes it possible to make two memory reads and one memory write per clock cycle, with any register size up to 256 bits." You also say that cache bank conflicts should no longer be a problem: "It is always possible to do two cache reads in the same clock cycle without causing a cache bank conflict."

Do you have code that demonstrates this? Even without writes, I'm currently unable to create code that can sustain 2 256-bit loads per cycle from L1D. I started with code that used a fused multiply-add, but then realized that I was being slowed down by the loads rather than the math. I'm also seeing timing effects that make me suspect that some sort of bank conflict must be occurring, since some orderings of loads from L1 are consistently faster than others. I've put my current test code up here: https://gist.github.com/nkurz/9a0ed5a9a6e591019b8e
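In sketch form, the calc_fma kernel does essentially this (simplified; the gist has the actual code):

#include <immintrin.h>
#include <stddef.h>

/* 4 accumulators, 8 loads and 4 FMAs per iteration;
   n assumed to be a multiple of 32 */
__m256 calc_fma_sketch(const float *a, const float *b, size_t n)
{
    __m256 c0 = _mm256_setzero_ps(), c1 = c0, c2 = c0, c3 = c0;
    for (size_t i = 0; i + 32 <= n; i += 32) {
        c0 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i),    _mm256_loadu_ps(b+i),    c0);
        c1 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+8),  _mm256_loadu_ps(b+i+8),  c1);
        c2 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+16), _mm256_loadu_ps(b+i+16), c2);
        c3 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+24), _mm256_loadu_ps(b+i+24), c3);
    }
    return _mm256_add_ps(_mm256_add_ps(c0, c1), _mm256_add_ps(c2, c3));
}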

When compiled with "gcc -fno-inline -std=gnu99 -Wall -O3 -g -march=native l1d.c -o l1d", results look like this on Haswell:
Testing with SIZE=4096...
calc_simple(array1, array2, size): 5.01 cycles/input
calc_fma(array1, array2, size): 0.22 cycles/input
calc_fma_reordered(array1, array2, size): 0.20 cycles/input
calc_load_only(array1, array2, size): 0.21 cycles/input [ERROR]
calc_load_only_reordered(array1, array2, size): 0.18 cycles/input [ERROR]

And like this on Skylake:
Testing with SIZE=4096...
calc_simple(array1, array2, size): 4.02 cycles/input
calc_fma(array1, array2, size): 0.20 cycles/input
calc_fma_reordered(array1, array2, size): 0.17 cycles/input
calc_load_only(array1, array2, size): 0.20 cycles/input [ERROR]
calc_load_only_reordered(array1, array2, size): 0.17 cycles/input [ERROR]


calc_simple() shows that the latency of an FMA on Haswell is 5 cycles, while it's only 4 cycles on Skylake. It's a simple approach in that there is no unrolling, so we are latency limited. So far, so good.

calc_fma() shows a straightforward approach of loading 4 YMM vectors of floats, and then multiplying them by another 4 YMM vectors of floats, using 4 separate accumulators. Results are slightly slower on Haswell than on Skylake, presumably because 4-way unrolling is not enough to hide the 5 cycle latency of the FMA on Haswell.

calc_fma_reordered() is the first surprise. This is the same as calc_fma(), but loads the vectors in a different order: +96, +32, +64, +0 instead of the in-order byte offsets of +0, +32, +64, +96. I haven't seen any theory that would explain why there would be a difference in speed for these two orders.
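In sketch form, the only change in calc_fma_reordered is the order of the loads within each 128-byte block (again simplified; the gist is authoritative):

#include <immintrin.h>
#include <stddef.h>

__m256 calc_fma_reordered_sketch(const float *a, const float *b, size_t n)
{
    __m256 c0 = _mm256_setzero_ps(), c1 = c0, c2 = c0, c3 = c0;
    for (size_t i = 0; i + 32 <= n; i += 32) {
        /* byte offsets +96, +32, +64, +0 instead of +0, +32, +64, +96 */
        c3 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+24), _mm256_loadu_ps(b+i+24), c3);
        c1 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+8),  _mm256_loadu_ps(b+i+8),  c1);
        c2 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+16), _mm256_loadu_ps(b+i+16), c2);
        c0 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i),    _mm256_loadu_ps(b+i),    c0);
    }
    return _mm256_add_ps(_mm256_add_ps(c0, c1), _mm256_add_ps(c2, c3));
}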

calc_load_only() is the next surprise. I dropped the FMA altogether, and just did the loads. We get a slight speed up on Haswell (agreeing with the FMA latency), but no speed up on Skylake. Since there is nothing in the loop but the loads, if we can execute 2 32B loads per cycle, I would have expected to see .125 cycles per input. The [ERROR] on the line is expected, and is because we are not actually calculating the sum.

calc_load_only_reordered() continues the surprise. Once again, reading the vectors in non-linear order improves the speed considerably. But the speed is still much less than the .125 that we would see for the theoretical 2 loads per cycle. Again, [ERROR] is expected because there is no math being done.

Do you have any idea what's happening here? Why would the ordering of the loads matter if all the results are in L1D? Why can't I get to .125 cycles per float? I've inspected the results with 'perf record -F 10000 ./l1d' / 'perf report' on both machines, and the assembly looks like I'd expect. I can make the loop logic slightly better, but this doesn't seem to be the limiting factor. What do I need to do differently to achieve sustained load speeds of 64B per cycle on Haswell and Skylake?

   
Sustained 64B loads per cycle on Haswell & Sky
Author: Agner Date: 2015-12-27 01:48
Nathan Kurz wrote:
reading the vectors in non-linear order improves the speed considerably. But the speed is still much less than .125 that we would see for the theoretical 2 loads per cycle.
It is possible to make two reads and one write in the same clock cycle, but it is not possible to obtain a continuous throughput at this theoretical maximum. You are always limited by cache ways, read/write buffers, faulty prefetching, suboptimal reordering, etc. The write operations may sometimes use port 2 or 3 for address calculation, where the maximum throughput requires that they use port 7. It is quite likely that there are other effects that I am not aware of. The execution times that I have measured for 2 reads and 1 write fluctuate a lot, and are typically 40-60% longer than the theoretical minimum.
   
Sustained 64B loads per cycle on Haswell & Sky
Author: Nathan Kurz Date: 2015-12-27 18:59
Agner wrote:
You are always limited by cache ways, read/write buffers, faulty prefetching, suboptimal reordering, etc.
Yes, although in my example I'm considering the much simpler case where there are two reads but no writes, and all data is already in L1. So although problematic in the real world, these shouldn't be a factor here. In fact, I see the same maximum speed if I read the same 4 vectors over and over rather than striding over all the data. I've refined my example, though, and think I now understand what's happening. The problem isn't a bank conflict; rather, it's a slowdown due to unaligned access. I don't think I've seen this discussed before.

Contrary to my previous understanding, alignment makes a big difference to the speed at which vectors are read from L1 into registers. If your data is 16B aligned rather than 32B aligned, a sequential read from L1 is no faster with 256-bit YMM reads than it is with 128-bit XMM reads. VMOVAPS and VMOVUPS have the same speed, but you cannot achieve 2 32B loads per cycle if the underlying data is not 32B aligned. If the data is 32B aligned, you still can't quite sustain 64 B/cycle of load with either, but you can get to about 54 B/cycle with both.
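In other words, for full-speed YMM loads the buffers need to be 32B aligned. A minimal sketch of one way to guarantee that (aligned_alloc is C11; _mm_malloc works too):

#include <immintrin.h>
#include <stdlib.h>

float *alloc_aligned_floats(size_t n)
{
    /* aligned_alloc requires the size to be a multiple of the alignment */
    size_t bytes = ((n * sizeof(float) + 31) / 32) * 32;
    return (float *)aligned_alloc(32, bytes);
}

/* with a 32B-aligned p, VMOVAPS and VMOVUPS both run at full speed */
__m256 first_vector(const float *p) { return _mm256_load_ps(p); }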

I put up new test code here: https://gist.github.com/nkurz/439ca1044e11181c1089

Results at L1 sizes are essentially the same on Haswell and Skylake.

Loading 4096 floats with 64 byte raw alignment
Vector alignment 8:
load_xmm : 19.79 bytes/cycle
load_xmm_nonsequential : 23.41 bytes/cycle
load_ymm : 28.64 bytes/cycle
load_ymm_nonsequential : 36.57 bytes/cycle

Vector alignment 16:
load_xmm : 29.26 bytes/cycle
load_xmm_nonsequential : 29.05 bytes/cycle
load_ymm : 28.44 bytes/cycle
load_ymm_nonsequential : 36.90 bytes/cycle

Vector alignment 24:
load_xmm : 19.79 bytes/cycle
load_xmm_nonsequential : 23.54 bytes/cycle
load_ymm : 28.64 bytes/cycle
load_ymm_nonsequential : 36.57 bytes/cycle

Vector alignment 32:
load_xmm : 29.05 bytes/cycle
load_xmm_nonsequential : 28.85 bytes/cycle
load_ymm : 53.19 bytes/cycle
load_ymm_nonsequential : 52.51 bytes/cycle

What this says is that unless your loads are 32B aligned, regardless of method you are limited to about 40B loaded per cycle. If you are sequentially loading non-32B aligned data from L1, the speeds for 16B loads and 32B loads are identical, and limited to less than 32B per cycle. All alignments not shown were the same as 8B alignment.

Loading in a non-sequential order is about 20% faster for unaligned XMM and unaligned YMM loads. It's possible there is a faster order than I have found so far. Aligned loads are the same speed regardless of order. Maximum speed for aligned XMM loads is about 30 B/cycle, and maximum speed for aligned YMM loads is about 54 B/cycle.

At L2 sizes, the effect still exists, but is less extreme. XMM loads are limited to 13-15 B/cycle on both Haswell and Skylake. On Haswell, YMM non-aligned loads are 18-20 B/cycle, and YMM aligned loads are 24-26 B/cycle. On Skylake, YMM aligned loads are slightly faster at 27 B/cycle. Interestingly, sequential unaligned L2 loads on Skylake are almost the same as aligned loads (26 B/cycle), while non-sequential loads are much slower (17 B/cycle).

At L3 sizes, alignment is barely a factor. On Haswell, all loads are limited to 11-13 B/cycle. On Skylake, XMM loads are the same 11-13 B/cycle, while YMM loads are slightly faster at 14-17 B/cycle.

Coming from memory, XMM and YMM loads on Haswell are the same regardless of alignment, at about 5 B/cycle. On Skylake, XMM loads are about 6.25 B/cycle, and YMM loads are about 6.75 B/cycle, with little dependence on alignment. It's possible that prefetch can improve these speeds slightly.

Agner wrote:
The write operations may sometimes use port 2 or 3 for address calculation, where the maximum throughput requires that they use port 7.
I don't recall if you mention it in your manuals, but I presume you are aware that Port 7 on Haswell and Skylake is only capable of "simple" address calculations? Thus sustaining 2 loads and a store per cycle is only possible if the store address is in [const + base] form rather than [const + index*scale + base] form. And as you point out, even if you do this, it can still be difficult to force the processor to use only Port 7 for the store address.
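A sketch of the kind of loop structure that at least gives the compiler the chance to use the simple form (whether it actually does needs checking in the generated asm):

#include <immintrin.h>
#include <stddef.h>

void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    /* bump the pointers instead of indexing, so the store address can be
       a plain [reg] or [reg + disp] form that Port 7's simple AGU handles */
    for (float *end = dst + (n & ~(size_t)7); dst < end; dst += 8, a += 8, b += 8) {
        __m256 v = _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b));
        _mm256_storeu_ps(dst, v);
    }
}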
   
Test results for Broadwell and Skylake
Author: Peter Cordes Date: 2015-12-28 06:19
Thanks for your excellent work on the instruction tables and microarchitecture guide.

Agner wrote:

This annoying state shift and penalty has been eliminated on the Skylake. Apparently, the implementation of 256-bit registers has become more streamlined.

I wonder if the performance penalty has been replaced with a power-consumption penalty. Perhaps there's still a "state C" where Skylake uses more power? The performance penalty on the earlier CPUs ensures that most software will keep avoiding the mixture anyway. Still, I don't think this is very likely; probably they came up with some clever way to avoid penalties, except maybe when forwarding results from a non-VEX op to a 256b op (over the bypass network).

Do 128b non-VEX ops have a "false" dependency on the upper 128 bits of a register? Is there a latency penalty when a 256b insn reads a ymm register last written by a non-VEX insn (or an extra uop to merge the xmm into the ymm)?

More importantly, is VZEROUPPER helpful in any way on Skylake? (Obviously, omitting it would be a bad idea for binaries that might be run on older CPUs.)

There is one use-case for mixing VEX and non-VEX: PBLENDVB x,x,xmm0 is 1 uop, p015. VPBLENDVB v,v,v,v is 2 uops, 2p015, and 2c latency. I'm picturing a function that needs to do a lot of blends, but can also benefit from using 3-operand non-destructive VEX insns for everything except the non-VEX PBLENDVB.

Also: I remember reading something in a realworldtech forum thread about wider uop fetch in Skylake. (The forum isn't searchable, so I probably can't find it now.) Is there any improvement in the frontend for loops that don't fit in the loop buffer? I was hoping Skylake would fetch whole uop cache lines (up to 6 uops) per clock, and put them into a small buffer to issue 4 fused-domain uops per clock more consistently.

I've considered trying to align / re-order insns for uop-cache throughput in a loop that didn't quite fit in the loop buffer. I saw performance differences (on SnB) from reordering, but I never went beyond trial and error. I don't have an editor that shows the assembled binary updated on the fly as source edits are made, let alone with 32B boundaries marked and uops grouped into cache lines, so it would have been very time consuming.