Agner's CPU blog


 
thread Test results for Broadwell and Skylake - Agner - 2015-12-26
replythread Sustained 64B loads per cycle on Haswell & Sky - Nathan Kurz - 2015-12-26
last replythread Sustained 64B loads per cycle on Haswell & Sky - Agner - 2015-12-27
last replythread Sustained 64B loads per cycle on Haswell & Sky - Nathan Kurz - 2015-12-27
last reply Sustained 64B loads per cycle on Haswell & Sky - John D. McCalpin - 2016-01-04
replythread Test results for Broadwell and Skylake - Peter Cordes - 2015-12-28
last reply Test results for Broadwell and Skylake - Agner - 2015-12-29
replythread Test results for Broadwell and Skylake - Tacit Murky - 2016-01-04
last replythread Test results for Broadwell and Skylake - Agner - 2016-01-05
last reply Test results for Broadwell and Skylake - Tacit Murky - 2016-03-09
replythread Minor bug in the microarchitecture manual - SHK - 2016-01-10
last reply Minor bug in the microarchitecture manual - Agner - 2016-01-16
replythread Test results for Broadwell and Skylake - John D. McCalpin - 2016-01-12
last replythread Test results for Broadwell and Skylake - Jess - 2016-02-11
last reply Description of discrepancy - Nathan Kurz - 2016-03-13
reply Test results for Broadwell and Skylake - Russell Van Zandt - 2016-02-22
last replythread Instruction Throughput on Skylake - Nathan Kurz - 2016-04-23
last replythread Instruction Throughput on Skylake - Agner - 2016-04-24
last replythread Instruction Throughput on Skylake - Nathan Kurz - 2016-04-26
last reply Instruction Throughput on Skylake - Agner - 2016-04-27
 
Test results for Broadwell and Skylake
Author: Agner Date: 2015-12-26 08:27
The optimization manuals at www.agner.org/optimize/#manuals have been updated. I have now tested the Intel Broadwell and Skylake processors. I have not tested the AMD Excavator and Puma because I cannot find suitable motherboards for testing them.

The test results show that the pipeline and execution units in Broadwell are very similar to those of its predecessor Haswell, while the Skylake has been reorganized a little.

The Skylake has a somewhat improved cache throughput and supports the new DDR4 RAM. This is important since RAM access is the bottleneck in many applications. On the other hand, the Skylake has reduced the level-2 cache associativity from 8 to 4.

Floating point division has been improved a little in Broadwell and integer division has been improved a little in Skylake. Gather instructions, which are used for collecting non-contiguous data from memory and joining them into a vector register, are improved somewhat in Broadwell, and a little more in Skylake. This makes it more efficient to collect data into vector registers.

Ever since the first Intel processor with out-of-order execution was released in 1995, there has been a limitation that no micro-operation could have more than two input dependencies. This meant that instructions with more than two input dependencies were split into two or more micro-operations. The introduction of fused multiply-and-add (FMA) instructions in Haswell made it necessary to overcome this limitation. Thus, the FMA instructions were the first instructions to be implemented with micro-operations with three input dependencies in an Intel processor. Once this limitation has been broken, the new capability can also be applied to other instructions. The Broadwell has extended the capability for three-input micro-operations to add-with-carry, subtract-with-borrow and conditional move instructions. The Skylake has extended it further to a blend instruction. AMD processors have never had this limitation of two input dependencies. Perhaps this is the reason why AMD came before Intel with FMA instructions.

The Haswell and Broadwell have two execution units for floating point multiplication and FMA, but only one for addition. This is odd since most floating point code has more additions than multiplications. To get the maximum floating point throughput on these processors, one might have to replace some additions with FMA instructions with a multiplier of 1. Fortunately, the Skylake has fixed this imbalance and made two floating point arithmetic units, each of which can handle addition, multiplication and FMA. This gives a maximum throughput of two floating point vector operations per clock cycle.

The Skylake has increased the number of execution units for integer vector arithmetic from two to three. In general, the Skylake now has multiple execution units for almost all common operations (except memory write and data permutations). This means that an instruction or micro-operation rarely has to wait for a vacant execution unit. A throughput of four instructions per clock cycle is now a realistic goal for CPU-intensive code, unless the software contains long dependency chains. All arithmetic and logic units support vectors of up to 256 bits. The anticipated support for 512-bit vectors with the AVX-512 instruction set has been postponed to 2016 or 2017.

Intel's design has traditionally tried to standardize operation latencies, i.e. the number of clock cycles that a micro-operation takes. Operations with the same latencies were organized under the same execution port in order to avoid a clash when operations that start at different times would finish at the same time and so need the result bus at the same time. The Skylake microarchitecture has been improved to allow operations with several different latencies under the same execution port. There is still some standardization of latencies left, though. All floating point additions, multiplications and FMA operations have a latency of 4 clock cycles on Skylake, while these were 3 and 5 on previous processors.

Store forwarding is one clock cycle faster on Skylake than on previous processors. Store forwarding is the time it takes to read from a memory address immediately after writing to the same address.

Previous Intel processors have different states for code that uses the AVX instruction sets allowing 256-bit vectors versus legacy code with 128-bit vectors and no VEX prefixes. The Sandy Bridge, Ivy Bridge, Haswell and Broadwell processors all have these states and a serious penalty of 70 clock cycles for state switching when a piece of code accidentally mixes VEX and non-VEX instructions. This annoying state shift and penalty has been eliminated on the Skylake. Apparently, the implementation of 256-bit registers has become more streamlined.

I observed an interesting phenomenon when executing 256-bit vector instructions on the Skylake. There is a warm-up period of approximately 14 µs before it can execute 256-bit vector instructions at full speed. Apparently, the upper 128-bit half of the execution units and data buses is turned off in order to save power when it is not used. As soon as the processor sees a 256-bit instruction it starts to power up the upper half. It can still execute 256-bit instructions during the warm-up period, but it does so by using the lower 128-bit units twice for every 256-bit vector. The result is that the throughput for 256-bit vectors is 4-5 times slower during this warm-up period. If you know in advance that you will need to use 256-bit instructions soon, then you can start the warm-up process by placing a dummy 256-bit instruction at a strategic place in the code. My measurements showed that the upper half of the units is shut down again after 675 µs of inactivity.
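A rough way to see what the warm-up costs is to model a burst of 256-bit work using the figures above (about 14 µs of warm-up, throughput 4-5 times lower during it). This is only a sketch: the function name, the 4.5x midpoint and the assumption of constant throughput within each phase are illustrative choices, not measurements.

```python
WARMUP_US = 14.0   # approximate warm-up period reported above
SLOWDOWN = 4.5     # assumed midpoint of the reported 4-5x slowdown

def elapsed_us(work_us, warm=False):
    """Wall time for `work_us` microseconds' worth of full-speed 256-bit work."""
    if warm:
        return work_us
    # Amount of work completed during the warm-up, in full-speed microseconds.
    done_during_warmup = WARMUP_US / SLOWDOWN
    if work_us <= done_during_warmup:
        # The whole burst finishes before the upper half is powered up.
        return work_us * SLOWDOWN
    return WARMUP_US + (work_us - done_during_warmup)

for work in (1.0, 10.0, 100.0):
    cold = elapsed_us(work)
    print(f"{work:6.1f} us of work: {cold:6.1f} us cold, {cold - work:5.1f} us lost")
```

Under this simple model the loss per cold start is bounded by roughly 11 µs, so a dummy 256-bit instruction placed ahead of time mainly helps short bursts or latency-sensitive code.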

This warm-up phenomenon has reportedly been observed in previous processors as well (see agner.org/optimize/blog/read.php?i=378#378), but I have not observed it before in any of the processors that I have tested. Perhaps some high-end versions of Intel processors have this ability to shut down the upper 128-bit lane in order to save power, while other variants of the same processors have no such feature. This is something that needs further investigation.

   
Sustained 64B loads per cycle on Haswell & Sky
Author: Nathan Kurz Date: 2015-12-26 18:03
Hi Agner --

Great to see the updates for Skylake! Thanks for putting all the effort into making these. Your guides are tremendous resources.

You mention in your guides that bank conflicts should no longer be a problem for Haswell or Skylake, and that "There are two identical memory read ports (port 2 and 3) and one write port (port 4). These ports all have the full 256 bits width. This makes it possible to make two memory reads and one memory write per clock cycle, with any register size up to 256 bits.". You also say that cache bank conflicts are not a problem, and that "It is always possible to do two cache reads in the same clock cycle without causing a cache bank conflict."

Do you have code that demonstrates this? Even without writes, I'm currently unable to create code that can sustain 2 256-bit loads per cycle from L1D. I started with code that used a fused-multiply-add, but then realized that I was being slowed down by the loads rather than the math. I'm also seeing timing effects that make me suspect that some sort of bank conflict must be occurring, since some orderings of loads from L1 are consistently faster than others. I've put my current test code up here: https://gist.github.com/nkurz/9a0ed5a9a6e591019b8e

When compiled with "gcc -fno-inline -std=gnu99 -Wall -O3 -g -march=native l1d.c -o l1d", results look like this on Haswell:
Testing with SIZE=4096...
calc_simple(array1, array2, size): 5.01 cycles/input
calc_fma(array1, array2, size): 0.22 cycles/input
calc_fma_reordered(array1, array2, size): 0.20 cycles/input
calc_load_only(array1, array2, size): 0.21 cycles/input [ERROR]
calc_load_only_reordered(array1, array2, size): 0.18 cycles/input [ERROR]

And like this on Skylake:
Testing with SIZE=4096...
calc_simple(array1, array2, size): 4.02 cycles/input
calc_fma(array1, array2, size): 0.20 cycles/input
calc_fma_reordered(array1, array2, size): 0.17 cycles/input
calc_load_only(array1, array2, size): 0.20 cycles/input [ERROR]
calc_load_only_reordered(array1, array2, size): 0.17 cycles/input [ERROR]


calc_simple() shows that the latency of an FMA on Haswell is 5 cycles, while it's only 4 cycles on Skylake. It's a simple approach in that there is no unrolling, so we are latency limited. So far, so good.

calc_fma() shows a straightforward approach of loading 4 YMM vectors of floats, and then multiplying them by another 4 YMM vectors of floats, using 4 separate accumulators. Results are slightly slower on Haswell than on Skylake, presumably because 4-way unrolling is not enough to hide the 5 cycle latency of the FMA on Haswell.
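The latency argument can be put into numbers. The sketch below assumes 8 floats per YMM vector, one serial FMA dependency chain per accumulator, and the FMA latencies quoted above (5 cycles on Haswell, 4 on Skylake); the function names are made up for illustration.

```python
FLOATS_PER_YMM = 8  # 256-bit vector of 32-bit floats

def latency_bound_cycles_per_input(fma_latency, accumulators):
    """Lower bound on cycles per float when each accumulator is a serial FMA chain."""
    # Each chain completes one vector FMA every `fma_latency` cycles.
    return fma_latency / (accumulators * FLOATS_PER_YMM)

def accumulators_to_saturate(fma_latency, fma_units=2):
    """Independent chains needed to keep `fma_units` FMA units busy every cycle."""
    return fma_latency * fma_units

print(latency_bound_cycles_per_input(5, 4))  # Haswell with 4 accumulators
print(latency_bound_cycles_per_input(4, 4))  # Skylake with 4 accumulators
print(accumulators_to_saturate(5), accumulators_to_saturate(4))
```

With 4 accumulators this bound comes out at about 0.16 cycles/input on Haswell and 0.125 on Skylake, both below the measured 0.22 and 0.20, which fits the earlier remark that the loads rather than the math are the limit.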

calc_fma_reordered() is the first surprise. This is the same as calc_fma(), but loads the vectors in a different order: +96, +32, +64, +0 instead of the in-order byte offsets of +0, +32, +64, +96. I haven't seen any theory that would explain why there would be a difference in speed for these two orders.

calc_load_only() is the next surprise. I dropped the FMA altogether, and just did the loads. We get a slight speed up on Haswell (agreeing with the FMA latency), but no speed up on Skylake. Since there is nothing in the loop but the loads, if we can execute 2 32B loads per cycle, I would have expected to see .125 cycles per input. The [ERROR] on the line is expected, and is because we are not actually calculating the sum.

calc_load_only_reordered() continues the surprise. Once again, reading the vectors in non-linear order improves the speed considerably. But the speed is still much less than .125 that we would see for the theoretical 2 loads per cycle. Again, [ERROR] is expected because there is no math being done.

Do you have any idea what's happening here? Why would the ordering of the loads matter if all the results are in L1D? Why can't I get to .125 cycles per float? I've inspected the results with 'perf record -F 10000 ./l1d' / 'perf report' on both machines, and the assembly looks like I'd expect. I can make the loop logic slightly better, but this doesn't seem to be the limiting factor. What do I need to do differently to achieve sustained load speeds of 64B per cycle on Haswell and Skylake?
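For comparing these cycles/input figures with a bytes-per-cycle limit, a small conversion helps. The sketch assumes that one "input" means one 4-byte float from each of the two arrays, i.e. 8 bytes loaded; that assumption is not stated in the post, but it is the reading under which 2 x 32B loads per cycle gives .125 cycles per input.

```python
BYTES_PER_INPUT = 8  # assumption: one 4-byte float from each of the two arrays

def bytes_per_cycle(cycles_per_input):
    """Load bandwidth implied by a cycles/input measurement."""
    return BYTES_PER_INPUT / cycles_per_input

for label, cpi in (("theoretical", 0.125), ("load_only", 0.20), ("reordered", 0.17)):
    print(f"{label:12s} {cpi:.3f} cycles/input = {bytes_per_cycle(cpi):.1f} bytes/cycle")
```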

   
Sustained 64B loads per cycle on Haswell & Sky
Author: Agner Date: 2015-12-27 01:48
Nathan Kurz wrote:
reading the vectors in non-linear order improves the speed considerably. But the speed is still much less than .125 that we would see for the theoretical 2 loads per cycle.
It is possible to make two reads and one write in the same clock cycle, but it is not possible to obtain a continuous throughput at this theoretical maximum. You are always limited by cache ways, read/write buffers, faulty prefetching, suboptimal reordering, etc. The write operations may sometimes use port 2 or 3 for address calculation, where the maximum throughput requires that they use port 7. It is quite likely that there are other effects that I am not aware of. The execution times that I have measured for 2 reads and 1 write are fluctuating a lot, and typically 40 - 60 % longer than the theoretical minimum.
   
Sustained 64B loads per cycle on Haswell & Sky
Author: Nathan Kurz Date: 2015-12-27 18:59
Agner wrote:
You are always limited by cache ways, read/write buffers, faulty prefetching, suboptimal reordering, etc.
Yes, although in my example I'm considering the much simpler case where there are two reads but no writes, and all data is already in L1. So although problematic in the real world, these shouldn't be a factor here. In fact, I see the same maximum speed if I read the same 4 vectors over and over rather than striding over all the data. I've refined my example, though, and think I now understand what's happening. The problem isn't a bank conflict, rather it's a slowdown due to unaligned access. I don't think I've seen this discussed before.

Contrary to my previous understanding, alignment makes a big difference on the speed at which vectors are read from L1 to register. If your data is 16B aligned rather than 32B aligned, a sequential read from L1 is no faster with 256-bit YMM reads than it is with 128-bit XMM reads. VMOVAPS and VMOVUPS have the same speed, but you cannot achieve 2 32B loads per cycle if the underlying data is not 32B aligned. If the data is 32B aligned, you still can't quite sustain 64 B/cycle of load with either, but you can get to about 54 B/cycle with both.

I put up new test code here: https://gist.github.com/nkurz/439ca1044e11181c1089

Results at L1 sizes are essentially the same on Haswell and Skylake.

Loading 4096 floats with 64 byte raw alignment
Vector alignment 8:
load_xmm : 19.79 bytes/cycle
load_xmm_nonsequential : 23.41 bytes/cycle
load_ymm : 28.64 bytes/cycle
load_ymm_nonsequential : 36.57 bytes/cycle

Vector alignment 16:
load_xmm : 29.26 bytes/cycle
load_xmm_nonsequential : 29.05 bytes/cycle
load_ymm : 28.44 bytes/cycle
load_ymm_nonsequential : 36.90 bytes/cycle

Vector alignment 24:
load_xmm : 19.79 bytes/cycle
load_xmm_nonsequential : 23.54 bytes/cycle
load_ymm : 28.64 bytes/cycle
load_ymm_nonsequential : 36.57 bytes/cycle

Vector alignment 32:
load_xmm : 29.05 bytes/cycle
load_xmm_nonsequential : 28.85 bytes/cycle
load_ymm : 53.19 bytes/cycle
load_ymm_nonsequential : 52.51 bytes/cycle

What this says is that unless your loads are 32B aligned, regardless of method you are limited to about 40B loaded per cycle. If you are sequentially loading non-32B aligned data from L1, the speeds for 16B loads and 32B loads are identical, and limited to less than 32B per cycle. All alignments not shown were the same as 8B alignment.
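The post does not say what makes the misaligned cases slow. One thing that can be checked on paper is how many sequential loads straddle a 64-byte cache line at each alignment. The sketch below is only a counting exercise: it assumes a 64-byte-aligned array base (as in the "64 byte raw alignment" line of the output) and back-to-back loads, and it does not establish that line splits are the cause.

```python
LINE_BYTES = 64  # L1D cache line size on these processors

def split_fraction(start_offset, load_bytes, loads=64):
    """Fraction of back-to-back loads that straddle a cache-line boundary."""
    splits = 0
    for i in range(loads):
        addr = start_offset + i * load_bytes
        if addr % LINE_BYTES + load_bytes > LINE_BYTES:
            splits += 1
    return splits / loads

print("offset  xmm(16B)  ymm(32B)")
for offset in (8, 16, 24, 32):
    print(f"{offset:6d}  {split_fraction(offset, 16):8.2f}  {split_fraction(offset, 32):8.2f}")
```

The alignments with no splits in this count (16 and 32 for XMM, 32 only for YMM) are the same ones that ran fastest in the sequential measurements above.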

Loading in a non-sequential order is about 20% faster for unaligned XMM and unaligned YMM loads. It's possible there is a faster order than I have found so far. Aligned loads are the same speed regardless of order. Maximum speed for aligned XMM loads is about 30 B/cycle, and maximum speed for aligned YMM loads is about 54 B/cycle.

At L2 sizes, the effect still exists, but is less extreme. XMM loads are limited to 13-15 B/cycle on both Haswell and Skylake. On Haswell, YMM non-aligned loads are 18-20 B/cycle, and YMM aligned loads are 24-26 B/cycle. On Skylake, YMM aligned loads are slightly faster at 27 B/cycle. Interestingly, sequential unaligned L2 loads on Skylake are almost the same as aligned loads (26 B/cycle), while non-sequential loads are much slower (17 B/cycle).

At L3 sizes, alignment is barely a factor. On Haswell, all loads are limited to 11-13 B/cycle. On Skylake, XMM loads are the same 11-13 B/cycle, while YMM loads are slightly faster at 14-17 B/cycle.

Coming from memory, XMM and YMM loads on Haswell are the same regardless of alignment, at about 5 B/cycle. On Skylake, XMM loads are about 6.25 B/cycle, and YMM loads are about 6.75 B/cycle, with little dependence on alignment. It's possible that prefetch can improve these speeds slightly.

The write operations may sometimes use port 2 or 3 for address calculation, where the maximum throughput requires that they use port 7.
I don't recall if you mention it in your manuals, but I presume you are aware that Port 7 on Haswell and Skylake is only capable of "simple" address calculations? Thus sustaining 2 loads and a store is only possible if the store address is [const + base] form rather than [const + index*scale + base]. And as you point out, even if you do this, it can still be difficult to force the processor to use only Port 7 for the store address.
   
Sustained 64B loads per cycle on Haswell & Sky
Author: John D. McCalpin Date: 2016-01-04 07:21
Thanks to Nathan Kurz for the interesting test code.

I was able to reproduce the results on a Xeon E5-2660 v3 system once I pinned the core frequency to match the nominal frequency (2.5 GHz on that system).

It looks like the results are actually a bit better than reported because the tests are short enough that the timer overhead is not negligible. I modified the code to print out the "cycle_diff" variable in each case and see that the fastest tests are only about 312 cycles. RDTSCP overhead on this system is 32 cycles (for my very similar inline assembly), which suggests that the loop is only taking about 280 cycles. This raises the estimate of the throughput from 52.5 Bytes/cycle to 52.5*312/280 = 58.5 Bytes/cycle. This is 91.4% of peak, which is almost as fast as the best results I have been able to obtain with a DDOT kernel.
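The overhead correction above is easy to reproduce. A minimal sketch, using only the numbers from this paragraph (the function name is made up for illustration):

```python
PEAK_BPC = 64.0  # two 32-byte loads per cycle

def corrected_throughput(reported_bpc, measured_cycles, timer_overhead_cycles):
    """Rescale a bytes/cycle figure after removing a fixed timer overhead."""
    loop_cycles = measured_cycles - timer_overhead_cycles
    return reported_bpc * measured_cycles / loop_cycles

bpc = corrected_throughput(52.5, 312, 32)
print(f"{bpc:.1f} bytes/cycle, {100 * bpc / PEAK_BPC:.1f}% of peak")
```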

For my DDOT measurements, I ran a variety of problem sizes and did a least-squares fit to estimate the slope and intercept of the cycle count as a function of problem size. This gave estimated slopes corresponding to up to ~95% of 64 Bytes/cycle. (I used this approach because I was reading not only the TSC, but up to 8 PMCs as well, and the total overhead became quite large -- well over 200 cycles.)
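The slope/intercept approach can be sketched as an ordinary least-squares line fit of cycle count against problem size. The data points below are synthetic, generated from an assumed 200-cycle overhead and 60.8 bytes/cycle (95% of 64), purely to show that the fit separates the two; they are not measurements.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: bytes processed vs. total cycles including fixed overhead.
sizes = [4096, 8192, 16384, 32768]
cycles = [200 + size / 60.8 for size in sizes]

slope, intercept = fit_line(sizes, cycles)
print(f"overhead ~ {intercept:.0f} cycles, throughput ~ {1 / slope:.1f} bytes/cycle")
```

The intercept absorbs the fixed cost of reading the counters, so the slope reflects the loop alone even when the overhead is a large share of each individual measurement.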

In my experience, it is exceedingly difficult to understand performance limiters once you have reached this level of performance -- even if you are on the hardware engineering team! As a rule of thumb, anything exceeding 8/9 (88.9%) of the simple theoretical peak is pretty close to asymptotic, and exceeding 16/17 (94.1%) of peak is extremely uncommon.

   
Test results for Broadwell and Skylake
Author: Peter Cordes Date: 2015-12-28 06:19
Thanks for your excellent work on the instruction tables and microarchitecture guide.

Agner wrote:

This annoying state shift and penalty has been eliminated on the Skylake. Apparently, the implementation of 256-bit registers has become more streamlined.

I wonder if the performance penalty has been replaced with a power-consumption penalty. Perhaps there's still a "state C" where Skylake uses more power? The performance penalty on the earlier CPUs ensures most software will still avoid this. I don't think this is very likely; probably they came up with some clever way to avoid penalties except maybe when forwarding results from a non-VEX op to a 256b op (over the bypass network).

Do 128b non-VEX ops have a "false" dependency on the upper128 of a register? Is there a latency penalty when a 256b insn reads a ymm register last written by a non-VEX insn (or an extra uop to merge the xmm into the ymm)?

More importantly, is VZEROUPPER helpful in any way on Skylake? (Obviously this is a bad idea for binaries that might be run on older CPUs).

There is one use-case for mixing VEX and non-VEX : PBLENDVB x,x,xmm0 is 1 uop, p015. VPBLENDVB v,v,v,v is 2 uops, 2p015, and 2c latency. I'm picturing a function that needs to do a lot of blends but can also benefit from using 3-operand non-destructive VEX insns, except for non-VEX PBLENDVB.

Also: I remember reading something in a realworldtech forum thread about wider uop fetch in Skylake. (The forum isn't searchable, so I prob. can't find it now). Is there any improvement in the frontend for loops that don't fit in the loop buffer? I was hoping Skylake would fetch whole uop cache lines (up to 6 uops) per clock, and put them into a small buffer to more consistently issue 4 fused-domain uops per clock.

I've considered trying to align / reorder insns for uop-cache throughput in a loop that didn't quite fit in the loop buffer. I saw performance differences (on SnB) from reordering, but I never went beyond trial and error. I don't have an editor that shows the assembled binary updated on the fly as source edits are made, let alone with 32B boundaries marked and uops grouped into cache lines, so it would have been very time consuming.

   
Test results for Broadwell and Skylake
Author: Agner Date: 2015-12-29 01:36
Peter Cordes wrote:
Perhaps there's still a "state C" where Skylake uses more power?
I find no evidence of states, and I don't think it requires more power. The 128/256-bit vectors are probably treated somewhat like 8/16/32/64 bit general purpose registers.
Do 128b non-VEX ops have a "false" dependency on the upper128 of a register? Is there a latency penalty when a 256b insn reads a ymm register last written by a non-VEX insn (or an extra uop to merge the xmm into the ymm)?
There is a false dependency and 1 clock of extra latency, but no extra µop seen in the counters. I see no difference in the clock counts here whether the 128-bit instruction has VEX prefix or not.
   
Test results for Broadwell and Skylake
Author: Tacit Murky Date: 2016-01-04 15:04
Hello, Agner. Thanks for the detailed work, but there are some oddities in the results that look like mistakes. Here are two examples:
For Haswell: «MOVBE r64,m64» is listed as a 3-µop instruction with a throughput of 0.5 CPI (2 IPC), which is impossible given the pipeline's overall limit of 4 per clock cycle. The AIDA64 readout (see instlatx64.atw.hu) shows 1 IPC here.
For Skylake: «PMUL* (v,)v,v» is listed as a 1-µop instruction with only 1 IPC, despite 2 ports being available for execution (p01). AIDA64 shows a throughput of 2 IPC (0.5 CPI) because of the second integer multiplier.
There are more minor mistakes elsewhere.
   
Test results for Broadwell and Skylake
Author: Agner Date: 2016-01-05 13:16
You are right.
The throughput for MOVBE r64,m64 is 4 instructions per 3 clock cycles.
The throughput for integer vector multiplication instructions and several other integer vector instructions is 2 instructions per clock for 128-bit and 256-bit registers, but 1 instruction per clock for 64-bit registers, because port 0 supports these instructions for all vector sizes, while port 1 supports the same instructions only for 128-bit and 256-bit vectors.
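The corrected MOVBE figure follows from dividing the pipeline width by the µop count. A minimal sketch, assuming a limit of 4 fused-domain µops per clock cycle and the 3-µop count quoted earlier in the thread:

```python
from fractions import Fraction

PIPELINE_WIDTH = 4  # assumed limit: fused-domain µops per clock cycle

def max_instructions_per_clock(uops_per_instruction):
    """Upper bound on instruction throughput set by the pipeline width alone."""
    return Fraction(PIPELINE_WIDTH, uops_per_instruction)

ipc = max_instructions_per_clock(3)  # MOVBE r64,m64 is listed as 3 µops
print(f"{ipc} instructions per clock, {1 / ipc} clocks per instruction")
```

A listed value of 2 instructions per clock for a 3-µop instruction would need 6 µops per clock, which is why it could not be right.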
   
Test results for Broadwell and Skylake
Author: Tacit Murky Date: 2016-03-09 20:58
More stuff. Have you measured the total throughput of immediate data? The AIDA64 readout is inconsistent and may be erroneous. Things to consider:
1) The legacy decoder should have a different throughput than the µop cache; the IDQ may or may not impose its own restrictions.
2) As is known for SB and IB (but may not be true for Haswell and newer CPUs; would be cool to test all of them), a µop-cache slot has 4 bytes of data for both imm and ofs fields; so if there is an 8-byte constant, or the total length of the imm and ofs constants is >4 bytes, then 2 entries are allocated for that µop. The literal pool in the scheduler may have its own restrictions in port number (3…6) and width (4 or 8 bytes).
3) Instructions of interest:
—MOV r32/64,imm32/64 : 4/8 bytes of literals per instruction with 4 IPC of max. T-put (ideally should be 16/32 bytes/cl.);
—ADD r32,imm32 : 4 bytes of literals per instruction with 4 IPC of max. T-put;
—BLENDPS/PD xmm,[r+ofs32],imm8 : 5 bytes of total literals per instruction with 3 IPC of max. T-put, but only 2 L1D reads/cl.; may substitute 3-rd blend with MOVAPS [r+ofs32],xmm , having 5+5+4=14 bytes of literals for 3 IPC (but 5 µops).
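The literal-byte arithmetic in points 2 and 3 can be tabulated as follows. This is a sketch of the rule as described for SB and IB only; the helper names are made up, and treating each instruction as one entry candidate is a simplification.

```python
SLOT_LITERAL_BYTES = 4  # literal bytes per µop-cache slot, as described for SB/IB

def literal_bytes(imm_bytes, ofs_bytes=0):
    """Immediate plus offset bytes carried by one instruction."""
    return imm_bytes + ofs_bytes

def uop_cache_entries(imm_bytes, ofs_bytes=0):
    """Entries allocated under the rule in point 2 (an 8-byte constant also exceeds 4 bytes)."""
    return 2 if literal_bytes(imm_bytes, ofs_bytes) > SLOT_LITERAL_BYTES else 1

# One clock cycle's worth of instructions for the cases in point 3,
# written as (imm_bytes, ofs_bytes) per instruction.
groups = {
    "4 x MOV r32,imm32": [(4, 0)] * 4,
    "4 x MOV r64,imm64": [(8, 0)] * 4,
    "4 x ADD r32,imm32": [(4, 0)] * 4,
    "2 x BLENDPS + MOVAPS store": [(1, 4), (1, 4), (0, 4)],
}

totals = {name: sum(literal_bytes(*ins) for ins in group) for name, group in groups.items()}
for name, total in totals.items():
    print(f"{name}: {total} literal bytes per clock")
```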
   
Minor bug in the microarchitecture manual
Author: SHK Date: 2016-01-10 13:05
Hi Agner, thanks a lot for your manuals, they're an invaluable source, even better than the official ones.

I've noticed a small error in microarchitecture.pdf. On page 148 (description of Skylake's pipeline), you say that "The sizes of the reorder buffer, reservation station and register file have allegedly been increased, but the details have not been published".
Their sizes have been published (224 slots for the ROB, 97 RS entries, 180 PREGS, and so on); you can view them on page 12 of this presentation from IDF15 (it's the SPCS001 session)

https://hubb.blob.core.windows.net/e5888822-986f-45f5-b1d7-08f96e618a7b-published/73ed87d8-209a-4ca1-b456-42a167ffd0bd/SPCS001%20-%20SF15_SPCS001_103f.pdf?sv=2014-02-14&sr=c&sig=XKetbBtWcJzdBjJEc1bFubMzOrEPpoVcK6%2Bm693ZUts%3D&se=2016-01-11T18%3A50%3A10Z&sp=rwd

Thanks again and keep up the good work!

   
Minor bug in the microarchitecture manual
Author: Agner Date: 2016-01-16 03:26
Thanks for the tip. The link doesn't work. I found it here: myeventagenda.com/sessions/0B9F4191-1C29-408A-8B61-65D7520025A8/7/5 session SPCS001.
   
Test results for Broadwell and Skylake
Author: John D. McCalpin Date: 2016-01-12 13:54
I just ran across some performance counter bugs on Haswell that may influence one's interpretation of instruction retirement rates and may bias measurements of uops per instruction.

I put performance counters around 100 (outer) iterations of a simple 10-instruction loop that executed 1000 times. According to Agner's instruction tables this loop should have 12 uops. Both the fixed-function "instructions retired" and the programmable "INST_RETIRED.ANY_P" events report 12 instructions per loop iteration (not 10), while the UOPS_RETIRED.ALL programmable counter event reported 14 uops per loop iteration (not 12). While I could be misinterpreting the uop counts, there is no way that I could have mis-counted the instructions --- it took all of my fingers, but did not generate an overflow condition. ;-)
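To put the discrepancy in relative terms, the counts above can be turned into percentages. A trivial sketch using only the per-iteration numbers from this paragraph:

```python
def overcount_percent(reported, expected):
    """Relative overcount of a counter reading versus the expected value."""
    return 100.0 * (reported - expected) / expected

print(f"instructions retired: {overcount_percent(12, 10):.1f}% over")
print(f"uops retired:         {overcount_percent(14, 12):.1f}% over")
```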

It turns out that there are a number of errata for both the instructions retired events and the uops retired event on all Intel Haswell processors. Somewhat perversely, the different Haswell products have different errata listed, even though they have the same DISPLAYFAMILY_DISPLAYMODEL designation, but all of them that I checked (Xeon E5 v3 (HSE71 in doc 330785), Xeon E3 v3 (HSW141 in doc 328908), and 4th Generation Core Desktop (HSD140 in doc 328899)) include an erratum to the effect that the "instructions retired" counts may overcount or undercount. This erratum is also listed for the 5th Generation Core (Broadwell) processors (BDM61 in doc 330836), but is not listed in the "specification update" document for the Skylake processors (doc 332689).

For this particular loop the counts are completely stable with respect to variations in loop length (e.g., from 500 to 11000 shows no effect other than asymptotically decreasing overhead). The machine is running with HyperThreading enabled, but there are no other users or non-OS tasks and this job was pinned to (local) core 4 on socket 1, so there is no way that interference with another thread (mentioned in several other errata) could account for seeing identical behavior over several hundred trials.

Reading between the lines, the language that Intel uses in the descriptions of these performance counter errata seems consistent with the language used in other cases for which the errors are not "large" (not approaching 100%), but are also not "small" (not limited to single-digit percentages). It is very hard to decide whether I want to take the time to try to characterize or bound this particular performance counter error. It may end up having an easy story, or it may end up being completely inexplicable without inspection of the processor RTL.

   
Test results for Broadwell and Skylake
Author: Jess Date: 2016-02-11 11:00
I notice that SKD044 on page 28 of this PDF:

www.intel.com/content/dam/www/public/us/en/documents/specification-updates/desktop-6th-gen-core-family-spec-update.pdf

explains why the discrepancy occurs and how large it is likely to be for this chip. Similar errata for other chips seem to be less detailed, though I haven't checked exhaustively.

   
Description of discrepancy
Author: Nathan Kurz Date: 2016-03-13 17:54
Jess wrote:
I notice that SKD044 on page 28 of this PDF:

www.intel.com/content/dam/www/public/us/en/documents/specification-updates/desktop-6th-gen-core-family-spec-update.pdf

explains why the discrepancy occurs and how large it is likely to be for this chip.

I appreciate the link, but I'm unable to find the portion that you refer to. Could you point more exactly to the details you found?

SKD044 doesn't exist in that document, SKL044 is about WRMSR, and nothing on page 28 seems relevant. I did find SKD044 in a different document (http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/6th-gen-core-u-y-spec-update.pdf) but still about WRMSR. The closest erratum I did find was SKL048 "Processor May Run Intel AVX Code Much Slower than Expected", but this is only when coming out of C6, and doesn't give other details.

   
Test results for Broadwell and Skylake
Author: Russell Van Zandt Date: 2016-02-22 17:50
Thank you all for the useful information. FYI, the latest Intel architecture optimization manual discusses the Skylake changes for the mixed AVX / SSE problem in great detail, including diagrams and tables. This is in section 11.3 Mixing AVX code with SSE code in the January 2016 edition. Skylake has not eliminated the problem entirely, with "partial register dependency + blend" as the penalty in one mode, and ~XSAVE in another mode. Use of VZEROUPPER is still recommended, in rule 72. "The Skylake microarchitecture implements a different state machine than prior generations to manage the YMM state transition associated with mixing SSE and AVX instructions. It no longer saves the entire upper YMM state transition ... but saves the upper bits of individual register. As a result ... will experience a penalty associated with partial register dependency...".

Other topics discussed include "Align data to 32 bytes", which was recently discussed in this blog too. Section 11.6.1

There is lots and lots of Skylake material, including the tradeoffs between electrical power reduction vs. performance. Like "The latency of the PAUSE instruction in prior generation microarchitecture is about 10 cycles, whereas on Skylake microarchitecture it has been extended to as many as 140 cycles... There's also a small power benefit in 2-core and 4-core systems... As the PAUSE latency has been increased significantly, workloads that are sensitive to PAUSE latency will suffer some performance loss." Section 8.4.7

   
Instruction Throughput on Skylake
Author: Nathan Kurz Date: 2016-04-23 13:16
In the Section 11 "Skylake" of your Microarchitecture Guide (http://www.agner.org/optimize/microarchitecture.pdf), you say: "There are four decoders, which can handle instructions generating up to four μops per clock cycle in the way described on page 121 for Sandy Bridge" and "Code that runs out of the μop cache are not subject to the limitations of the fetch and decode units. It can deliver a throughput of 4 (possibly fused) μops or the equivalent of 32 bytes of code per clock cycle."

This seems contradicted by Section 2.1 "Skylake Microarchitecture" of the Intel Optimization manual (http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf): "Legacy Decode Pipeline delivery of 5 uops per cycle to the IDQ compared to 4 uops in previous generations" and "The DSB delivers 6 uops per cycle to the IDQ compared to 4 uops in previous generations." These numbers also match Figure 2.1 in that guide, which makes me think the Intel manual is probably correct here.

About Skylake, you also say "It is designed for a throughput of four instructions per clock cycle." I've recently measured a few results that make me wonder if it's actually capable of more than that. Did you happen to do any tests that would confirm whether Skylake might be able to sustain 5 or 6 unfused instructions per cycle (thus possibly 7 or 8 including fused branches not taken) if the correct execution ports are available? From the published specs, I haven't been able to find evidence of a hard limit of 4 unfused instructions per cycle.

One stage for which I haven't been able to find documentation of the Skylake limits is retirement. Section 2.6.5 on Hyperthreading Retirement says "If one logical processor is not ready to retire any instructions, then all retirement bandwidth is dedicated to the other logical processor." I've seen claims that Skylake has "wider Hyperthreading retirement" than previous generations, and there is also a documented performance monitor event for "Cycles with less than 10 actually retired uops", which would imply that the maximum is at least 10. Do you know if this is true?

   
Instruction Throughput on Skylake
Author: Agner Date: 2016-04-24 00:02
Nathan Kurz wrote:
Did you happen to do any tests that would confirm whether Skylake might be able to sustain 5 or 6 unfused instructions per cycle (thus possibly 7 or 8 including fused branches not taken) if the correct execution ports are available?
NOPs have a throughput of 4 per clock cycle, and NOPs are not using any execution unit. I have never seen a higher throughput than 4 if you count a fused jump as one instruction. If two threads are running in the same core then each thread gets 2 NOPs per clock.

It is possible that the decoders have a higher throughput, but then there must be a bottleneck somewhere else. This will be hard to verify.

   
Instruction Throughput on Skylake
Author: Nathan Kurz Date: 2016-04-26 13:50
Agner wrote:
It is possible that the decoders have a higher throughput, but then there must be a bottleneck somewhere else. This will be hard to verify.
I'm starting to understand this better. Using Likwid and defining some custom events, I've determined that Skylake can sustain execution and retirement of 5 or 6 µops per cycle. This is ignoring jump/cc "macro-fusion", which would presumably boost us up to 7 or 8. The bottleneck appears to be the "renamer", which can only "issue" 4 µops per cycle.
The question is "What constitutes a µop for this stage?"

In 2.3.3.1 of the Intel Optimization Guide, when discussing Sandy Bridge it says: "The Renamer is the bridge between the in-order part in Figure 2-5, and the dataflow world of the Scheduler. It moves up to four micro-ops every cycle from the micro-op queue to the out-of-order engine. Although the renamer can send up to 4 micro-ops (unfused, micro-fused, or macro-fused) per cycle, this is equivalent to the issue port can dispatch six micro-ops per cycle."

The grammar is atrocious, but I think it means that while the Renamer can only move 4 µops, these can be micro-fused µops that will be "unlaminated" to a load µop and an action µop. From what I can tell, Skylake can move 6 fused µops per cycle from the DSB to the IDQ, but can only "issue" 4 fused µops per cycle from the IDQ. But since the scheduler only handles unfused µops, this means that we can "dispatch" up to twice that many depending on fusion.

The result of this is that while it is probably true to say that Skylake is "designed for a throughput of four instructions per clock cycle", instructions per clock cycle can be a poor metric to use when comparing fused and unfused instructions. Previously, I'd naively thought that once the instructions were decoded to the DSB, it didn't matter whether one expressed LOAD-OP as a single instruction or as a separate LOAD then OP.
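To make the fused versus unfused accounting concrete, here is a toy model in C. It is only a sketch under stated assumptions: the renamer width of 4 comes from the discussion above, only the load+op pairs are counted (no loop overhead, execution ports, or memory effects), and the function names are hypothetical.

```c
#include <assert.h>

/* Fused-domain uops the renamer must issue for n load+op pairs.
   micro_fused != 0: each pair is one instruction (one fused uop).
   micro_fused == 0: each pair is a separate load and op (two uops). */
unsigned fused_uops(unsigned n_pairs, int micro_fused)
{
    return micro_fused ? n_pairs : 2 * n_pairs;
}

/* Unfused-domain uops seen by the scheduler: one load plus one ALU op
   per pair, regardless of how the pair was written. */
unsigned unfused_uops(unsigned n_pairs)
{
    return 2 * n_pairs;
}

/* Lower bound on cycles imposed by the renamer alone.
   width is an assumed parameter (4 in the discussion above). */
double issue_bound_cycles(unsigned fused, unsigned width)
{
    assert(width > 0);
    return (double)fused / width;
}
```

In this model, four load+op pairs cost one renamer cycle when written as fused instructions and two when split, although the scheduler sees eight unfused µops in both cases. Real loops have other limits, so this is a bound and not a prediction of measured time.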

But if one is being constrained by the Renamer, it turns out that it can make a big difference in total execution time. For example, I'm finding that in a tight loop, this (two combined load-adds):

#define ASM_ADD_ADD_INDEX(in, sum1, sum2, index) \
__asm volatile ("add 0x0(%[IN], %[INDEX]), %[SUM1]\n" \
"add 0x8(%[IN], %[INDEX]), %[SUM2]\n" \
"add $0x10, %[INDEX]\n" : \
[IN] "+&r" (in), \
[SUM1] "+&r" (sum1), \
[SUM2] "+&r" (sum2), \
[INDEX] "+&r" (index))


is about 20% faster than this (two separate loads and adds):

#define ASM_LOAD_LOAD_INDEX(in, sum1, sum2, index, tmp) \
__asm volatile ("mov 0x0(%[IN], %[INDEX]), %[TMP]\n" \
"add %[TMP], %[SUM1]\n" \
"mov 0x8(%[IN], %[INDEX]), %[TMP]\n" \
"add %[TMP], %[SUM2]\n" \
"add $0x10, %[INDEX]\n" : \
[IN] "+&r" (in), \
[SUM1] "+&r" (sum1), \
[SUM2] "+&r" (sum2), \
[INDEX] "+&r" (index), \
[TMP] "=r" (tmp))

While the hybrid (one and one) is the same speed as the fast version:

#define ASM_LOAD_ADD_INDEX(in, sum1, sum2, index, tmp) \
__asm volatile ("mov 0x0(%[IN], %[INDEX]), %[TMP]\n" \
"add %[TMP], %[SUM1]\n" \
"add 0x8(%[IN], %[INDEX]), %[SUM2]\n" \
"add $0x10, %[INDEX]\n" : \
[IN] "+&r" (in), \
[SUM1] "+&r" (sum1), \
[SUM2] "+&r" (sum2), \
[INDEX] "+&r" (index), \
[TMP] "=r" (tmp))
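For anyone who wants to try this, here is a minimal sketch of how the combined load-add variant might be driven from C. It is not the harness used for the numbers above. The function name and loop structure are hypothetical, and it differs from the macro in three ways that are my assumptions: IN is passed as an input-only operand, and "cc" and "memory" clobbers are added so the compiler knows the asm touches flags and reads memory. A plain C fallback is included so it builds on non-x86 targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sums n 64-bit words using two accumulators per iteration.
   n must be even. Hypothetical helper, not from the original post. */
uint64_t sum_add_add_index(const uint64_t *in, size_t n)
{
    assert(n % 2 == 0);
    uint64_t sum1 = 0, sum2 = 0;
#if defined(__x86_64__) && defined(__GNUC__)
    uint64_t index = 0;                    /* byte offset into in[] */
    const uint64_t end = n * sizeof *in;   /* byte offset one past the end */
    while (index != end) {
        __asm volatile(
            "add 0x0(%[IN], %[INDEX]), %[SUM1]\n\t"
            "add 0x8(%[IN], %[INDEX]), %[SUM2]\n\t"
            "add $0x10, %[INDEX]\n\t"
            : [SUM1] "+&r"(sum1), [SUM2] "+&r"(sum2), [INDEX] "+&r"(index)
            : [IN] "r"(in)
            : "cc", "memory");
    }
#else
    /* Portable fallback: the compiler chooses the instructions here,
       so this path says nothing about micro-fusion. */
    for (size_t i = 0; i < n; i += 2) {
        sum1 += in[i];
        sum2 += in[i + 1];
    }
#endif
    return sum1 + sum2;
}
```

This only checks that the loop body computes the right sum. Measuring cycles per iteration would still need a cycle counter or performance-counter tool around the call, and a buffer small enough to stay in L1.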


What I don't understand yet is why all variations that directly increment %[IN] are almost twice as slow as the versions that use and increment %[INDEX]:

#define ASM_ADD_ADD_DIRECT(in, sum1, sum2) \
__asm volatile ("add 0x0(%[IN]), %[SUM1]\n" \
"add 0x8(%[IN]), %[SUM2]\n" \
"add $0x10, %[IN]\n" : \
[IN] "+&r" (in), \
[SUM1] "+&r" (sum1), \
[SUM2] "+&r" (sum2))

I also don't understand yet why I get 30% faster speeds for loops small enough to fit in the LSD than when unrolled such that the number of µops requires the DSB. Apparently the Loop Stream Detector still plays a performance role in some cases.

   
Instruction Throughput on Skylake
Author: Agner Date: 2016-04-27 01:14
Nathan Kurz wrote:
The bottleneck appears to be the "renamer", which can only "issue" 4 µops per cycle.
I think the decoding front end and the renamer are designed with a 4-wide pipeline for a throughput of four µops per clock. These µops are queuing up in the reservation station if execution of them is delayed for any reason. The scheduler can issue more than 4 µops per clock cycle in bursts until the queue is empty.
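The burst behaviour can be illustrated with a toy queue model in C. This is a sketch only: the widths, the queue capacity, and the stall length are illustrative parameters, not measured Skylake values, and the names are hypothetical.

```c
#include <assert.h>

typedef struct {
    unsigned long executed;   /* total uops executed over the run */
    unsigned peak_per_cycle;  /* largest number executed in one cycle */
} sim_result;

/* Each cycle the renamer adds up to rename_width uops to a queue of
   queue_capacity entries. Execution is blocked for the first
   stall_cycles cycles (for example, waiting on a cache miss), then
   drains up to exec_width uops per cycle. */
sim_result simulate(unsigned cycles, unsigned stall_cycles,
                    unsigned rename_width, unsigned exec_width,
                    unsigned queue_capacity)
{
    assert(rename_width > 0 && exec_width > 0);
    sim_result r = {0, 0};
    unsigned queued = 0;
    for (unsigned c = 0; c < cycles; c++) {
        unsigned room = queue_capacity - queued;
        queued += room < rename_width ? room : rename_width;
        if (c < stall_cycles)
            continue;                       /* nothing executes yet */
        unsigned n = queued < exec_width ? queued : exec_width;
        queued -= n;
        r.executed += n;
        if (n > r.peak_per_cycle)
            r.peak_per_cycle = n;
    }
    return r;
}
```

With a rename width of 4, an execution width of 6, a 64-entry queue, and a 10-cycle stall over 100 cycles, the model executes 6 µops per cycle while the backlog drains, but the total is still 400, which is 4 per cycle on average. Without the stall the peak never exceeds 4.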

I also don't understand yet why I get 30% faster speeds for loops small enough to fit in the LSD than when unrolled such that the number of µops requires the DSB. Apparently the Loop Stream Detector still plays a performance role in some cases.
Instruction fetch and decode is often a bottleneck - you need to check the instruction lengths. Alignment of the loop entry can also influence the results. Finally, you will often see cache effects influencing the results in a less than obvious way.