Agner's CPU blog


 
Test results for Broadwell and Skylake - Agner - 2015-12-26
Sustained 64B loads per cycle on Haswell & Sky - Nathan Kurz - 2015-12-26
Sustained 64B loads per cycle on Haswell & Sky - Agner - 2015-12-27
Sustained 64B loads per cycle on Haswell & Sky - Nathan Kurz - 2015-12-27
Sustained 64B loads per cycle on Haswell & Sky - John D. McCalpin - 2016-01-04
Sustained 64B loads per cycle on Haswell & Sky - T - 2016-06-18
Test results for Broadwell and Skylake - Peter Cordes - 2015-12-28
Test results for Broadwell and Skylake - Agner - 2015-12-29
Test results for Broadwell and Skylake - Tacit Murky - 2016-01-04
Test results for Broadwell and Skylake - Agner - 2016-01-05
Test results for Broadwell and Skylake - Tacit Murky - 2016-03-09
Test results for Broadwell and Skylake - Tacit Murky - 2016-06-05
Minor bug in the microarchitecture manual - SHK - 2016-01-10
Minor bug in the microarchitecture manual - Agner - 2016-01-16
Test results for Broadwell and Skylake - John D. McCalpin - 2016-01-12
Test results for Broadwell and Skylake - Jess - 2016-02-11
Description of discrepancy - Nathan Kurz - 2016-03-13
Test results for Broadwell and Skylake - Russell Van Zandt - 2016-02-22
Instruction Throughput on Skylake - Nathan Kurz - 2016-04-23
Instruction Throughput on Skylake - Agner - 2016-04-24
Instruction Throughput on Skylake - Nathan Kurz - 2016-04-26
Instruction Throughput on Skylake - Agner - 2016-04-27
Instruction Throughput on Skylake - T - 2016-06-18
Instruction Throughput on Skylake - Agner - 2016-06-19
Instruction Throughput on Skylake - Nathan Kurz - 2016-07-08
Instruction Throughput on Skylake - Nathan Kurz - 2016-07-11
Instruction Throughput on Skylake - Tacit Murky - 2016-07-17
Instruction Throughput on Skylake - T - 2016-08-08
Unlamination of micro-fused ops in SKL and earlier - Travis - 2016-09-09
 
Test results for Broadwell and Skylake
Author: Agner Date: 2015-12-26 08:27
The optimization manuals at www.agner.org/optimize/#manuals have been updated. I have now tested the Intel Broadwell and Skylake processors. I have not tested the AMD Excavator and Puma because I cannot find suitable motherboards for testing them.

The test results show that the pipeline and execution units in Broadwell are very similar to its predecessor Haswell, while the Skylake has been reorganized a little.

The Skylake has a somewhat improved cache throughput and supports the new DDR4 RAM. This is important since RAM access is the bottleneck in many applications. On the other hand, the Skylake has reduced the level-2 cache associativity from 8 to 4.

Floating point division has been improved a little in Broadwell and integer division has been improved a little in Skylake. Gather instructions, which are used for collecting non-contiguous data from memory and joining them into a vector register, are improved somewhat in Broadwell, and a little more in Skylake. This makes it more efficient to collect data into vector registers.

Ever since the first Intel processor with out-of-order execution was released in 1995, there has been a limitation that no micro-operation could have more than two input dependencies. This meant that instructions with more than two input dependencies were split into two or more micro-operations. The introduction of fused multiply-and-add (FMA) instructions in Haswell made it necessary to overcome this limitation. Thus, the FMA instructions were the first instructions to be implemented with micro-operations with three input dependencies in an Intel processor. Once this limitation has been broken, the new capability can also be applied to other instructions. The Broadwell has extended the capability for three-input micro-operations to add-with-carry, subtract-with-borrow and conditional move instructions. The Skylake has extended it further to a blend instruction. AMD processors have never had this limitation of two input dependencies. Perhaps this is the reason why AMD came before Intel with FMA instructions.

The Haswell and Broadwell have two execution units for floating point multiplication and FMA, but only one for addition. This is odd since most floating point code has more additions than multiplications. To get the maximum floating point throughput on these processors, one might have to replace some additions with FMA instructions with a multiplier of 1. Fortunately, the Skylake has fixed this imbalance: it has two floating point arithmetic units, both of which can handle addition, multiplication and FMA. This gives a maximum throughput of two floating point vector operations per clock cycle.
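
For illustration, a minimal sketch with AVX/FMA intrinsics (my own example, not from the manuals; compile with -mfma):

#include <immintrin.h>

// a + b computed on an FMA unit as a*1.0f + b. On Haswell/Broadwell this
// lets additions use either of the two FMA ports instead of queueing on
// the single FP add unit.
static inline __m256 add_via_fma(__m256 a, __m256 b)
{
    return _mm256_fmadd_ps(a, _mm256_set1_ps(1.0f), b);
}

Note that this trades the 3-cycle add latency for the 5-cycle FMA latency on Haswell/Broadwell, so it only pays off in throughput-bound code.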

The Skylake has increased the number of execution units for integer vector arithmetic from two to three. In general, the Skylake now has multiple execution units for almost all common operations (except memory write and data permutations). This means that an instruction or micro-operation rarely has to wait for a vacant execution unit. A throughput of four instructions per clock cycle is now a realistic goal for CPU-intensive code, unless the software contains long dependency chains. All arithmetic and logic units support vectors of up to 256 bits. The anticipated support for 512-bit vectors with the AVX-512 instruction set has been postponed to 2016 or 2017.

Intel's design has traditionally tried to standardize operation latencies, i.e. the number of clock cycles that a micro-operation takes. Operations with the same latency were grouped under the same execution port so that two operations starting at different times could not finish in the same clock cycle and compete for the result bus. The Skylake microarchitecture has been improved to allow operations with several different latencies under the same execution port. There is still some standardization of latencies left, though. All floating point additions, multiplications and FMA operations have a latency of 4 clock cycles on Skylake, where previous processors had 3 for addition and 5 for multiplication and FMA.

Store forwarding - reading from a memory address immediately after writing to the same address - is one clock cycle faster on Skylake than on previous processors.

Previous Intel processors have different states for code that uses the AVX instruction sets with 256-bit vectors versus legacy code with 128-bit vectors and no VEX prefixes. The Sandy Bridge, Ivy Bridge, Haswell and Broadwell processors all have these states and a serious penalty of 70 clock cycles for the state switch when a piece of code accidentally mixes VEX and non-VEX instructions. This annoying state shift and penalty has been eliminated on the Skylake. Apparently, the implementation of the 256-bit registers has become more streamlined.

I observed an interesting phenomenon when executing 256-bit vector instructions on the Skylake. There is a warm-up period of approximately 14 µs before it can execute 256-bit vector instructions at full speed. Apparently, the upper 128-bit half of the execution units and data buses is turned off in order to save power when it is not used. As soon as the processor sees a 256-bit instruction it starts to power up the upper half. It can still execute 256-bit instructions during the warm-up period, but it does so by using the lower 128-bit units twice for every 256-bit vector. The result is that the throughput for 256-bit vectors is 4-5 times slower during this warm-up period. If you know in advance that you will need to use 256-bit instructions soon, then you can start the warm-up process by placing a dummy 256-bit instruction at a strategic place in the code. My measurements showed that the upper half of the units is shut down again after 675 µs of inactivity.
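
A minimal sketch of such a warm-up hint (my own construction, based on the description above):

// Dummy 256-bit instruction to start powering up the upper lanes well
// before the real AVX code runs. A real arithmetic instruction is used
// because a register-zeroing idiom might be eliminated at rename and
// might not trigger the power-up.
static inline void warmup_ymm(void)
{
    __asm volatile ("vaddps %%ymm0, %%ymm0, %%ymm0" ::: "xmm0");
}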

This warm-up phenomenon has reportedly been observed in previous processors as well (see agner.org/optimize/blog/read.php?i=378#378), but I have not observed it before in any of the processors that I have tested. Perhaps some high-end versions of Intel processors have this ability to shut down the upper 128-bit lane in order to save power, while other variants of the same processors have no such feature. This is something that needs further investigation.

   
Sustained 64B loads per cycle on Haswell & Sky
Author: Nathan Kurz Date: 2015-12-26 18:03
Hi Agner --

Great to see the updates for Skylake! Thanks for putting all the effort into making these. Your guides are tremendous resources.

You mention in your guides that bank conflicts should no longer be a problem for Haswell or Skylake, and that "There are two identical memory read ports (port 2 and 3) and one write port (port 4). These ports all have the full 256 bits width. This makes it possible to make two memory reads and one memory write per clock cycle, with any register size up to 256 bits.". You also say that cache bank conflicts are not a problem, and that "It is always possible to do two cache reads in the same clock cycle without causing a cache bank conflict."

Do you have code that demonstrates this? Even without writes, I'm currently unable to create code that can sustain 2 256-bit loads per cycle from L1D. I started with code that used a fused-multiply-add, but then realized that I was being slowed down by the loads rather than the math. I'm also seeing timing effects that make me suspect that some sort of bank conflict must be occurring, since some orderings of loads from L1 are consistently faster than others. I've put my current test code up here: https://gist.github.com/nkurz/9a0ed5a9a6e591019b8e

When compiled with "gcc -fno-inline -std=gnu99 -Wall -O3 -g -march=native l1d.c -o l1d", results look like this on Haswell:
Testing with SIZE=4096...
calc_simple(array1, array2, size): 5.01 cycles/input
calc_fma(array1, array2, size): 0.22 cycles/input
calc_fma_reordered(array1, array2, size): 0.20 cycles/input
calc_load_only(array1, array2, size): 0.21 cycles/input [ERROR]
calc_load_only_reordered(array1, array2, size): 0.18 cycles/input [ERROR]

And like this on Skylake:
Testing with SIZE=4096...
calc_simple(array1, array2, size): 4.02 cycles/input
calc_fma(array1, array2, size): 0.20 cycles/input
calc_fma_reordered(array1, array2, size): 0.17 cycles/input
calc_load_only(array1, array2, size): 0.20 cycles/input [ERROR]
calc_load_only_reordered(array1, array2, size): 0.17 cycles/input [ERROR]


calc_simple() shows that the latency of an FMA on Haswell is 5 cycles, while it's only 4 cycles on Skylake. It's a simple approach in that there is no unrolling, so we are latency limited. So far, so good.

calc_fma() shows a straightforward approach of loading 4 YMM vectors of floats, and then multiplying them by another 4 YMM vectors of floats, using 4 separate accumulators. Results are slightly slower on Haswell than on Skylake, presumably because 4-way unrolling is not enough to hide the 5 cycle latency of the FMA on Haswell.

calc_fma_reordered() is the first surprise. This is the same as calc_fma(), but loads the vectors in a different order: +96, +32, +64, +0 instead of the in-order byte offsets of +0, +32, +64, +96. I haven't seen any theory that would explain why there would be a difference in speed for these two orders.

calc_load_only() is the next surprise. I dropped the FMA altogether, and just did the loads. We get a slight speed up on Haswell (agreeing with the FMA latency), but no speed up on Skylake. Since there is nothing in the loop but the loads, if we can execute 2 32B loads per cycle, I would have expected to see .125 cycles per input. The [ERROR] on the line is expected, and is because we are not actually calculating the sum.

calc_load_only_reordered() continues the surprise. Once again, reading the vectors in non-linear order improves the speed considerably. But the speed is still much less than the .125 that we would see for the theoretical 2 loads per cycle. Again, [ERROR] is expected because there is no math being done.

Do you have any idea what's happening here? Why would the ordering of the loads matter if all the results are in L1D? Why can't I get to .125 cycles per float? I've inspected the results with 'perf record -F 10000 ./l1d' / 'perf report' on both machines, and the assembly looks like I'd expect. I can make the loop logic slightly better, but this doesn't seem to be the limiting factor. What do I need to do differently to achieve sustained load speeds of 64B per cycle on Haswell and Skylake?

   
Sustained 64B loads per cycle on Haswell & Sky
Author: Agner Date: 2015-12-27 01:48
Nathan Kurz wrote:
reading the vectors in non-linear order improves the speed considerably. But the speed is still much less than .125 that we would see for the theoretical 2 loads per cycle.
It is possible to make two reads and one write in the same clock cycle, but it is not possible to obtain a continuous throughput at this theoretical maximum. You are always limited by cache ways, read/write buffers, faulty prefetching, suboptimal reordering, etc. The write operations may sometimes use port 2 or 3 for address calculation, where the maximum throughput requires that they use port 7. It is quite likely that there are other effects that I am not aware of. The execution times that I have measured for 2 reads and 1 write are fluctuating a lot, and typically 40 - 60 % longer than the theoretical minimum.
   
Sustained 64B loads per cycle on Haswell & Sky
Author: Nathan Kurz Date: 2015-12-27 18:59
Agner wrote:
You are always limited by cache ways, read/write buffers, faulty prefetching, suboptimal reordering, etc.
Yes, although in my example I'm considering the much simpler case where there are two reads but no writes, and all data is already in L1. So although problematic in the real world, these shouldn't be a factor here. In fact, I see the same maximum speed if I read the same 4 vectors over and over rather than striding over all the data. I've refined my example, though, and think I now understand what's happening. The problem isn't a bank conflict, rather it's a slowdown due to unaligned access. I don't think I've seen this discussed before.

Contrary to my previous understanding, alignment makes a big difference on the speed at which vectors are read from L1 to register. If your data is 16B aligned rather than 32B aligned, a sequential read from L1 is no faster with 256-bit YMM reads than it is with 128-bit XMM reads. VMOVAPS and VMOVUPS have the same speed, but you cannot achieve 2 32B loads per cycle if the underlying data is not 32B aligned. If the data is 32B aligned, you still can't quite sustain 64 B/cycle of load with either, but you can get to about 54 B/cycle with both.
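
For reference, a minimal sketch of the kind of kernel I'm timing (simplified; the full code is in the gist linked below). The inline asm keeps the loads from being optimized away, and the offset parameter shifts the starting alignment:

#include <stddef.h>

static void load_ymm(const char *base, size_t offset, size_t bytes)
{
    const char *p = base + offset;   // offset = 0, 8, 16, 24, 32, ...
    for (size_t i = 0; i + 64 <= bytes; i += 64) {
        __asm volatile ("vmovups (%0), %%ymm0\n\t"
                        "vmovups 32(%0), %%ymm1"
                        :: "r" (p + i) : "xmm0", "xmm1");
    }
}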

I put up new test code here: https://gist.github.com/nkurz/439ca1044e11181c1089

Results at L1 sizes are essentially the same on Haswell and Skylake.

Loading 4096 floats with 64 byte raw alignment
Vector alignment 8:
load_xmm : 19.79 bytes/cycle
load_xmm_nonsequential : 23.41 bytes/cycle
load_ymm : 28.64 bytes/cycle
load_ymm_nonsequential : 36.57 bytes/cycle

Vector alignment 16:
load_xmm : 29.26 bytes/cycle
load_xmm_nonsequential : 29.05 bytes/cycle
load_ymm : 28.44 bytes/cycle
load_ymm_nonsequential : 36.90 bytes/cycle

Vector alignment 24:
load_xmm : 19.79 bytes/cycle
load_xmm_nonsequential : 23.54 bytes/cycle
load_ymm : 28.64 bytes/cycle
load_ymm_nonsequential : 36.57 bytes/cycle

Vector alignment 32:
load_xmm : 29.05 bytes/cycle
load_xmm_nonsequential : 28.85 bytes/cycle
load_ymm : 53.19 bytes/cycle
load_ymm_nonsequential : 52.51 bytes/cycle

What this says is that unless your loads are 32B aligned, regardless of method you are limited to about 40B loaded per cycle. If you are sequentially loading non-32B aligned data from L1, the speeds for 16B loads and 32B loads are identical, and limited to less than 32B per cycle. All alignments not shown were the same as 8B alignment.

Loading in a non-sequential order is about 20% faster for unaligned XMM and unaligned YMM loads. It's possible there is a faster order than I have found so far. Aligned loads are the same speed regardless of order. Maximum speed for aligned XMM loads is about 30 B/cycle, and maximum speed for aligned YMM loads is about 54 B/cycle.

At L2 sizes, the effect still exists, but is less extreme. XMM loads are limited to 13-15 B/cycle on both Haswell and Skylake. On Haswell, YMM non-aligned loads are 18-20 B/cycle, and YMM aligned loads are 24-26 B/cycle. On Skylake, YMM aligned loads are slightly faster at 27 B/cycle. Interestingly, sequential unaligned L2 loads on Skylake are almost the same as aligned loads (26 B/cycle), while non-sequential loads are much slower (17 B/cycle).

At L3 sizes, alignment is barely a factor. On Haswell, all loads are limited to 11-13 B/cycle. On Skylake, XMM loads are the same 11-13 B/cycle, while YMM loads are slightly faster at 14-17 B/cycle.

Coming from memory, XMM and YMM loads on Haswell are the same regardless of alignment, at about 5 B/cycle. On Skylake, XMM loads are about 6.25 B/cycle, and YMM loads are about 6.75 B/cycle, with little dependence on alignment. It's possible that prefetch can improve these speeds slightly.

The write operations may sometimes use port 2 or 3 for address calculation, where the maximum throughput requires that they use port 7.
I don't recall if you mention it in your manuals, but I presume you are aware that Port 7 on Haswell and Skylake is only capable of "simple" address calculations? Thus sustaining 2 loads and a store is only possible if the store address is [const + base] form rather than [const + index*scale + base]. And as you point out, even if you do this, it can still be difficult to force the processor to use only Port 7 for the store address.
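
To illustrate the point (my sketch, Intel syntax, not from the manuals):

; sustainable at 2 loads + 1 store per cycle: the store address is simple
; [base + disp], so its address µop is eligible for the AGU on port 7
vmovaps ymm0, [rsi]
vmovaps ymm1, [rsi+32]
vmovaps [rdi+64], ymm0

; not sustainable: an indexed store address cannot use port 7, so its
; address µop competes with the loads for ports 2 and 3
vmovaps [rdi+rcx*4], ymm0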
   
Sustained 64B loads per cycle on Haswell & Sky
Author: John D. McCalpin Date: 2016-01-04 07:21
Thanks to Nathan Kurz for the interesting test code.

I was able to reproduce the results on a Xeon E5-2660 v3 system once I pinned the core frequency to match the nominal frequency (2.5 GHz on that system).

It looks like the results are actually a bit better than reported because the tests are short enough that the timer overhead is not negligible. I modified the code to print out the "cycle_diff" variable in each case and see that the fastest tests are only about 312 cycles. RDTSCP overhead on this system is 32 cycles (for my very similar inline assembly), which suggests that the loop is only taking about 280 cycles. This raises the estimate of the throughput from 52.5 Bytes/cycle to 52.5*312/280 = 58.5 Bytes/cycle. This is 91.4% of peak, which is almost as fast as the best results I have been able to obtain with a DDOT kernel.
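
For reference, a sketch of the kind of RDTSCP timing harness I mean (my variant, not the exact code used for these numbers):

#include <stdint.h>

// Timestamp read; the instruction itself costs ~30 cycles on this
// system, which must be subtracted for very short tests.
static inline uint64_t rdtscp_cycles(void)
{
    uint32_t lo, hi, aux;
    __asm volatile ("rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux));
    return ((uint64_t)hi << 32) | lo;
}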

For my DDOT measurements, I ran a variety of problem sizes and did a least-squares fit to estimate the slope and intercept of the cycle count as a function of problem size. This gave estimated slopes corresponding to up to ~95% of 64 Bytes/cycle. (I used this approach because I was reading not only the TSC, but up to 8 PMCs as well, and the total overhead became quite large -- well over 200 cycles.)

In my experience, it is exceedingly difficult to understand performance limiters once you have reached this level of performance -- even if you are on the hardware engineering team! As a rule of thumb, anything exceeding 8/9 (88.9%) of the simple theoretical peak is pretty close to asymptotic, and exceeding 16/17 (94.1%) of peak is extremely uncommon.

   
Sustained 64B loads per cycle on Haswell & Sky
Author: T Date: 2016-06-18 20:32
The aligned vs unaligned results make intuitive sense. In recent processors, the penalty for unaligned access has been progressively reduced: it went to zero on Sandy Bridge (and perhaps earlier), at least for loads that didn't cross a 64B cache-line boundary. In Haswell, even the 64B-crossing latency penalty disappeared - although only for loads, not stores. You can see this all graphically here:

blog.stuffedcow.net/2014/01/x86-memory-disambiguation/

The 2D charts are trying to get at the penalty of store-to-load forwarding, but the cells off of the main diagonal do a great job of showing the unaligned load/store penalties as well.

So you are finding that unaligned loads *still* have a penalty, even on Skylake - right? The key is loads that cross a 64B boundary. Fundamentally that requires bringing in two different lines from the L1 and merging the results, so you get a word composed of some of one line and some of another. The improvements culminating in Haswell reduced the latency of this operation to the point where it fits inside the standard 4 cycle latency for ideal L1 access, but they can't avoid the double bandwidth usage of the unaligned loads. In many algorithms, the maximum bandwidth of the L1 isn't approached (i.e., the loads per cycle are 1 or less), so unaligned access ends up the same as aligned. In your loop, however, you do saturate the load bandwidth, so loads that cross a 64B boundary will cut your throughput in half, or worse.

It doesn't explain the results you got by inverting the load order, but perhaps some of that can be explained by how the loads "pair up". That is, two aligned loads can pair up in the same cycle since each only needs 1 of the 2 "load paths" from L1. An unaligned load needs both, however. So if you have a load pattern like AAUUAAUU (where A is an aligned load and U is unaligned) you get:

cycle loads
0 AA
1 U
2 U
3 AA
4 U
5 U
...

So you get 4 loads every 3 cycles, because the aligned loads are always able to pair.

On the other hand, if you have a load pattern like AUAUAUAUA, you get the following:

cycle loads
0 A
1 U
2 A
3 U
....

I.e., only 3 loads every 3 cycles, or a 25% penalty to throughput, because the aligned loads end up being singletons as well. You might ask why OoO wouldn't solve this - well, OoO is based on the scheduler, which understands instruction dependencies and has a few other special-case tricks to re-order things (e.g., to avoid port conflicts), but otherwise still does stuff in-order. So it likely can't understand that it should try to reorder the loads to pair aligned loads. Furthermore, the memory model imposes restrictions on reordering loads (but I don't fully grok how this actually falls out in practice when you consider load buffers and the coherency protocol and so on).

All that to say that reordering the loads might easily swap the behavior from an AAUU behavior to an AUAU one.

   
Test results for Broadwell and Skylake
Author: Peter Cordes Date: 2015-12-28 06:19
Thanks for your excellent work on the instruction tables and microarchitecture guide.

Agner wrote:

This annoying state shift and penalty has been eliminated on the Skylake. Apparently, the implementation of 256-bit registers has become more streamlined.

I wonder if the performance penalty has been replaced with a power-consumption penalty. Perhaps there's still a "state C" where Skylake uses more power? The performance penalty on the earlier CPUs ensures most software will still avoid this. I don't think this is very likely; probably they came up with some clever way to avoid penalties except maybe when forwarding results from a non-VEX op to a 256b op (over the bypass network).

Do 128b non-VEX ops have a "false" dependency on the upper128 of a register? Is there a latency penalty when a 256b insn reads a ymm register last written by a non-VEX insn (or an extra uop to merge the xmm into the ymm)?

More importantly, is VZEROUPPER helpful in any way on Skylake? (Obviously this is a bad idea for binaries that might be run on older CPUs).

There is one use-case for mixing VEX and non-VEX: PBLENDVB x,x,xmm0 is 1 uop, p015. VPBLENDVB v,v,v,v is 2 uops, 2p015, and 2c latency. I'm picturing a function that needs to do a lot of blends, but can also benefit from using 3-operand non-destructive VEX insns for everything except the non-VEX PBLENDVB.

Also: I remember reading something in a realworldtech forum thread about wider uop fetch in Skylake. (The forum isn't searchable, so I prob. can't find it now). Is there any improvement in the frontend for loops that don't fit in the loop buffer? I was hoping Skylake would fetch whole uop cache lines (up to 6 uops) per clock, and put them into a small buffer to more consistently issue 4 fused-domain uops per clock.

I've considered trying to align / re-order insns for uop-cache throughput in a loop that didn't quite fit in the loop buffer. I saw performance differences (on SnB) from reordering, but I never went beyond trial and error. I don't have an editor that shows the assembled binary updated on the fly as source edits are made, let alone with 32B boundaries marked and uops grouped into cache lines, so it would have been very time consuming.

   
Test results for Broadwell and Skylake
Author: Agner Date: 2015-12-29 01:36
Peter Cordes wrote:
Perhaps there's still a "state C" where Skylake uses more power?
I find no evidence of states, and I don't think it requires more power. The 128/256-bit vectors are probably treated somewhat like 8/16/32/64 bit general purpose registers.
Do 128b non-VEX ops have a "false" dependency on the upper128 of a register? Is there a latency penalty when a 256b insn reads a ymm register last written by a non-VEX insn (or an extra uop to merge the xmm into the ymm)?
There is a false dependency and 1 clock extra latency, but no extra µop is seen in the counters. I see no difference in the clock counts here whether the 128-bit instruction has a VEX prefix or not.
   
Test results for Broadwell and Skylake
Author: Tacit Murky Date: 2016-01-04 15:04
Hello, Agner. Thanks for the detailed work, but there are some oddities in the results that look like mistakes. Here are 2 examples:
For Haswell — «MOVBE r64,m64» is listed as a 3-mop instruction with a TP of 0.5 CPI (2 IPC), which is impossible given the 4 IPC total pipeline restriction. AIDA64 readout (see instlatx64.atw.hu) shows 1 IPC here.
For Skylake — «PMUL* (v,)v,v» is listed as a 1-mop instruction with only 1 IPC, despite 2 ports being available for execution (p01). AIDA64 shows a TP of 2 IPC (0.5 CPI) because of the second integer multiplier.
There are more minor mistakes elsewhere.
   
Test results for Broadwell and Skylake
Author: Agner Date: 2016-01-05 13:16
You are right.
The throughput for MOVBE r64,m64 is 4 instructions per 3 clock cycles.
The throughput for integer vector multiplication instructions and several other integer vector instructions is 2 instructions per clock for 128-bit and 256-bit registers, but 1 instruction per clock for 64-bit registers, because port 0 supports these instructions for all vector sizes, while port 1 supports the same instructions only for 128-bit and 256-bit vectors.
   
Test results for Broadwell and Skylake
Author: Tacit Murky Date: 2016-03-09 20:58
More stuff. Have you measured total T-put of immediate data? AIDA64 readout is inconsistent and may be erroneous. Things to consider:
1) The legacy decoder should have a different T-put than the µop-cache; the IDQ queue may or may not impose its own restrictions.
2) As is known for SB and IB (but may not be true for Haswell and newer CPUs; would be cool to test all of them), a µop-cache slot has 4 bytes of data for both imm and ofs fields; so if (there is an 8-byte const) or (the total length of imm and ofs consts is >4 bytes) — 2 entries are allocated for that µop. The literal pool in the scheduler may have its own restrictions in port number (3…6) and width (4 or 8 bytes).
3) Instructions of interest:
—MOV r32/64,imm32/64 : 4/8 bytes of literals per instruction with 4 IPC of max. T-put (ideally should be 16/32 bytes/cl.);
—ADD r32,imm32 : 4 bytes of literals per instruction with 4 IPC of max. T-put;
—BLENDPS/PD xmm,[r+ofs32],imm8 : 5 bytes of total literals per instruction with 3 IPC of max. T-put, but only 2 L1D reads/cl.; may substitute 3-rd blend with MOVAPS [r+ofs32],xmm , having 5+5+4=14 bytes of literals for 3 IPC (but 5 µops).
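To make the proposed test concrete, a hedged sketch (my own construction, not Tacit Murky's code): a block of MOV r64,imm64 instructions carrying 8 bytes of literal each, which should stress immediate-data delivery from the µop cache per point 2 above:

// Four 10-byte MOV r64,imm64 instructions = 32 bytes of literals;
// repeat the block to fill the loop body being measured.
__asm volatile ("mov $0x0123456789abcdef, %%rax\n\t"
                "mov $0x0123456789abcdef, %%rcx\n\t"
                "mov $0x0123456789abcdef, %%rdx\n\t"
                "mov $0x0123456789abcdef, %%rsi"
                ::: "rax", "rcx", "rdx", "rsi");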
   
Test results for Broadwell and Skylake
Author: Tacit Murky Date: 2016-06-05 15:26
Intel's Optimization Manual says certain things about Skylake's OoO-machine updates:
1. «Legacy decode pipeline» can deliver 5 µops/cl to IDQ, 1 more than before;
2. DSB can deliver 6 µops/cl to IDQ, 2 more than before;
3. There are 2 IDQ's (1 per thread) 64 µops each; all 64 can be used for a loop (in both threads);
4. Improved SMT performance with HT on, via the longer-latency PAUSE instruction and/or wider Retire.

All of this contradicts your results for Skylake. Or was that info related only to Bwl?

   
Minor bug in the microarchitecture manual
Author: SHK Date: 2016-01-10 13:05
Hi Agner, thanks a lot for your manuals, they're an invaluable source, even better than the official ones.

I've noticed a small error in microarchitecture.pdf. On page 148 (description of Skylake's pipeline), you say that "The sizes of the reorder buffer, reservation station and register file have allegedly been increased, but the details have not been published".
Their sizes have been published (224 slots for the ROB, 97 RS entries, 180 PREGS, and so on); you can view them on page 12 of this presentation from IDF15 (it's the SPCS001 session):

https://hubb.blob.core.windows.net/e5888822-986f-45f5-b1d7-08f96e618a7b-published/73ed87d8-209a-4ca1-b456-42a167ffd0bd/SPCS001%20-%20SF15_SPCS001_103f.pdf?sv=2014-02-14&sr=c&sig=XKetbBtWcJzdBjJEc1bFubMzOrEPpoVcK6%2Bm693ZUts%3D&se=2016-01-11T18%3A50%3A10Z&sp=rwd

Thanks again and keep up with the good work!

   
Minor bug in the microarchitecture manual
Author: Agner Date: 2016-01-16 03:26
Thanks for the tip. The link doesn't work. I found it here: myeventagenda.com/sessions/0B9F4191-1C29-408A-8B61-65D7520025A8/7/5 session SPCS001.
   
Test results for Broadwell and Skylake
Author: John D. McCalpin Date: 2016-01-12 13:54
I just ran across some performance counter bugs on Haswell that may influence one's interpretation of instruction retirement rates and may bias measurements of uops per instruction.

I put performance counters around 100 (outer) iterations of a simple 10-instruction loop that executed 1000 times. According to Agner's instruction tables this loop should have 12 uops. Both the fixed-function "instructions retired" and the programmable "INST_RETIRED.ANY_P" events report 12 instructions per loop iteration (not 10), while the UOPS_RETIRED.ALL programmable counter event reported 14 uops per loop iteration (not 12). While I could be misinterpreting the uop counts, there is no way that I could have mis-counted the instructions --- it took all of my fingers, but did not generate an overflow condition. ;-)

It turns out that there are a number of errata for both the instructions retired events and the uops retired event on all Intel Haswell processors. Somewhat perversely, the different Haswell products have different errata listed, even though they have the same DISPLAYFAMILY_DISPLAYMODEL designation, but all of them that I checked (Xeon E5 v3 (HSE71 in doc 330785), Xeon E3 v3 (HSW141 in doc 328908), and 4th Generation Core Desktop (HSD140 in doc 328899)) include an erratum to the effect that the "instructions retired" counts may overcount or undercount. This erratum is also listed for the 5th Generation Core (Broadwell) processors (BDM61 in doc 330836), but is not listed in the "specification update" document for the Skylake processors (doc 332689).

For this particular loop the counts are completely stable with respect to variations in loop length (e.g., from 500 to 11000 shows no effect other than asymptotically decreasing overhead). The machine is running with HyperThreading enabled, but there are no other users or non-OS tasks and this job was pinned to (local) core 4 on socket 1, so there is no way that interference with another thread (mentioned in several other errata) could account for seeing identical behavior over several hundred trials.

Reading between the lines, the language that Intel uses in the descriptions of these performance counter errata seems consistent with the language used in other cases for which the errors are not "large" (not approaching 100%), but are also not "small" (not limited to single-digit percentages). It is very hard to decide whether I want to take the time to try to characterize or bound this particular performance counter error. It may end up having an easy story, or it may end up being completely inexplicable without inspection of the processor RTL.

   
Test results for Broadwell and Skylake
Author: Jess Date: 2016-02-11 11:00
I notice that SKD044 on page 28 of this PDF:

www.intel.com/content/dam/www/public/us/en/documents/specification-updates/desktop-6th-gen-core-family-spec-update.pdf

explains why the discrepancy occurs and how large it is likely to be for this chip. Similar errata for other chips seem to be less detailed, though I haven't checked exhaustively.

   
Description of discrepancy
Author: Nathan Kurz Date: 2016-03-13 17:54
Jess wrote:
I notice that SKD044 on page 28 of this PDF:

www.intel.com/content/dam/www/public/us/en/documents/specification-updates/desktop-6th-gen-core-family-spec-update.pdf

explains why the discrepancy occurs and how large it is likely to be for this chip.

I appreciate the link, but I'm unable to find the portion that you refer to. Could you point more exactly to the details you found?

SKD044 doesn't exist in that document, SKL044 is about WRMSR, and nothing on page 28 seems relevant. I did find SKD044 in a different document (http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/6th-gen-core-u-y-spec-update.pdf) but still about WRMSR. The closest erratum I did find was SKL048 "Processor May Run Intel AVX Code Much Slower than Expected", but this is only when coming out of C6, and doesn't give other details.

   
Test results for Broadwell and Skylake
Author: Russell Van Zandt Date: 2016-02-22 17:50
Thank you all for the useful information. FYI, the latest Intel architecture optimization manual discusses the Skylake changes for the mixed AVX / SSE problem in great detail, including diagrams and tables. This is in section 11.3 Mixing AVX code with SSE code in the January 2016 edition. Skylake has not eliminated the problem entirely, with "partial register dependency + blend" as the penalty in one mode, and ~XSAVE in another mode. Use of VZEROUPPER is still recommended, in rule 72. "The Skylake microarchitecture implements a different state machine than prior generations to manage the YMM state transition associated with mixing SSE and AVX instructions. It no longer saves the entire upper YMM state transition ... but saves the upper bits of individual register. As a result ... will experience a penalty associated with partial register dependency...".

Other topics discussed include "Align data to 32 bytes" (Section 11.6.1), which was recently discussed in this blog too.

There is lots and lots of Skylake material, including the tradeoffs between electrical power reduction vs. performance. Like "The latency of the PAUSE instruction in prior generation microarchitecture is about 10 cycles, whereas on Skylake microarchitecture it has been extended to as many as 140 cycles... There's also a small power benefit in 2-core and 4-core systems... As the PAUSE latency has been increased significantly, workloads that are sensitive to PAUSE latency will suffer some performance loss." Section 8.4.7

   
Instruction Throughput on Skylake
Author: Nathan Kurz Date: 2016-04-23 13:16
In the Section 11 "Skylake" of your Microarchitecture Guide (http://www.agner.org/optimize/microarchitecture.pdf), you say: "There are four decoders, which can handle instructions generating up to four μops per clock cycle in the way described on page 121 for Sandy Bridge" and "Code that runs out of the μop cache are not subject to the limitations of the fetch and decode units. It can deliver a throughput of 4 (possibly fused) μops or the equivalent of 32 bytes of code per clock cycle."

This seems contradicted by Section 2.1 "Skylake Microarchitecture" of the Intel Optimization manual (http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf): "Legacy Decode Pipeline delivery of 5 uops per cycle to the IDQ compared to 4 uops in previous generations" and "The DSB delivers 6 uops per cycle to the IDQ compared to 4 uops in previous generations." These numbers also match Figure 2.1 in that guide, which makes me think the Intel manual is probably correct here.

About Skylake, you also say "It is designed for a throughput of four instructions per clock cycle." I've recently measured a few results that make me wonder if it's actually capable of more than that. Did you happen to do any tests that would confirm whether Skylake might be able to sustain 5 or 6 unfused instructions per cycle (thus possibly 7 or 8 including fused branches not taken) if the correct execution ports are available? From the published specs, I haven't been able to find evidence of a hard limit of 4 unfused instructions per cycle.

One stage for which I haven't been able to find documentation of the Skylake limits is retirement. Section 2.6.5 on Hyperthreading Retirement says "If one logical processor is not ready to retire any instructions, then all retirement bandwidth is dedicated to the other logical processor." I've seen claims that Skylake has "wider Hyperthreading retirement" than previous generations, and there is also a documented performance monitor event for "Cycles with less than 10 actually retired uops", which would imply that the maximum is at least 10. Do you know if this is true?

   
Instruction Throughput on Skylake
Author: Agner Date: 2016-04-24 00:02
Nathan Kurz wrote:
Did you happen to do any tests that would confirm whether Skylake might be able to sustain 5 or 6 unfused instructions per cycle (thus possibly 7 or 8 including fused branches not taken) if the correct execution ports are available?
NOPs have a throughput of 4 per clock cycle, and NOPs are not using any execution unit. I have never seen a higher throughput than 4 if you count a fused jump as one instruction. If two threads are running in the same core then each thread gets 2 NOPs per clock.

It is possible that the decoders have a higher throughput, but then there must be a bottleneck somewhere else. This will be hard to verify.

   
Instruction Throughput on Skylake
Author: Nathan Kurz Date: 2016-04-26 13:50
Agner wrote:
It is possible that the decoders have a higher throughput, but then there must be a bottleneck somewhere else. This will be hard to verify.
I'm starting to understand this better. Using Likwid and defining some custom events, I've determined that Skylake can sustain execution and retirement of 5 or 6 µops per cycle. This is ignoring jump/cc "macro-fusion", which would presumably boost us up to 7 or 8. The bottleneck appears to be the "renamer", which can only "issue" 4 µops per cycle.
The question is "What constitutes a µop for this stage?"

In 2.3.3.1 of the Intel Optimization Guide, when discussing Sandy Bridge it says: "The Renamer is the bridge between the in-order part in Figure 2-5, and the dataflow world of the Scheduler. It moves up to four micro-ops every cycle from the micro-op queue to the out-of-order engine. Although the renamer can send up to 4 micro-ops (unfused, micro-fused, or macro-fused) per cycle, this is equivalent to the issue port can dispatch six micro-ops per cycle."

The grammar is atrocious, but I think it means that while the Renamer can only move 4 µops, these can be micro-fused µops that will be "unlaminated" to a load µop and an action µop. From what I can tell, Skylake can move 6 fused µops per cycle from the DSB to the IDQ, but can only "issue" 4 fused µops per cycle from the IDQ. But since the scheduler only handles unfused µops, this means that we can "dispatch" up to twice that many depending on fusion.

The result of this is that while it is probably true to say that Skylake is "designed for a throughput of four instructions per clock cycle", instructions per clock cycle can be a poor metric to use when comparing fused and unfused instructions. Previously, I'd naively thought that once the instructions were decoded to the DSB, it didn't matter whether one expressed LOAD-OP as a single instruction, or as a separate LOAD then OP.

But if one is being constrained by the Renamer, it turns out that it can make a big difference in total execution time. For example, I'm finding that in a tight loop, this (two combined load-adds):

#define ASM_ADD_ADD_INDEX(in, sum1, sum2, index) \
__asm volatile ("add 0x0(%[IN], %[INDEX]), %[SUM1]\n" \
"add 0x8(%[IN], %[INDEX]), %[SUM2]\n" \
"add $0x10, %[INDEX]\n" : \
[IN] "+&r" (in), \
[SUM1] "+&r" (sum1), \
[SUM2] "+&r" (sum2), \
[INDEX] "+&r" (index))


is about 20% faster than this (two separate loads and adds):

#define ASM_LOAD_LOAD_INDEX(in, sum1, sum2, index, tmp) \
__asm volatile ("mov 0x0(%[IN], %[INDEX]), %[TMP]\n" \
"add %[TMP], %[SUM1]\n" \
"mov 0x8(%[IN], %[INDEX]), %[TMP]\n" \
"add %[TMP], %[SUM2]\n" \
"add $0x10, %[INDEX]\n" : \
[IN] "+&r" (in), \
[SUM1] "+&r" (sum1), \
[SUM2] "+&r" (sum2), \
[INDEX] "+&r" (index), \
[TMP] "=r" (tmp))

While the hybrid (one and one) is the same speed as the fast version:

#define ASM_LOAD_ADD_INDEX(in, sum1, sum2, index, tmp) \
__asm volatile ("mov 0x0(%[IN], %[INDEX]), %[TMP]\n" \
"add %[TMP], %[SUM1]\n" \
"add 0x8(%[IN], %[INDEX]), %[SUM2]\n" \
"add $0x10, %[INDEX]\n" : \
[IN] "+&r" (in), \
[SUM1] "+&r" (sum1), \
[SUM2] "+&r" (sum2), \
[INDEX] "+&r" (index), \
[TMP] "=r" (tmp))


What I don't understand yet is why all variations that directly increment %[IN] are almost twice as slow as the versions that use and increment %[INDEX]:

#define ASM_ADD_ADD_DIRECT(in, sum1, sum2) \
__asm volatile ("add 0x0(%[IN]), %[SUM1]\n" \
"add 0x8(%[IN]), %[SUM2]\n" \
"add $0x10, %[IN]\n" : \
[IN] "+&r" (in), \
[SUM1] "+&r" (sum1), \
[SUM2] "+&r" (sum2))

I also don't understand yet why I get 30% faster speeds for loops small enough to fit in the LSD than when unrolled such that the number of µops requires the DSB. Apparently the Loop Stream Detector still plays a performance role in some cases.

   
Instruction Throughput on Skylake
Author: Agner Date: 2016-04-27 01:14
Nathan Kurz wrote:
The bottleneck appears to be the "renamer", which can only "issue" 4 µops per cycle.
I think the decoding front end and the renamer are designed with a 4-wide pipeline for a throughput of four µops per clock. These µops are queuing up in the reservation station if execution of them is delayed for any reason. The scheduler can issue more than 4 µops per clock cycle in bursts until the queue is empty.

I also don't understand yet why I get 30% faster speeds for loops small enough to fit in the LSD than when unrolled such that the number of µops requires the DSB. Apparently the Loop Stream Detector still plays a performance role in some cases.
Instruction fetch and decode is often a bottleneck - you need to check the instruction lengths. Alignment of the loop entry can also influence the results. Finally, you will often see cache effects influencing the results in a less than obvious way.
   
Instruction Throughput on Skylake
Author: T Date: 2016-06-18 19:27
When you say:

> I think the decoding front end and the renamer are designed with a 4-wide pipeline for a throughput of four µops per clock.

Are you talking fused domain or unfused domain µops? Here I'm only interested in micro-fusion. Let's assume there are no opportunities for macro-fusion. If that's 4-wide in the fused domain, it implies that the processor could sustain 6 µops throughput in the unfused domain, if there are no 4 (or 5) wide bottlenecks downstream of the scheduler (e.g., issue or retirement). That would be a big deal, since it implies that read-modify instructions may be highly preferred in many scenarios over two separate load and reg-reg op instructions.

Hi Nathan,

Are you able to share your results about 5 or 6 wide throughput? You hinted at them in your post, but anything reproducible would be great.

T

   
Instruction Throughput on Skylake
Author: Agner Date: 2016-06-19 00:59
T wrote:
If that's 4-wide in the fused domain, it implies that the processor could sustain 6 µops throughput in the unfused domain, if there are no 4 (or 5) wide bottlenecks downstream of the scheduler (e.g., issue or retirement).
Yes, it can do 6 µops in the unfused domain.
   
Instruction Throughput on Skylake
Author: Nathan Kurz Date: 2016-07-08 02:50
T wrote:

Are you able to share your results about 5 or 6 wide throughput? You hinted at them in your post, but anything reproducible would be great.

All sharable, but I haven't been thinking about this direction for a couple months. I'll try to post something here if I can dig it up, but I won't be able to get to it immediately.

But if my recollection is correct, the short answer is that yes, Read-Modify instructions should almost always be used as heavily as possible for inner loops on modern Intel processors. They have significant upside if you would otherwise be limited by the renamer.

And while you say you are not interested in it, the corollary for micro-fusion is that CMP-JCC instructions should almost always be adjacent in assembly. I'm pretty sure that both GCC and LLVM would benefit from putting a higher penalty on the split.

   
Instruction Throughput on Skylake
Author: Nathan Kurz Date: 2016-07-11 22:21
OK, here's my cleaned up test code.

// gcc -g -Wall -O2 fusion.c -o fusion -DLIKWID -llikwid [may also need -lm -lpthread]
// likwid-perfctr -m -g UOPS_ISSUED_ANY:PMC0,UOPS_EXECUTED_CORE:PMC1,UOPS_RETIRED_ALL:PMC2,BR_INST_RETIRED_NEAR_TAKEN:PMC3 -C 1 fusion

#include <x86intrin.h>
#include <stdint.h>
#include <stdio.h>

#ifdef LIKWID
#include <likwid.h>
#define MEASURE_INIT() \
    do { \
        likwid_markerInit(); \
        likwid_markerThreadInit(); \
    } while (0)
#define MEASURE_FINI() \
    do { \
        likwid_markerClose(); \
    } while (0)
#define MEASURE(name, code) \
    do { \
        sum1 = sum2 = 0; \
        likwid_markerStartRegion(name); \
        code; \
        likwid_markerStopRegion(name); \
        printf("%s: sum1=%ld, sum2=%ld\n", name, sum1, sum2); \
    } while (0)
#else // not LIKWID
#define MEASURE_INIT()
#define MEASURE_FINI()
#define MEASURE(name, code) \
    do { \
        sum1 = sum2 = 0; \
        code; \
        printf("%s: sum1=%ld, sum2=%ld\n", name, sum1, sum2); \
    } while (0)
#endif // not LIKWID

#define ASM_TWO_MICRO_TWO_MACRO(in1, sum1, in2, sum2, max) \
    __asm volatile ("1:\n" \
                    "add (%[IN1]), %[SUM1]\n" \
                    "cmp %[MAX], %[SUM1]\n" \
                    "jae 2f\n" \
                    "add (%[IN2]), %[SUM2]\n" \
                    "cmp %[MAX], %[SUM2]\n" \
                    "jb 1b\n" \
                    "2:" : \
                    [SUM1] "+&r" (sum1), \
                    [SUM2] "+&r" (sum2) : \
                    [IN1] "r" (in1), \
                    [IN2] "r" (in2), \
                    [MAX] "r" (max))

#define ASM_NO_MICRO_TWO_MACRO(in1, sum1, in2, sum2, max, tmp1, tmp2) \
    __asm volatile ("1:\n" \
                    "mov (%[IN1]), %[TMP1]\n" \
                    "add %[TMP1], %[SUM1]\n" \
                    "cmp %[MAX], %[SUM1]\n" \
                    "jae 2f\n" \
                    "mov (%[IN2]), %[TMP2]\n" \
                    "add %[TMP2], %[SUM2]\n" \
                    "cmp %[MAX], %[SUM2]\n" \
                    "jb 1b\n" \
                    "2:" : \
                    [TMP1] "=&r" (tmp1), \
                    [TMP2] "=&r" (tmp2), \
                    [SUM1] "+&r" (sum1), \
                    [SUM2] "+&r" (sum2) : \
                    [IN1] "r" (in1), \
                    [IN2] "r" (in2), \
                    [MAX] "r" (max))

#define ASM_ONE_MICRO_TWO_MACRO(in1, sum1, in2, sum2, max, tmp) \
    __asm volatile ("1:\n" \
                    "add (%[IN1]), %[SUM1]\n" \
                    "cmp %[MAX], %[SUM1]\n" \
                    "jae 2f\n" \
                    "mov (%[IN2]), %[TMP]\n" \
                    "add %[TMP], %[SUM2]\n" \
                    "cmp %[MAX], %[SUM2]\n" \
                    "jb 1b\n" \
                    "2:" : \
                    [TMP] "=&r" (tmp), \
                    [SUM1] "+&r" (sum1), \
                    [SUM2] "+&r" (sum2) : \
                    [IN1] "r" (in1), \
                    [IN2] "r" (in2), \
                    [MAX] "r" (max))

#define ASM_ONE_MICRO_ONE_MACRO(in1, sum1, in2, sum2, max, tmp) \
    __asm volatile ("1:\n" \
                    "add (%[IN1]), %[SUM1]\n" \
                    "cmp %[MAX], %[SUM1]\n" \
                    "mov (%[IN1]), %[TMP]\n" \
                    "jae 2f\n" \
                    "add %[TMP], %[SUM2]\n" \
                    "cmp %[MAX], %[SUM2]\n" \
                    "jb 1b\n" \
                    "2:" : \
                    [TMP] "=&r" (tmp), \
                    [SUM1] "+&r" (sum1), \
                    [SUM2] "+&r" (sum2) : \
                    [IN1] "r" (in1), \
                    [IN2] "r" (in2), \
                    [MAX] "r" (max))

// two separate loads and adds, two non-fused cmp then jcc
#define ASM_NO_MICRO_NO_MACRO(in1, sum1, in2, sum2, max, tmp1, tmp2) \
    __asm volatile ("mov (%[IN1]), %[TMP1]\n" \
                    "1:\n" \
                    "add %[TMP1], %[SUM1]\n" \
                    "cmp %[MAX], %[SUM1]\n" \
                    "mov (%[IN2]), %[TMP2]\n" \
                    "jae 2f\n" \
                    "add %[TMP2], %[SUM2]\n" \
                    "cmp %[MAX], %[SUM2]\n" \
                    "mov (%[IN1]), %[TMP1]\n" \
                    "jb 1b\n" \
                    "2:" : \
                    [TMP1] "=&r" (tmp1), \
                    [TMP2] "=&r" (tmp2), \
                    [SUM1] "+&r" (sum1), \
                    [SUM2] "+&r" (sum2) : \
                    [IN1] "r" (in1), \
                    [IN2] "r" (in2), \
                    [MAX] "r" (max))

int main(/* int argc, char **argv */)
{
uint64_t tmp, tmp1, tmp2;
uint64_t sum1, sum2;
uint64_t in1 = 1;
uint64_t in2 = 1;
uint64_t max = 10000000;

MEASURE_INIT();

MEASURE("two_micro_two_macro", ASM_TWO_MICRO_TWO_MACRO(&in1, sum1, &in2, sum2, max));

MEASURE("one_micro_two_macro", ASM_ONE_MICRO_TWO_MACRO(&in1, sum1, &in2, sum2, max, tmp));

MEASURE("one_micro_one_macro", ASM_ONE_MICRO_ONE_MACRO(&in1, sum1, &in2, sum2, max, tmp));

MEASURE("no_micro_two_macro", ASM_NO_MICRO_TWO_MACRO(&in1, sum1, &in2, sum2, max, tmp1, tmp2));

MEASURE("no_micro_no_macro", ASM_NO_MICRO_NO_MACRO(&in1, sum1, &in2, sum2, max, tmp1, tmp2));

MEASURE_FINI();

return 0;
}

And here's what I see on Skylake:

nate@skylake:~/src$ likwid-perfctr -m -g UOPS_ISSUED_ANY:PMC0,UOPS_EXECUTED_CORE:PMC1,UOPS_RETIRED_ALL:PMC2,BR_INST_RETIRED_NEAR_TAKEN:PMC3 -C 1 fusion
CPU name:	Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
CPU type:	Intel Skylake processor
CPU clock:	3.41 GHz
--------------------------------------------------------------------------------
two_micro_two_macro: sum1=10000000, sum2=9999999
one_micro_two_macro: sum1=10000000, sum2=9999999
one_micro_one_macro: sum1=10000000, sum2=9999999
no_micro_two_macro: sum1=10000000, sum2=9999999
no_micro_no_macro: sum1=10000000, sum2=9999999
--------------------------------------------------------------------------------
================================================================================
Group 1 Custom: Region two_micro_two_macro
================================================================================
|       UOPS_ISSUED_ANY      |   PMC0  | 4.000816e+07 |
|     UOPS_EXECUTED_CORE     |   PMC1  | 6.000806e+07 |
|      UOPS_RETIRED_ALL      |   PMC2  | 6.000724e+07 |
| BR_INST_RETIRED_NEAR_TAKEN |   PMC3  | 1.000056e+07 |
|      INSTR_RETIRED_ANY     |  FIXC0  | 6.000540e+07 |
|    CPU_CLK_UNHALTED_CORE   |  FIXC1  | 1.001363e+07 |
================================================================================
Group 1 Custom: Region one_micro_two_macro
================================================================================
|       UOPS_ISSUED_ANY      |   PMC0  | 5.000502e+07 |
|     UOPS_EXECUTED_CORE     |   PMC1  | 6.000506e+07 |
|      UOPS_RETIRED_ALL      |   PMC2  | 6.000471e+07 |
| BR_INST_RETIRED_NEAR_TAKEN |   PMC3  | 1.000040e+07 |
|      INSTR_RETIRED_ANY     |  FIXC0  | 7.000316e+07 |
|    CPU_CLK_UNHALTED_CORE   |  FIXC1  | 1.334216e+07 |
================================================================================
Group 1 Custom: Region one_micro_one_macro
================================================================================
|       UOPS_ISSUED_ANY      |   PMC0  | 6.000435e+07 |
|     UOPS_EXECUTED_CORE     |   PMC1  | 7.000444e+07 |
|      UOPS_RETIRED_ALL      |   PMC2  | 7.000445e+07 |
| BR_INST_RETIRED_NEAR_TAKEN |   PMC3  | 1.000039e+07 |
|      INSTR_RETIRED_ANY     |  FIXC0  | 7.000310e+07 |
|    CPU_CLK_UNHALTED_CORE   |  FIXC1  | 1.672351e+07 |
================================================================================
Group 1 Custom: Region no_micro_two_macro
================================================================================
|       UOPS_ISSUED_ANY      |   PMC0  | 6.000429e+07 |
|     UOPS_EXECUTED_CORE     |   PMC1  | 6.000438e+07 |
|      UOPS_RETIRED_ALL      |   PMC2  | 6.000438e+07 |
| BR_INST_RETIRED_NEAR_TAKEN |   PMC3  | 1.000039e+07 |
|      INSTR_RETIRED_ANY     |  FIXC0  | 8.000307e+07 |
|    CPU_CLK_UNHALTED_CORE   |  FIXC1  | 1.500636e+07 |
================================================================================
Group 1 Custom: Region no_micro_no_macro
================================================================================
|       UOPS_ISSUED_ANY      |   PMC0  | 8.000476e+07 |
|     UOPS_EXECUTED_CORE     |   PMC1  | 8.000483e+07 |
|      UOPS_RETIRED_ALL      |   PMC2  | 8.000466e+07 |
| BR_INST_RETIRED_NEAR_TAKEN |   PMC3  | 1.000039e+07 |
|      INSTR_RETIRED_ANY     |  FIXC0  | 8.000312e+07 |
|    CPU_CLK_UNHALTED_CORE   |  FIXC1  | 2.000775e+07 |

And on Haswell:

nate@haswell:~/src$ likwid-perfctr -m -g UOPS_ISSUED_ANY:PMC0,UOPS_EXECUTED_CORE:PMC1,UOPS_RETIRED_ALL:PMC2,BR_INST_RETIRED_NEAR_TAKEN:PMC3 -C 1 fusion
-------------------------------------------------------------
-------------------------------------------------------------
CPU type:	Intel Core Haswell processor
CPU clock:	3.39 GHz
-------------------------------------------------------------
fusion
two_micro_two_macro: sum1=10000000, sum2=9999999
one_micro_two_macro: sum1=10000000, sum2=9999999
one_micro_one_macro: sum1=10000000, sum2=9999999
no_micro_two_macro: sum1=10000000, sum2=9999999
no_micro_no_macro: sum1=10000000, sum2=9999999
=====================
Region: two_micro_two_macro
=====================
|      UOPS_ISSUED_ANY       | 4.00061e+07 |
|     UOPS_EXECUTED_CORE     | 6.00062e+07 |
|      UOPS_RETIRED_ALL      | 6.00046e+07 |
| BR_INST_RETIRED_NEAR_TAKEN | 1.00002e+07 |
|     INSTR_RETIRED_ANY      | 6.00013e+07 |
|   CPU_CLK_UNHALTED_CORE    | 1.7392e+07  |
=====================
Region: one_micro_two_macro
=====================
+----------------------------+-------------+
|           Event            |   core 1    |
+----------------------------+-------------+
|      UOPS_ISSUED_ANY       | 5.00062e+07 |
|     UOPS_EXECUTED_CORE     | 6.00062e+07 |
|      UOPS_RETIRED_ALL      | 6.00046e+07 |
| BR_INST_RETIRED_NEAR_TAKEN | 1.00002e+07 |
|     INSTR_RETIRED_ANY      | 7.00013e+07 |
|   CPU_CLK_UNHALTED_CORE    | 1.4247e+07  |
=====================
Region: one_micro_one_macro
=====================
+----------------------------+-------------+
|           Event            |   core 1    |
+----------------------------+-------------+
|      UOPS_ISSUED_ANY       | 6.00065e+07 |
|     UOPS_EXECUTED_CORE     | 7.00065e+07 |
|      UOPS_RETIRED_ALL      | 7.00048e+07 |
| BR_INST_RETIRED_NEAR_TAKEN | 1.00002e+07 |
|     INSTR_RETIRED_ANY      | 7.00013e+07 |
|   CPU_CLK_UNHALTED_CORE    | 1.69403e+07 |
=====================
Region: no_micro_two_macro
=====================
+----------------------------+-------------+
|           Event            |   core 1    |
+----------------------------+-------------+
|      UOPS_ISSUED_ANY       | 6.00062e+07 |
|     UOPS_EXECUTED_CORE     | 6.00062e+07 |
|      UOPS_RETIRED_ALL      | 6.00046e+07 |
| BR_INST_RETIRED_NEAR_TAKEN | 1.00002e+07 |
|     INSTR_RETIRED_ANY      | 8.00013e+07 |
|   CPU_CLK_UNHALTED_CORE    | 1.57365e+07 |
=====================
Region: no_micro_no_macro
=====================
|      UOPS_ISSUED_ANY       | 8.00062e+07 |
|     UOPS_EXECUTED_CORE     | 8.00062e+07 |
|      UOPS_RETIRED_ALL      | 8.00046e+07 |
| BR_INST_RETIRED_NEAR_TAKEN | 1.00002e+07 |
|     INSTR_RETIRED_ANY      | 8.00013e+07 |
|   CPU_CLK_UNHALTED_CORE    | 2.0043e+07  |
+----------------------------+-------------+

The main thing to notice is that on Skylake the "two_micro_two_macro" case is fastest and executes at 1 cycle per iteration, while on Haswell it is slower than a couple of options with less fusion. BR_INST_RETIRED_NEAR_TAKEN shows the number of loop iterations. Run time in cycles is shown by CPU_CLK_UNHALTED_CORE. The difference between INSTR_RETIRED_ANY and UOPS_RETIRED_ALL shows the effect of macro-fusion of CMP/JCC. The difference between UOPS_ISSUED_ANY and UOPS_EXECUTED_CORE shows the effect of micro-fusion of LOAD/ADD. UOPS_EXECUTED_CORE and UOPS_RETIRED_ALL are the same on both machines, showing that there is no branch misprediction occurring.

   
Instruction Throughput on Skylake
Author: Tacit Murky Date: 2016-07-17 14:14
Interesting results. Looks like it's about the number of renamed registers. Apparently, Hwl had a lower T-put restriction in the renamer, and it was upgraded for Skl. This explains the faster case for Hwl (more µops with fewer arguments each, but only up to a certain point). The peak issue rate is still 4 fused µIPC from the IDQ to rename, but 6 unfused µIPC (corresponding to up to 6 IPC) at retire for Skl. Hwl can't allow more than 5 unfused µIPC.
   
Instruction Throughput on Skylake
Author: T Date: 2016-08-08 01:57
Thank you very much for that. It is really interesting, and implies that compilers and assembly writers should tune differently for Haswell vs Skylake. I wonder if icc has been updated to reflect it?
   
Unlamination of micro-fused ops in SKL and earlier
Author: Travis Date: 2016-09-09 19:36
There is an interesting effect which changed in Skylake (or at least in some architecture after Sandy Bridge, up to and including Skylake), but isn't covered in your manual. It concerns the behavior of micro-fused instructions with *complex* memory source or destination operands. Here, complex means with both base and index registers, so something like

add rax, [rbx + rcx]

In Sandy Bridge, this doesn't seem to micro-fuse in the same way as simpler addressing modes such as:

add rax, [rbx + 16]

In particular, while it seems that the complex addressing modes fuse in the uop cache, the constituent ops are later "unlaminated" and consume rename and retirement resources. This means that you cannot achieve 4 micro-fused uops/cycle throughput with these addressing modes. The Intel optimization doc does touch on it briefly in 2.3.2.4 Micro-op Queue and the Loop Stream Detector (LSD):

In particular, loads combined with computational operations and all stores, when used with indexed addressing, are represented as a single micro-op in the decoder or Decoded ICache. In the micro-op queue they are fragmented into two micro-ops through a process called un-lamination, one does the load and the other does the operation. A typical example is the following "load plus operation" instruction:

ADD RAX, [RBP+RSI]; rax := rax + LD( RBP+RSI )

The Intel section is a bit unclear because they don't make it explicit that this only applies to indexed addressing modes, and that if you avoid indexed addressing you can potentially achieve higher throughput.

This issue could be pretty critical for optimization of high IPC loops, on a par with many similar issues covered in your doc. In particular, it means jumping through a few hoops to be able to use a simpler addressing mode could be worth it - beyond the latency benefits already documented in your guide (and beyond the ability to use port 7 AGU for store address calculation as well).
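
A sketch of the kind of hoop-jumping I mean (my example): strength-reduce the index to a moving pointer, so the memory operand keeps a simple base+displacement form and stays micro-fused on SnB through Broadwell:

; indexed form: un-laminates,        ; simple form: stays micro-fused
; costs 2 slots at rename/retire
loop1:                               loop2:
  add rax, [rbx + rcx*8]               add rax, [rbx]
  add rcx, 1                           add rbx, 8
  cmp rcx, rdx                         cmp rbx, rdx
  jb  loop1                            jb  loop2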

It might be nice to add it to your doc! There is an extensive investigation in this stackoverflow question, which is what prompted me to post here. See in particular the answer from Peter Cordes, who shows the issue on Sandy Bridge. In another answer I have some tests that show the limitation is removed on Skylake, but we don't know exactly in which arch it was removed. The Intel doc is mostly silent on the topic (unlamination is only discussed in the one SB-specific section I linked above). If you have some other machines at your disposal, I have some code here that makes it easy to test the behavior (on Linux).