Agner`s CPU blog

Software optimization resources | E-mail subscription to this blog | www.agner.org

Test results for Broadwell and Skylake
Author: Agner Date: 2015-12-26 08:27
The optimization manuals at www.agner.org/optimize/#manuals have been updated. I have now tested the Intel Broadwell and Skylake processors. I have not tested the AMD Excavator and Puma because I cannot find suitable motherboards for testing them.

The test results show that the pipeline and execution units in Broadwell is very similar to its predecessor Haswell, while the Skylake has been reorganized a little.

The Skylake has a somewhat improved cache throughput and supports the new DDR4 RAM. This is important since RAM access is the bottleneck in many applications. On the other hand, the Skylake has reduced the level-2 cache associativity from 8 to 4.

Floating point division has been improved a little in Broadwell and integer division has been improved a little in Skylake. Gather instructions, which are used for collecting non-contiguous data from memory and joining them into a vector register, are improved somewhat in Broadwell, and a little more in Skylake. This makes it more efficient to collect data into vector registers.

Ever since the first Intel processor with out-of-order execution was released in 1995, there has been a limitation that no micro-operation could have more than two input dependencies. This meant that instructions with more than two input dependencies were split into two or more micro-operations. The introduction of fused multiply-and-add (FMA) instructions in Haswell made it necessary to overcome this limitation. Thus, the FMA instructions were the first instructions to be implemented with micro-operations with three input dependencies in an Intel processor. Once this limitation has been broken, the new capability can also be applied to other instructions. The Broadwell has extended the capability for three-input micro-operations to add-with-carry, subtract-with-borrow and conditional move instructions. The Skylake has extended it further to a blend instruction. AMD processors have never had this limitation of two input dependencies. Perhaps this is the reason why AMD came before Intel with FMA instructions.

The Haswell and Broadwell have two execution units for floating point multiplication and FMA, but only one for addition. This is odd since most floating point code has more additions than multiplications. To get the maximum floating point throughput on these processors, one might have to replace some additions with FMA instructions with a multiplier of 1. Fortunately, the Skylake has fixed this imbalance and made two floating point arithmetic units, both of which can handle both addition, multiplication and FMA. This gives a maximum throughput of two floating point vector operations per clock cycle.

The Skylake has increased the number of execution units for integer vector arithmetic from two to three. In general, the Skylake now has multiple execution units for almost all common operations (except memory write and data permutations). This means that an instruction or micro-operation rarely has to wait for a vacant execution unit. A throughput of four instructions per clock cycle is now a realistic goal for CPU-intensive code, unless the software contains long dependency chains. All arithmetic and logic units support vectors of up to 256 bits. The anticipated support for 512-bit vectors with the AVX-512 instruction set has been postponed to 2016 or 2017.

Intel's design has traditionally tried to standardize operation latencies, i. e. the number of clock cycles that a micro-operation takes. Operations with the same latencies were organized under the same execution port in order to avoid a clash when operations that start at different times would finish at the same time and so need the result bus at the same time. The Skylake microarchitecture has been improved to allow operations with several different latencies under the same execution port. There is still some standardization of latencies left, though. All floating point additions, multiplications and FMA operations have a latency of 4 clock cycles on Skylake, while these were 3 and 5 on previous processors.

Store forwarding is one clock cycle faster on Skylake than on previous processors. Store forwarding is the time it takes to read from a memory address immediately after writing to the same address.

Previous Intel processors have different states for code that use the AVX instruction sets allowing 256-bit vectors versus legacy code with 128-bit vectors and no VEX prefixes. The Sandy Bridge, Ivy Bridge, Haswell and Broadwell processors all have these states and a serious penalty of 70 clock cycles for state switching when a piece of code accidentally mixed VEX and non-VEX instructions. This annoying state shift and penalty has been eliminated on the Skylake. Apparently, the implementation of 256-bit registers has become more streamlined.

I observed an interesting phenomenon when executing 256-bit vector instructions on the Skylake. There is a warm-up period of approximately 14 µs before it can execute 256-bit vector instructions at full speed. Apparently, the upper 128-bit half of the execution units and data buses is turned off in order to save power when it is not used. As soon as the processor sees a 256-bit instruction it starts to power up the upper half. It can still execute 256-bit instructions during the warm-up period, but it does so by using the lower 128-bit units twice for every 256-bit vector. The result is that the throughput for 256-bit vectors is 4-5 times slower during this warm-up period. If you know in advance that you will need to use 256-bit instructions soon, then you can start the warm-up process by placing a dummy 256-bit instruction at a strategic place in the code. My measurements showed that the upper half of the units is shut down again after 675 µs of inactivity.

This warm-up phenomenon has reportedly been observed in previous processors as well (see agner.org/optimize/blog/read.php?i=378#378), but I have not observed it before in any of the processors that I have tested. Perhaps some high-end versions of Intel processors have this ability to shut down the upper 128-bit lane in order to save power, while other variants of the same processors have no such feature. This is something that needs further investigation.

 
thread Test results for Broadwell and Skylake - Agner - 2015-12-26
replythread Sustained 64B loads per cycle on Haswell & Sky new - Nathan Kurz - 2015-12-26
last replythread Sustained 64B loads per cycle on Haswell & Sky new - Agner - 2015-12-27
last replythread Sustained 64B loads per cycle on Haswell & Sky new - Nathan Kurz - 2015-12-27
reply Sustained 64B loads per cycle on Haswell & Sky new - John D. McCalpin - 2016-01-04
reply Sustained 64B loads per cycle on Haswell & Sky new - T - 2016-06-18
last reply Sustained 64B loads per cycle on Haswell & Sky new - Jens Nurmann - 2017-01-12
replythread Test results for Broadwell and Skylake new - Peter Cordes - 2015-12-28
last reply Test results for Broadwell and Skylake new - Agner - 2015-12-29
replythread Test results for Broadwell and Skylake new - Tacit Murky - 2016-01-04
last replythread Test results for Broadwell and Skylake new - Agner - 2016-01-05
last replythread Test results for Broadwell and Skylake new - Tacit Murky - 2016-03-09
last reply Test results for Broadwell and Skylake new - Tacit Murky - 2016-06-05
replythread Minor bug in the microarchitecture manual new - SHK - 2016-01-10
last reply Minor bug in the microarchitecture manual new - Agner - 2016-01-16
replythread Test results for Broadwell and Skylake new - John D. McCalpin - 2016-01-12
last replythread Test results for Broadwell and Skylake new - Jess - 2016-02-11
last reply Description of discrepancy new - Nathan Kurz - 2016-03-13
reply Test results for Broadwell and Skylake new - Russell Van Zandt - 2016-02-22
replythread Instruction Throughput on Skylake new - Nathan Kurz - 2016-04-23
last replythread Instruction Throughput on Skylake new - Agner - 2016-04-24
last replythread Instruction Throughput on Skylake new - Nathan Kurz - 2016-04-26
last replythread Instruction Throughput on Skylake new - Agner - 2016-04-27
last replythread Instruction Throughput on Skylake new - T - 2016-06-18
reply Instruction Throughput on Skylake new - Agner - 2016-06-19
last replythread Instruction Throughput on Skylake new - Nathan Kurz - 2016-07-08
last replythread Instruction Throughput on Skylake new - Nathan Kurz - 2016-07-11
replythread Instruction Throughput on Skylake new - Tacit Murky - 2016-07-17
last replythread Haswell register renaming / unfused limits new - Peter Cordes - 2017-05-11
reply Haswell register renaming / unfused limits new - Tacit Murky - 2017-05-11
last reply Haswell register renaming / unfused limits new - Peter Cordes - 2017-05-12
last reply Instruction Throughput on Skylake new - T - 2016-08-08
reply Unlamination of micro-fused ops in SKL and earlier new - Travis - 2016-09-09
replythread 32B store-forwarding is slower than 16B new - Peter Cordes - 2017-05-11
last replythread 32B store-forwarding is slower than 16B new - Fabian Giesen - 2017-06-28
last reply 32B store-forwarding is slower than 16B new - Agner - 2017-06-28
reply SHL/SHR r,cl latency is lower than throughput new - Peter Cordes - 2017-05-27
replythread Test results for Broadwell and Skylake new - Bulat Ziganshin - 2017-05-30
last replythread Test results for Broadwell and Skylake new - Agner - 2017-05-30
last replythread Test results for Broadwell and Skylake new - Bulat Ziganshin - 2017-05-30
last replythread Test results for Broadwell and Skylake new - - - 2017-06-19
replythread Test results for Broadwell and Skylake new - Jorcy Neto - 2017-06-20
last reply Test results for Broadwell and Skylake new - Jorcy Neto - 2017-06-20
replythread Test results for Broadwell and Skylake new - Bulat Ziganshin - 2017-06-21
reply Test results for Broadwell and Skylake new - Jorcy Neto - 2017-06-26
last replythread Test results for Broadwell and Skylake new - - - 2017-07-05
last replythread Test results for Broadwell and Skylake new - - - 2017-07-12
last reply Test results for Broadwell and Skylake new - Jorcy Neto - 2017-07-19
last replythread Test results for Broadwell and Skylake new - Xing Liu - 2017-06-28
last replythread Test results for Broadwell and Skylake new - Travis - 2017-06-29
last replythread Test results for Broadwell and Skylake new - Xing Liu - 2017-06-30
last reply Test results for Broadwell and Skylake new - Travis - 2017-07-13
last reply Official information about uOps and latency SNB+ new - SEt - 2017-07-17