Agner`s CPU blog

Software optimization resources | E-mail subscription to this blog | www.agner.org

Test results for Broadwell and Skylake
Author:  Date: 2015-12-28 06:19
Thanks for your excellent work on the instruction tables and microarchitecture guide.

Agner wrote:

This annoying state shift and penalty has been eliminated on the Skylake. Apparently, the implementation of 256-bit registers has become more streamlined.

I wonder if the performance penalty has been replaced with a power-consumption penalty. Perhaps there's still a "state C" where Skylake uses more power? The performance penalty on the earlier CPUs ensures most software will still avoid this. I don't think this is very likely; probably they came up with some clever way to avoid penalties except maybe when forwarding results from a non-VEX op to a 256b op (over the bypass network).

Do 128b non-VEX ops have a "false" dependency on the upper128 of a register? Is there a latency penalty when a 256b insn reads a ymm register last written by a non-VEX insn (or an extra uop to merge the xmm into the ymm)?

More importantly, is VZEROUPPER helpful in any way on Skylake? (Obviously this is a bad idea for binaries that might be run on older CPUs).

There is one use-case for mixing VEX and non-VEX : PBLENDVB x,x,xmm0 is 1 uop, p015. VPBLENDVB v,v,v,v is 2 uops, 2p015, and 2c latency. I'm picturing a function that needs to do a lot of blends, and but can also benefit from using 3-operand non-destructive VEX insns, except for non-VEX PBLENDVB.

Also: I remember reading something in a realworldtech forum thread about wider uop fetch in Skylake. (The forum isn't searchable, so I prob. can't find it now). Is there any improvement in the frontend for loops that don't fit in the loop buffer? I was hoping Skylake would fetch whole uop cache lines (up to 6 uops) per clock, and put them into a small buffer to more consistently issue 4 fused-domain uops per clock.

I've considered trying to align / re-ordering insns for uop-cache throughput in a loop that didn't quite fit in the loop buffer. I saw performance differences (on SnB) from reordering, but I never went beyond trial and error. I don't have an editor that shows the assembled binary updated on the fly as source edits are made, let alone with 32B boundaries marked and uops grouped into cache lines, so it would have been very time consuming.

 
thread Test results for Broadwell and Skylake new - Agner - 2015-12-26
replythread Sustained 64B loads per cycle on Haswell & Sky new - Nathan Kurz - 2015-12-26
last replythread Sustained 64B loads per cycle on Haswell & Sky new - Agner - 2015-12-27
last replythread Sustained 64B loads per cycle on Haswell & Sky new - Nathan Kurz - 2015-12-27
reply Sustained 64B loads per cycle on Haswell & Sky new - John D. McCalpin - 2016-01-04
reply Sustained 64B loads per cycle on Haswell & Sky new - T - 2016-06-18
last reply Sustained 64B loads per cycle on Haswell & Sky new - Jens Nurmann - 2017-01-12
replythread Test results for Broadwell and Skylake - Peter Cordes - 2015-12-28
last reply Test results for Broadwell and Skylake new - Agner - 2015-12-29
replythread Test results for Broadwell and Skylake new - Tacit Murky - 2016-01-04
last replythread Test results for Broadwell and Skylake new - Agner - 2016-01-05
last replythread Test results for Broadwell and Skylake new - Tacit Murky - 2016-03-09
last reply Test results for Broadwell and Skylake new - Tacit Murky - 2016-06-05
replythread Minor bug in the microarchitecture manual new - SHK - 2016-01-10
last reply Minor bug in the microarchitecture manual new - Agner - 2016-01-16
replythread Test results for Broadwell and Skylake new - John D. McCalpin - 2016-01-12
last replythread Test results for Broadwell and Skylake new - Jess - 2016-02-11
last reply Description of discrepancy new - Nathan Kurz - 2016-03-13
reply Test results for Broadwell and Skylake new - Russell Van Zandt - 2016-02-22
replythread Instruction Throughput on Skylake new - Nathan Kurz - 2016-04-23
last replythread Instruction Throughput on Skylake new - Agner - 2016-04-24
last replythread Instruction Throughput on Skylake new - Nathan Kurz - 2016-04-26
last replythread Instruction Throughput on Skylake new - Agner - 2016-04-27
last replythread Instruction Throughput on Skylake new - T - 2016-06-18
reply Instruction Throughput on Skylake new - Agner - 2016-06-19
last replythread Instruction Throughput on Skylake new - Nathan Kurz - 2016-07-08
last replythread Instruction Throughput on Skylake new - Nathan Kurz - 2016-07-11
replythread Instruction Throughput on Skylake new - Tacit Murky - 2016-07-17
last replythread Haswell register renaming / unfused limits new - Peter Cordes - 2017-05-11
reply Haswell register renaming / unfused limits new - Tacit Murky - 2017-05-11
last reply Haswell register renaming / unfused limits new - Peter Cordes - 2017-05-12
last reply Instruction Throughput on Skylake new - T - 2016-08-08
reply Unlamination of micro-fused ops in SKL and earlier new - Travis - 2016-09-09
replythread 32B store-forwarding is slower than 16B new - Peter Cordes - 2017-05-11
last replythread 32B store-forwarding is slower than 16B new - Fabian Giesen - 2017-06-28
last reply 32B store-forwarding is slower than 16B new - Agner - 2017-06-28
reply SHL/SHR r,cl latency is lower than throughput new - Peter Cordes - 2017-05-27
replythread Test results for Broadwell and Skylake new - Bulat Ziganshin - 2017-05-30
last replythread Test results for Broadwell and Skylake new - Agner - 2017-05-30
last replythread Test results for Broadwell and Skylake new - Bulat Ziganshin - 2017-05-30
last replythread Test results for Broadwell and Skylake new - - - 2017-06-19
replythread Test results for Broadwell and Skylake new - Jorcy Neto - 2017-06-20
last reply Test results for Broadwell and Skylake new - Jorcy Neto - 2017-06-20
replythread Test results for Broadwell and Skylake new - Bulat Ziganshin - 2017-06-21
reply Test results for Broadwell and Skylake new - Jorcy Neto - 2017-06-26
last replythread Test results for Broadwell and Skylake new - - - 2017-07-05
last replythread Test results for Broadwell and Skylake new - - - 2017-07-12
last reply Test results for Broadwell and Skylake new - Jorcy Neto - 2017-07-19
last replythread Test results for Broadwell and Skylake new - Xing Liu - 2017-06-28
last replythread Test results for Broadwell and Skylake new - Travis - 2017-06-29
last replythread Test results for Broadwell and Skylake new - Xing Liu - 2017-06-30
last reply Test results for Broadwell and Skylake new - Travis - 2017-07-13
reply Official information about uOps and latency SNB+ new - SEt - 2017-07-17
last replythread Test results for Broadwell and Skylake new - Armand Behroozi - 2020-10-07
last reply Test results for Broadwell and Skylake new - Agner - 2020-10-11