Thanks for your excellent work on the instruction tables and microarchitecture guide.
Agner wrote:
This annoying state shift and penalty has been eliminated on the Skylake. Apparently, the implementation of 256-bit registers has become more streamlined.
I wonder if the performance penalty has been replaced with a power-consumption penalty. Perhaps there's still a "state C" where Skylake uses more power; the performance penalty on earlier CPUs ensures most software will keep avoiding mixed VEX/non-VEX code anyway. I don't think that's very likely, though; probably they came up with some clever way to avoid penalties, except maybe when forwarding results from a non-VEX op to a 256b op (over the bypass network).
Do 128b non-VEX ops have a "false" dependency on the upper 128 bits of the destination register? And is there a latency penalty when a 256b insn reads a ymm register last written by a non-VEX insn (or an extra uop to merge the xmm into the ymm)?
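For concreteness, the sequence I'm asking about would look like this (NASM syntax; a hypothetical sketch, with whether the 256b read pays anything being exactly the open question):

```
paddd   xmm0, xmm1        ; legacy-SSE write: leaves ymm0's upper 128b untouched
vpaddd  ymm2, ymm0, ymm3  ; 256b read of ymm0: merge uop / extra latency here?
```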
More importantly, is VZEROUPPER helpful in any way on Skylake? (Obviously omitting it is a bad idea in binaries that might be run on older CPUs.)
There is one use-case for mixing VEX and non-VEX: PBLENDVB x,x,xmm0 is 1 uop for p015, while VPBLENDVB v,v,v,v is 2 uops (2p015) with 2c latency. I'm picturing a function that needs to do a lot of blends, but can also benefit from 3-operand non-destructive VEX insns for everything except the PBLENDVB itself.
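A sketch of the pattern I mean (NASM syntax; assumes the upper halves are already clean, e.g. from an earlier VZEROUPPER, so even pre-Skylake CPUs shouldn't hit a transition penalty from the legacy-SSE insn):

```
; mask computed with a non-destructive 3-operand VEX insn
vpcmpgtd xmm0, xmm2, xmm3   ; xmm0 = compare mask (the implicit blend control)
; legacy-SSE blend: 1 uop on p015, vs. 2 uops / 2c latency for vpblendvb
pblendvb xmm4, xmm5         ; xmm4 = blend(xmm4, xmm5) by xmm0; destructive
```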
Also: I remember reading something in a realworldtech forum thread about wider uop fetch in Skylake. (The forum isn't searchable, so I prob. can't find it now). Is there any improvement in the frontend for loops that don't fit in the loop buffer? I was hoping Skylake would fetch whole uop cache lines (up to 6 uops) per clock, and put them into a small buffer to more consistently issue 4 fused-domain uops per clock.
I've considered trying to align / re-order insns for uop-cache throughput in a loop that didn't quite fit in the loop buffer. I saw performance differences (on SnB) from reordering, but I never went beyond trial and error. I don't have an editor that shows the assembled binary updated on the fly as source edits are made, let alone with 32B boundaries marked and uops grouped into cache lines, so it would have been very time consuming.
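The kind of experiment I have in mind would start from something like this (NASM syntax, hypothetical loop body): pin the loop to a 32B boundary so it begins on a fresh uop-cache line, then shuffle insns to pack the lines:

```
align 32                      ; loop starts at a 32B boundary = new uop-cache line
.loop:
    vpaddd  ymm0, ymm0, ymm1
    vpmulld ymm2, ymm2, ymm3
    sub     ecx, 1
    jnz     .loop
```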