Agner's CPU blog


Test results for Broadwell and Skylake
Author: Bulat Ziganshin  Date: 2017-05-30 12:31
More than a year ago I wrote to you that Skylake would have single-issue AVX-512. Here are a few more details on why I came to this conclusion (partly copied from my post on AnandTech):

I can give you details about AVX-512; they are fairly obvious from an analysis of the Skylake execution ports:

1) AVX-512 is mainly single-issue. All the AVX instructions that are now supported on BOTH port 0 and port 1 will become AVX-512 instructions supported on the joined port 0+1.

2) The few instructions that are supported only on port 5 (these are the various bit shuffles) will also be single-issue in AVX-512, which still means doubled performance: from single-issue 256-bit AVX to single-issue AVX-512.

3) The few instructions that can be issued on any of the 3 ports (0, 1, 5), including booleans and add/sub/cmp (the so-called PADD group), will be double-issue in AVX-512, so they will get a 33% uplift (from 3 x 256 bits to 2 x 512 bits per cycle).

Overall, ports 0 and 1 will join when executing 512-bit instructions, while port 5 is extended to 512-bit operands. The joined port 0+1 can execute almost any AVX-512 instruction except the bit shuffles; port 5 can execute the bit shuffles and the PADD group.
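The arithmetic behind these three cases can be checked with a small toy model. This is only a sketch of the port layout as predicted in this post, not measured data; the class names and the helper functions `avx512_units` and `bits_per_cycle` are made up for illustration.

```python
# Toy throughput model for the predicted port layout (an assumption taken
# from the post, not a measurement).
AVX2_WIDTH = 256    # bits per instruction on each of ports 0, 1, 5
AVX512_WIDTH = 512  # bits per instruction on fused port 0+1 or on port 5

# Ports that can execute each instruction class in 256-bit AVX code.
AVX2_PORTS = {
    "p01 only (most arithmetic)": {0, 1},
    "p5 only (bit shuffles)": {5},
    "p015 (PADD group, booleans)": {0, 1, 5},
}

def avx512_units(ports):
    """Count 512-bit issue slots: ports 0 and 1 fuse into one, port 5 stays."""
    units = 0
    if {0, 1} <= ports:
        units += 1
    if 5 in ports:
        units += 1
    return units

def bits_per_cycle():
    """Return {class: (avx2 bits/cycle, avx512 bits/cycle, ratio)}."""
    rows = {}
    for name, ports in AVX2_PORTS.items():
        avx2 = len(ports) * AVX2_WIDTH
        avx512 = avx512_units(ports) * AVX512_WIDTH
        rows[name] = (avx2, avx512, avx512 / avx2)
    return rows

if __name__ == "__main__":
    for name, (avx2, avx512, ratio) in bits_per_cycle().items():
        print(f"{name}: {avx2} -> {avx512} bits/cycle ({ratio:.2f}x)")
```

Under these assumptions the model gives no peak gain for the port 0+1 class (512 bits per cycle either way), 2x for the port 5 class, and 4/3 for the PADD group.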

---------

When going from SSE to AVX, Intel sacrificed ease of programming for ease of hardware implementation, resulting in an almost complete lack of instructions that can exchange data between the upper and lower halves of a ymm register (the so-called lanes). AVX-512 was done right, but this means that the bit shuffle instructions require a full 512-bit mesh. So Intel moved all these instructions to port 5, making it the only full 512-bit port, while most of the remaining instructions were moved to ports 0 and 1, where a 512-bit instruction can be implemented as a simple pair of 256-bit ones.
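To illustrate the lane restriction, here is a small software model of one in-lane AVX2 shuffle. The post does not name a specific instruction; the 256-bit VPSHUFB byte shuffle is picked here as an example, and the function name `vpshufb_256` is hypothetical. Each control byte can only select a source byte from its own 128-bit lane.

```python
def vpshufb_256(src, ctrl):
    """Model of 256-bit VPSHUFB: each 128-bit lane is shuffled independently."""
    assert len(src) == 32 and len(ctrl) == 32
    out = []
    for i in range(32):
        lane_base = (i // 16) * 16   # 0 for the low lane, 16 for the high lane
        c = ctrl[i]
        if c & 0x80:                 # top bit set: result byte is zeroed
            out.append(0)
        else:                        # only the low 4 bits index into the lane
            out.append(src[lane_base + (c & 0x0F)])
    return out

src = list(range(100, 132))          # byte i holds the value 100 + i
ctrl = [31] * 32                     # try to select byte 31 everywhere
result = vpshufb_256(src, ctrl)
# Index 31 is truncated to 15, so the low lane gets src[15], not src[31].
print(result[0], result[16])
```

In this model the low lane can never read byte 31, no matter what the control vector says. That is the kind of limitation a full-width shuffle network removes.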

Looking at power budgets, it's obvious that simply doubling the execution resources (i.e. supporting 512-bit instructions instead of 256-bit ones on every port) is impossible. In the previous CPU generation, even AVX instructions increased energy usage by 40%, so it's easy to predict that extending every executed instruction to 512 bits would require another 80% increase.

Also, it's easy to compare Skylake with Broadwell and see many strange changes:

1) Intel microarchitectures implement SIMD instructions on ports 0/1/5 and usually try to spread them equally among these 3 ports to increase overall performance. But Skylake is much more asymmetric in that regard: it implements everything except the bit shuffle instructions on ports 0 and 1.

2) Skylake implements instructions on BOTH ports 0 and 1 with maniacal diligence, including such rarely used ones as PMUL and PCMPGTQ. As a result, PCMPGTQ throughput was quadrupled! And PMUL is now supported by 2 ports, while scalar MUL is supported by only one. You will find many more examples; only extremely expensive instructions like division did not get doubled throughput.

3) When Intel added AVX/AVX2 in SB/HW (Sandy Bridge/Haswell), it decreased the throughput of some instructions. For example, Nehalem had double issue for both bit shuffle and bit-combine instructions, while SB/HW reduced their throughput to 1. So if Skylake was going to add AVX-512 support, one would expect it to do the same (i.e. reduce the throughput of rarely used instructions), again to save power and transistor budget. But in practice, it doubled the throughput of many instructions while keeping single-issue throughput for shuffles. The idea that ports 0 and 1 will co-execute 512-bit instructions, while port 5 will extend all its instructions to 512 bits, excellently explains why this was done, while the idea that everything will simply be extended to 512 bits fails miserably.

So once I had read the Intel optimization manual and thought about it for a while, it became obvious. Moreover, I believe that Skylake implemented all 4 ISA extensions that Intel had announced (SGX/MPX/SHA/AVX3), but they were not enabled earlier due to marketing and market-segmentation requirements. Intel just needs a counter-weapon against Ryzen, so it did not show all of Skylake's strength in 2015, when its position was already strong.

---------

Of course, microarchitecture analysis can't say anything about instructions absent from the AVX2 set, so my guess is that predicate register manipulations will also go to port 5, just to make the microarchitecture a bit less asymmetric.

Also, it's easy to predict that in the next generations the first "improvement" will be to add FMA capability to port 5, further doubling the marketing performance figures.
