More than a year ago, I wrote you that Skylake will have single-issue AVX-512. Just a bit more detail on what led me to this conclusion (partly copied from my post on Anand):

I can give you details about AVX-512; they are pretty obvious from an analysis of Skylake's execution ports. So:

1) AVX-512 is mainly single-issue. All the AVX commands that are now supported on BOTH port 0 & port 1 will become AVX-512 commands supported on the joined port 0+1.
2) A few commands that are supported only on port 5 (these are various bit shuffles) will also be single-issued in AVX-512, which still means doubled performance: from single-issued AVX-256 to single-issued AVX-512.
3) A few commands that can be issued on any of the 3 ports (0, 1, 5), including booleans and add/sub/cmp, the so-called PADD group, will be double-issued in AVX-512, so they will get a 33% uplift overall (two 512-bit issues per cycle instead of three 256-bit ones).

Ports 0 & 1 will join when executing 512-bit commands, while port 5 is extended to 512-bit operands. The joined port 0&1 can execute almost any AVX-512 command except the bit shuffle ones; port 5 can execute bit shuffles and the PADD group.

---------

When going from SSE to AVX, Intel sacrificed ease of programming for ease of hardware implementation, resulting in an almost complete lack of commands that can exchange data between the upper & lower halves of a ymm register (the so-called lanes). AVX-512 was done right, but this means that bit shuffle commands require a full 512-bit mesh. So Intel moved all these commands to port 5, making it the only full 512-bit port, while most of the remaining commands were moved to ports 0 & 1, where a 512-bit command can be implemented as a simple pair of 256-bit ones.

Looking at power budgets, it's obvious that simple doubling of execution resources (i.e. support of 512-bit commands instead of 256-bit ones) is impossible.
In the previous CPU generation, even AVX commands increased energy usage by 40%, so it's easy to predict that extending each executed command to 512 bits would require another 80% increase.

Also, it's easy to compare Skylake with Broadwell and see many strange changes:

1) Intel's m/a implements SIMD commands on ports 0/1/5 and usually tries to spread commands equally among these 3 ports to increase final performance. But Skylake is much more asymmetric in that regard: it implements all but the bit shuffle commands on ports 0 & 1.
2) Skylake tries to implement commands on BOTH ports 0 & 1 with maniacal diligence, including such rarely used commands as PMUL and PCMPGTQ. As a result, PCMPGTQ throughput was quadrupled! And PMUL is now supported by 2 ports, while scalar MUL is supported on only one. You will find many more examples, while only extremely expensive commands like division didn't get doubled throughput.
3) When Intel added AVX/AVX2 in SB/HW, it decreased the throughput of some commands: e.g. Nehalem had double-issue for both bit shuffle and bit-combine commands, while SB/HW reduced their throughput to 1. So, if Skylake was going to add AVX-512 support, it might be expected to do the same (i.e. reduce the throughput of rarely used commands), again to reduce the power/transistor budget. But in practice, it doubled the throughput of many commands while keeping single-issue throughput for shuffles.

The idea that ports 0 & 1 will co-execute 512-bit commands, while port 5 will extend all its commands to 512 bits, excellently explains why this was done, while the idea that everything will just be extended to 512 bits fails miserably.

So, once I read the Intel optimization manual and thought for a while, it became obvious. Moreover, I believe that Skylake implemented all 4 ISA extensions that Intel had been marketing (SGX/MPX/SHA/AVX3), but they were not enabled earlier due to marketing/market-slicing requirements.
Intel simply needed a counter-weapon against Ryzen, so it didn't show all of Skylake's strength in 2015, when its position was already strong.

---------

Of course, m/a analysis can't say anything about commands absent from the AVX2 set, so my guess is that predicate register manipulations will also go to port 5, just to make the m/a a bit less asymmetric.

Also, it's easy to predict that in the next generations the first "improvement" will be to add FMAD capability to port 5, further doubling the marketing performance figures.
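
To make the arithmetic behind the three numbered points on AVX-512 issue width easy to check, here is a minimal Python sketch. It only encodes the port model guessed in this post (ports 0 & 1 fuse into one 512-bit unit, port 5 is widened to 512 bits) and counts peak bits issued per cycle; the group names, the `p01` label and the function names are made up for illustration, and nothing here is measured or taken from Intel documentation.

```python
# Back-of-envelope model of the port guess in this post (not measured data).
AVX2_WIDTH = 256    # bits per command issued on one port with AVX2
AVX512_WIDTH = 512  # bits per command issued on one unit with AVX-512

# Ports that can run each command group with 256-bit AVX2, as described in the post.
AVX2_PORTS = {
    "most commands (ports 0 & 1)": ["p0", "p1"],
    "bit shuffles (port 5 only)": ["p5"],
    "PADD group (ports 0, 1, 5)": ["p0", "p1", "p5"],
}


def avx512_units(ports):
    """Assumed mapping: p0 and p1 fuse into one unit ('p01'), p5 stays separate."""
    return {"p01" if p in ("p0", "p1") else p for p in ports}


def predicted_uplift():
    """Return {group: (avx2_bits_per_cycle, avx512_bits_per_cycle, ratio)}."""
    result = {}
    for group, ports in AVX2_PORTS.items():
        before = len(ports) * AVX2_WIDTH
        after = len(avx512_units(ports)) * AVX512_WIDTH
        result[group] = (before, after, after / before)
    return result


if __name__ == "__main__":
    for group, (before, after, ratio) in predicted_uplift().items():
        print(f"{group}: {before} -> {after} bits/cycle ({ratio - 1:+.0%})")
```

Under these assumptions the model gives no change in peak width for the commands that live on ports 0 & 1, a doubling for the port-5 shuffles, and roughly +33% for the PADD group (768 to 1024 bits per cycle), which is where the 33% figure comes from.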