I think we will see that in a few weeks :) Please keep this message so we can compare it to the facts.

I have no insider info, just thorough knowledge of all these microarchitectures from your manuals and Intel's. As you see, my analysis rests on the strange aspects of the SKL microarchitecture, and the proposed implementation explains them all.

SKL doubled, and sometimes even quadrupled, the throughput of many instructions in order to make ports 0&1 highly symmetric. That doesn't make any sense other than as preparation for these ports to perform 512-bit instructions in tandem. SKL also moved everything except shuffles to ports 0&1. I think that is because shuffles are the only instructions that cannot be split into two 256-bit sub-operations, so only they require a port with full 512-bit capability, and Intel dedicated port 5 to that task.

Yes, my explanation is highly speculative, but I don't see any other explanation for all these changes, which made 256-bit AVX execution less efficient (because most instructions are now executed only by ports 0&1), nor for why many rare instructions got higher throughput. If Intel planned to simply extend each 256-bit instruction to 512 bits, they would, on the contrary, reduce the throughput of rarely used instructions (as was done in SB/HW compared to Nehalem) and keep ports 0/1/5 equally populated.

Just one question: do you agree that the SKL changes compared to HW are strange, and either decrease performance (moving most instructions to ports 0&1) or add more hardware for a tiny speedup (implementing almost everything on BOTH ports 0&1)?

Btw, one hint is that Intel claims their 18-core CPU will exceed 1 TFLOPS. If SKL-X performs two 512-bit FMA instructions per CPU cycle, they could easily claim to break the 2 TFLOPS barrier.
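
For anyone who wants to check the TFLOPS hint, here is a rough back-of-the-envelope sketch. It assumes double precision (8 lanes per 512-bit vector) and counts one FMA as 2 floating-point operations per lane. No clock speed is assumed, since none is given above; instead it computes the sustained clock that would be needed to reach a given peak figure. The function names are made up for illustration.

```python
def flops_per_cycle_per_core(fma_per_cycle, vector_bits=512, element_bits=64):
    """Peak flops one core can retire per cycle.

    Assumption: each FMA counts as 2 flops (multiply + add) per lane.
    element_bits=64 means double precision; use 32 for single precision.
    """
    lanes = vector_bits // element_bits
    return fma_per_cycle * lanes * 2


def required_clock_ghz(target_tflops, cores, fma_per_cycle,
                       vector_bits=512, element_bits=64):
    """Sustained clock (GHz) needed to reach target_tflops at peak FMA rate."""
    per_cycle = cores * flops_per_cycle_per_core(
        fma_per_cycle, vector_bits, element_bits)
    return target_tflops * 1e12 / per_cycle / 1e9


if __name__ == "__main__":
    cores = 18  # the 18-core part mentioned in the post
    for fma in (1, 2):          # 512-bit FMAs issued per cycle per core
        for target in (1.0, 2.0):  # peak TFLOPS claim to test
            ghz = required_clock_ghz(target, cores, fma)
            print(f"{fma} x 512-bit FMA/cycle, {target} TFLOPS: "
                  f"needs about {ghz:.2f} GHz")
```

Under these assumptions the clock required for 1 TFLOPS with one 512-bit FMA per cycle is the same as the clock required for 2 TFLOPS with two, so how easily either figure is reached depends on the sustained AVX-512 clock and on whether the claim is for single or double precision (pass `element_bits=32` for the single-precision case).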