Agner's CPU blog


Test results for Broadwell and Skylake
Author: Bulat Ziganshin Date: 2017-06-21 12:49
The Russian review at https://3dnews.ru/954174 has, as usual, more thorough low-level benchmarks than AnandTech. One test is particularly important: https://3dnews.ru/assets/external/illustrations/2017/06/19/954174/avx-512.png

As we can see there, FP computations got an almost 2x speedup, while INT got only a 20-40% improvement.

I think the last result lines up perfectly with my prediction: port 5 was extended to 512 bits, so bit shuffling becomes 2x faster, and the PADD group gets a 33% boost. I expected a 10-20% overall speedup, but the new AVX512 features (new instructions, built-in masking) probably improved performance further.
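To make the 2x and 33% figures concrete, here is a rough back-of-the-envelope sketch. The port model is an assumption for illustration, not something measured in the review: in 256-bit mode PADD is taken to issue on ports 0, 1 and 5 and shuffles on port 5 only, while in 512-bit mode ports 0 and 1 are taken to fuse into one 512-bit unit next to a 512-bit port 5. The names `bits_per_cycle` and `speedup` are hypothetical helpers.

```python
from fractions import Fraction

# Assumed vector widths (in bits) of the ports that can execute each
# instruction group. These port assignments are an illustrative model,
# not data taken from the 3dnews benchmark.
PADD_AVX2 = [256, 256, 256]    # ports 0, 1, 5
PADD_AVX512 = [512, 512]       # fused ports 0+1, and port 5
SHUFFLE_AVX2 = [256]           # port 5 only
SHUFFLE_AVX512 = [512]         # port 5 widened to 512 bits


def bits_per_cycle(port_widths):
    """Peak vector bits processed per cycle if every listed port is busy."""
    return sum(port_widths)


def speedup(before, after):
    """Ratio of peak throughput after/before, kept exact as a fraction."""
    return Fraction(bits_per_cycle(after), bits_per_cycle(before))


padd_gain = speedup(PADD_AVX2, PADD_AVX512)           # 1024 / 768
shuffle_gain = speedup(SHUFFLE_AVX2, SHUFFLE_AVX512)  # 512 / 256

print(f"PADD: {float(padd_gain):.2f}x, shuffle: {float(shuffle_gain):.2f}x")
```

Under this model PADD goes from 768 to 1024 bits per cycle (a factor of 4/3) and shuffles from 256 to 512 bits per cycle (a factor of 2). These are peak figures only; real code is also limited by loads, stores and clock throttling.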

My last prediction was: "also it's easy to predict that in the next generations the first "improvement" will be to add FMAD capability to port 5, further doubling the marketing performance figures"

I didn't expect it in the Skylake generation because of the excessive TDP increase (as we know, even using AVX2 on previous generations increased TDP by 40%, so two full-featured AVX512 ports should *further* increase TDP by 80%!). Nevertheless, they have done exactly that, and got the entirely expected TDP problems.

Note that from the 3dnews test we can conclude that port 5 gained only an FMA engine, but no other AVX512 instructions (apart from the mere widening of the AVX2 instructions already present on this port).

So I can say that my speculation turned out to be 200% right :)


But reviewing everything we know, it seems that from a technical point of view, Skylake is a total mess! The SKL architecture I predicted was a compromise: it added as little hardware unused in 256-bit AVX mode as possible, but still had AVX512 support. It was a great step toward future processors: add 512-bit support for forward compatibility, but don't invest heavily in AVX512-only hardware until more 512-bit programs arrive. To reach this goal, they made some changes that were bad for AVX2 programs (see my second post).

But when they added the second FMA512 engine, this became meaningless. Now we have a design that both limits AVX2 performance and has a lot of hardware unused in AVX2 mode! By simply widening the Haswell engines 2x, we could have got only a slightly higher transistor count and much better AVX512 performance.

I think this is the result of marketing games: SKL-S already had AVX512 support (without the second FMA engine, though), but they decided to disable it on all SKUs. The newer SKL-X added the second engine, but enabled it only on selected SKUs, so the i7 provides exactly the architecture I predicted (and it was probably their Plan B: use SKL-S cores with a single FMA engine for HEDT/Xeon products).

Now we can also see why SKL-S reduced L2$ associativity to 4. It was preparation for increasing the cache size: the SKL-S cache is just a quarter of the SKL-X cache with the same organization, and the reduced associativity allowed them to reduce the transistor budget of the massive 1MB cache. This is a sign that SKL-X is a much smaller modification of the SKL-S core than one might think at first sight.
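The "same organization" point can be checked with simple arithmetic. The sketch below assumes 64-byte cache lines and a 16-way 1MB L2 on SKL-X; neither number is stated in this post, so treat them as assumptions. The 256KB size for SKL-S follows from "a quarter of" 1MB. `num_sets` is a hypothetical helper.

```python
LINE_BYTES = 64  # assumed cache line size


def num_sets(size_bytes, ways, line_bytes=LINE_BYTES):
    """Number of sets in a set-associative cache: size / (ways * line size)."""
    return size_bytes // (ways * line_bytes)


# SKL-S: 256KB (a quarter of 1MB), 4-way as stated in the post.
skl_s_sets = num_sets(256 * 1024, ways=4)

# SKL-X: 1MB as stated in the post; 16-way is an assumption.
skl_x_sets = num_sets(1024 * 1024, ways=16)

print(skl_s_sets, skl_x_sets)
```

Under these assumptions both caches have 1024 sets, so the index bits stay the same and the larger cache differs only in having four times as many ways per set.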

 
replythread Test results for Broadwell and Skylake - Bulat Ziganshin - 2017-06-21