Agner's CPU blog


Test results for Broadwell and Skylake
Author: Tacit Murky Date: 2016-03-09 20:58
More stuff. Have you measured the total throughput of immediate data? The AIDA64 readout is inconsistent and may be erroneous. Things to consider:
1) The legacy decoder should have a different throughput than the µop cache; the IDQ may or may not impose its own restrictions.
2) As is known for SB and IB (but it may not be true for Haswell and newer CPUs; it would be cool to test all of them), a µop-cache slot has 4 bytes of data shared by the imm and ofs fields, so if there is an 8-byte constant, or the total length of the imm and ofs constants is >4 bytes, 2 entries are allocated for that µop. The literal pool in the scheduler may have its own restrictions on port count (3…6) and width (4 or 8 bytes).
3) Instructions of interest:
- MOV r32/64,imm32/64 : 4/8 bytes of literals per instruction at a max. throughput of 4 IPC (ideally that should be 16/32 bytes/clock);
- ADD r32,imm32 : 4 bytes of literals per instruction at a max. throughput of 4 IPC;
- BLENDPS/PD xmm,[r+ofs32],imm8 : 5 bytes of total literals per instruction at a max. throughput of 3 IPC, but only 2 L1D reads/clock; the third blend may be replaced with MOVAPS [r+ofs32],xmm , giving 5+5+4=14 bytes of literals at 3 IPC (but 5 µops).
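
To make the arithmetic in items 2 and 3 easier to check, here is a rough sketch in Python. It only encodes the slot rule as stated above for SB/IB (4 bytes per slot is taken from the post and may not hold on Haswell and newer) and sums literal bytes for a group of instructions issued in one clock. The function names `uop_cache_entries` and `literal_bytes_per_clock` are made up for illustration, and it does not measure anything on real hardware.

```python
def uop_cache_entries(imm_bytes, ofs_bytes, slot_bytes=4):
    """µop-cache entries for one µop under the SB/IB rule from item 2.

    slot_bytes=4 is the figure quoted in the post; it is an assumption
    for any other microarchitecture.
    """
    has_8_byte_const = imm_bytes == 8 or ofs_bytes == 8
    if has_8_byte_const or imm_bytes + ofs_bytes > slot_bytes:
        return 2
    return 1


def literal_bytes_per_clock(literal_bytes_per_insn):
    """Total literal bytes for a group of instructions issued in one clock."""
    return sum(literal_bytes_per_insn)


if __name__ == "__main__":
    # (label, imm bytes, ofs bytes) for the instructions listed in item 3.
    cases = [
        ("MOV r32,imm32", 4, 0),
        ("MOV r64,imm64", 8, 0),
        ("ADD r32,imm32", 4, 0),
        ("BLENDPS xmm,[r+ofs32],imm8", 1, 4),
        ("MOVAPS [r+ofs32],xmm", 0, 4),
    ]
    for label, imm, ofs in cases:
        print(label, "literals:", imm + ofs, "entries:", uop_cache_entries(imm, ofs))

    # Ideal literal throughput at the IPC figures quoted in item 3.
    print("4x MOV imm32:", literal_bytes_per_clock([4] * 4), "bytes/clock")
    print("4x MOV imm64:", literal_bytes_per_clock([8] * 4), "bytes/clock")
    print("2x BLENDPS + MOVAPS:", literal_bytes_per_clock([5, 5, 4]), "bytes/clock")
```

Under this rule, the blend with a 32-bit offset and an 8-bit immediate would need 2 entries, the same as MOV with imm64, which is why those cases look like the interesting ones to measure.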
 