Agner's CPU blog


Sustained 64B loads per cycle on Haswell & Sky
Author: Jens Nurmann  Date: 2017-01-12 02:40
Nathan Kurz wrote:

...

The write operations may sometimes use port 2 or 3 for address calculation, where the maximum throughput requires that they use port 7.
I don't recall if you mention it in your manuals, but I presume you are aware that Port 7 on Haswell and Skylake is only capable of "simple" address calculations? Thus sustaining 2 loads and a store is only possible if the store address is [const + base] form rather than [const + index*scale + base]. And as you point out, even if you do this, it can still be difficult to force the processor to use only Port 7 for the store address.
I know I am very late in responding to this, but I think Skylake can be "hinted" to some extent toward using port 7, at least for GPR-based code. Consider the following, the core loop of a long addition routine:

; Comment columns, presumably as in Agner's instruction tables:
; fused-domain uops, unfused-domain uops, ports, latency, reciprocal throughput

.Loop:

mov Limb0, [Op1]       ;1 1 p23 2 0.5
adc Limb0, [Op2]       ;2 2 p06 p23 1
mov [Op3], Limb0       ;1 2 p237 p4 3 1
mov Limb1, [Op1+8]     ;1 1 p23 2 0.5
adc Limb1, [Op2+8]     ;2 2 p06 p23 1
mov [Op3+8], Limb1     ;1 2 p237 p4 3 1
mov Limb2, [Op1+16]    ;1 1 p23 2 0.5
adc Limb2, [Op2+16]    ;2 2 p06 p23 1
mov [Op3+16], Limb2    ;1 2 p237 p4 3 1
mov Limb3, [Op1+24]    ;1 1 p23 2 0.5
adc Limb3, [Op2+24]    ;2 2 p06 p23 1
mov [Op3+24], Limb3    ;1 2 p237 p4 3 1

mov Limb0, [Op1+32]    ;1 1 p23 2 0.5
adc Limb0, [Op2+32]    ;2 2 p06 p23 1
mov [Op3+32], Limb0    ;1 2 p237 p4 3 1
mov Limb1, [Op1+40]    ;1 1 p23 2 0.5
adc Limb1, [Op2+40]    ;2 2 p06 p23 1
mov [Op3+40], Limb1    ;1 2 p237 p4 3 1
mov Limb2, [Op1+48]    ;1 1 p23 2 0.5
adc Limb2, [Op2+48]    ;2 2 p06 p23 1
mov [Op3+48], Limb2    ;1 2 p237 p4 3 1
mov Limb3, [Op1+56]    ;1 1 p23 2 0.5
adc Limb3, [Op2+56]    ;2 2 p06 p23 1
mov [Op3+56], Limb3    ;1 2 p237 p4 3 1

; lea advances the pointers without touching the flags
lea Op1, [Op1+64]      ;1 1 p15 1 0.5
lea Op2, [Op2+64]      ;1 1 p15 1 0.5
lea Op3, [Op3+64]      ;1 1 p15 1 0.5

.Check:

dec Size1              ; dec leaves CF unchanged, so the carry survives into the next iteration
jne .Loop
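
As a reading aid, the arithmetic performed by the assembly loop can be written in portable C roughly as follows. This is only a sketch and not the routine from the post: the function name add_limbs, its signature, and the least-significant-limb-first layout are assumptions, and a compiler is not guaranteed to turn it into the adc chain shown in the assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (name and signature are assumptions, not from the post).
   Adds two n-limb numbers stored least significant limb first, writes the
   n-limb sum to dst, and returns the carry out of the top limb (0 or 1). */
static uint64_t add_limbs(uint64_t *dst, const uint64_t *a,
                          const uint64_t *b, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s  = a[i] + carry;  /* wraps to 0 if a[i] is all ones and carry is 1 */
        uint64_t c1 = s < carry;     /* carry out of a[i] + carry */
        uint64_t t  = s + b[i];
        uint64_t c2 = t < s;         /* carry out of adding b[i] */
        dst[i] = t;
        carry = c1 | c2;             /* the two carries can never both be set */
    }
    return carry;
}
```

Each loop iteration here corresponds to one mov/adc/mov triple in the assembly; the assembly version keeps the carry in CF instead of a variable, which is why it has to avoid flag-clobbering instructions between limbs.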

On my Skylake system the assembly loop above executes in 817 cycles for Size1=683 (measured with RDTSCP). If I insert a "vpblend YMM0, YMM0, YMM0, 0" after "mov [Op3], Limb0", the execution time drops to 698 cycles, reproducibly! This seems to imply that port 7 is then always used correctly for the store address. I have not yet tested whether a similar scheme, inserting a carefully chosen GPR instruction inside an AVX2 loop, yields similar results.
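
To put the two measurements in proportion, here is a small worked calculation. It is a sketch under one stated assumption: Size1=683 is read as the number of 64-bit limbs processed rather than the number of loop iterations, which the post does not say explicitly. Both helper names are hypothetical. Note also that RDTSCP counts time-stamp-counter ticks, which need not equal core clock cycles when the core frequency differs from the TSC frequency.

```c
/* Both helpers are hypothetical, introduced only for this illustration. */

/* Average cost per unit of work, e.g. RDTSCP ticks per limb. */
static double cycles_per_unit(double cycles, double units)
{
    return cycles / units;
}

/* Fraction of the original cycle count saved by the faster variant. */
static double relative_saving(double before, double after)
{
    return (before - after) / before;
}
```

With the figures from the post, cycles_per_unit(817, 683) is about 1.20 and cycles_per_unit(698, 683) is about 1.02, while relative_saving(817, 698) is about 0.146, that is, roughly 15% fewer cycles with the extra instruction inserted. The relative saving does not depend on how Size1 is interpreted.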

 