Nathan Kurz wrote: ...
The write operations may sometimes
use port 2 or 3 for address calculation, where the
maximum throughput requires that they use port 7.
I don't recall if you mention it in your
manuals, but I presume you are aware that Port 7 on
Haswell and Skylake is only capable of
"simple" address calculations? Thus
sustaining 2 loads and a store is only possible if the
store address is [const + base] form rather than
[const + index*scale + base]. And as you point out,
even if you do this, it can still be difficult to
force the processor to use only Port 7 for the store
address.
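The rule in the quote can be written down as a small model. This is only a sketch in Python: the function names are hypothetical, and it encodes nothing beyond what the quote states (port 7 handles only non-indexed store addresses, and two loads per cycle keep ports 2 and 3 busy).

```python
# Hypothetical helper names; this models the quoted rule, not measured behaviour.

def store_address_ports(has_index):
    """Ports able to run the store-address uop on Haswell/Skylake, per the quote."""
    # Port 7 only does "simple" address generation: [const + base] qualifies,
    # [const + index*scale + base] does not.
    return {2, 3} if has_index else {2, 3, 7}


def can_sustain_two_loads_and_store(has_index):
    """True if a store can issue alongside two loads in the same cycle."""
    # The two loads occupy ports 2 and 3, so the store address needs port 7.
    return 7 in store_address_ports(has_index)


print(can_sustain_two_loads_and_store(has_index=False))  # [const + base]
print(can_sustain_two_loads_and_store(has_index=True))   # [const + index*scale + base]
```

Note that the model only says port 7 is eligible for the simple form; as the quote points out, the scheduler may still pick port 2 or 3.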
I know I am really late in responding to this, but I think that Skylake can be "hinted" somewhat on the use of port 7, at least for GPR-based code. Consider the following (the core loop of a long addition routine):

.Loop: mov Limb0, [Op1] ;1 1 p23 2 0.5
adc Limb0, [Op2] ;2 2 p06 p23 1
mov [Op3], Limb0 ;1 2 p237 p4 3 1
mov Limb1, [Op1+8] ;1 1 p23 2 0.5
adc Limb1, [Op2+8] ;2 2 p06 p23 1
mov [Op3+8], Limb1 ;1 2 p237 p4 3 1
mov Limb2, [Op1+16] ;1 1 p23 2 0.5
adc Limb2, [Op2+16] ;2 2 p06 p23 1
mov [Op3+16], Limb2 ;1 2 p237 p4 3 1
mov Limb3, [Op1+24] ;1 1 p23 2 0.5
adc Limb3, [Op2+24] ;2 2 p06 p23 1
mov [Op3+24], Limb3 ;1 2 p237 p4 3 1

mov Limb0, [Op1+32] ;1 1 p23 2 0.5
adc Limb0, [Op2+32] ;2 2 p06 p23 1
mov [Op3+32], Limb0 ;1 2 p237 p4 3 1
mov Limb1, [Op1+40] ;1 1 p23 2 0.5
adc Limb1, [Op2+40] ;2 2 p06 p23 1
mov [Op3+40], Limb1 ;1 2 p237 p4 3 1
mov Limb2, [Op1+48] ;1 1 p23 2 0.5
adc Limb2, [Op2+48] ;2 2 p06 p23 1
mov [Op3+48], Limb2 ;1 2 p237 p4 3 1
mov Limb3, [Op1+56] ;1 1 p23 2 0.5
adc Limb3, [Op2+56] ;2 2 p06 p23 1
mov [Op3+56], Limb3 ;1 2 p237 p4 3 1

lea Op1, [Op1+64] ;1 1 p15 1 0.5
lea Op2, [Op2+64] ;1 1 p15 1 0.5
lea Op3, [Op3+64] ;1 1 p15 1 0.5
.Check: dec Size1
jne .Loop

On my Skylake system it executes in 817 cycles for Size1=683 (measured with RDTSCP). If I insert a "vpblend YMM0, YMM0, YMM0, 0" after "mov [Op3], Limb0", the execution time consistently drops to 698 cycles! This seems to imply that port 7 is then always used correctly for the write. So far I haven't tried whether a similar scheme (inserting a carefully chosen GPR opcode inside an AVX2 loop) yields similar results.
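To put the two measurements in perspective, here is a short Python sketch of the arithmetic. It rests on an assumption the post does not state: that Size1=683 is the number of 64-bit limbs added, not the number of trips through the unrolled loop.

```python
# Assumption (not stated in the post): Size1 counts 64-bit limbs.
SIZE1 = 683
CYCLES_PLAIN = 817    # measured without the extra vector instruction
CYCLES_HINTED = 698   # measured with it inserted after the first store


def cycles_per_limb(cycles, limbs=SIZE1):
    """Average cycles spent per limb for one measured run."""
    return cycles / limbs


plain = cycles_per_limb(CYCLES_PLAIN)
hinted = cycles_per_limb(CYCLES_HINTED)
saving = 1 - CYCLES_HINTED / CYCLES_PLAIN

print(f"{plain:.3f} {hinted:.3f} {saving:.1%}")  # prints "1.196 1.022 14.6%"
```

Under that assumption the hinted loop runs at about 1.02 cycles per limb. If the adc carry chain limits the loop to roughly one cycle per limb, that leaves little room for further gains from port scheduling.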