Your microarch manual says that store-forwarding latency is 5c on Skylake for operand sizes other than 32/64 bits. I can confirm 5c for 128b vectors, but I've found that 256b store-forwarding is 6c on Skylake. Your instruction tables already reflect this, so it's just a wording error in the microarch guide.

Also, the instruction tables say that splitting the store-forwarding latency between stores and loads is arbitrary. I disagree: it would be nice if loads listed the L1 load-use latency (from the address being ready to the data being ready). I don't think that's the case currently, e.g. you list mov r,m as 2c latency on Merom/Wolfdale/NHM/SnB, which is unreasonably low. This convention would only imply negative latency for stores on a CPU where store-forwarding is faster than the L1 load-use latency, and I don't think that's the case on any x86 microarchitecture.

----

While testing this on HSW and SKL, I found something weirder: an AVX 128b load into an xmm register (zero-extending to 256b) has an extra 1c of latency when read by a 256b instruction.
- SKL: 12c for 3x dependent vmulps (xmm or ymm). HSW: 15c.
- 17c for 3x vmulps xmm and store/reload xmm. HSW: 21c. SF = 5c/6c.
- 18c for 3x vmulps ymm and store/reload xmm. HSW: 21c. SF = 6c/6c, or is it 5+1c?
- 18c for 3x vmulps xmm and store/reload ymm. HSW: 22c. SF = 6c/7c.
- 18c for 3x vmulps ymm and store/reload ymm. HSW: 22c. SF = 6c/7c.
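To be explicit about where the SF numbers come from: I'm subtracting the mul dependency-chain latency from the measured total. E.g. for the xmm mul + xmm store/reload case:

```
SKL: 17c total - 12c (3x vmulps at 4c each) = 5c store-forward round trip
HSW: 21c total - 15c (3x vmulps at 5c each) = 6c
```

(The split of that round trip between the store and the load is what I'm arguing about above.)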
vxorps  xmm0, xmm0, xmm0
.loop:
    vmulps  ymm0, ymm0, ymm0
    vmulps  ymm0, ymm0, ymm0
    vmulps  ymm0, ymm0, ymm0
    vmovaps [rdi], xmm0   ; This is the weird case for SKL: xmm store/reload with ymm FPU
    vmovaps xmm0, [rdi]
    dec     ecx
    jnz     .loop
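For the ymm store/reload rows, only the store/reload width changes (same loop structure, sketched here for reference):

```nasm
vxorps  xmm0, xmm0, xmm0
.loop:
    vmulps  ymm0, ymm0, ymm0
    vmulps  ymm0, ymm0, ymm0
    vmulps  ymm0, ymm0, ymm0
    vmovaps [rdi], ymm0   ; 256b store
    vmovaps ymm0, [rdi]   ; 256b reload: SF = 6c on SKL, 7c on HSW
    dec     ecx
    jnz     .loop
```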
Also strange: with the mulps instructions commented out, SKL runs the loop at only ~6.2c to 6.9c per iteration for *just* a ymm store->reload with no ALU work, rather than the expected 6.0c. So is there a limit to how often a 256b store-forward can happen? With xmm store/reload (and just the dec/jnz), the loop runs at one iteration per 5.0c best case, sometimes as high as 5.02c per iter.

Same pattern for integer vectors: SKL doesn't benefit from narrowing the store/reload to xmm when the ALU loop is using ymm:

- 9c for 3x vpermd ymm (SKL and HSW).
- 15c for that + store/reload xmm (SKL and HSW). SF latency = 6c (or 5+1c / 6c?).
- 15c for that + store/reload ymm on SKL, 16c HSW (movaps or movdqa). SF lat = 6c SKL, 7c HSW.
- 3c for 3x vpunpckldq ymm or xmm (SKL/HSW).
- 8.08 to 8.23c for vpunpck xmm + store/reload xmm. 9c HSW. SF = 5.15c / 6c (stabilizes to 5c / 6c with a longer ALU dependency chain between store and reload).
- 9c for vpunpck ymm + store/reload xmm (SKL). 9c HSW. SF = 5+1c? / 6c.
- 9c for vpunpck xmm + store/reload ymm. 10c HSW. SF = 6c / 7c.
- 9c for vpunpck ymm + store/reload ymm (SKL). 10c HSW. SF = 6c / 7c.

Using vmovaps vs. vmovdqa made no difference for either the ivec or the FPU instructions. rdi points to a 64B-aligned buffer in the BSS.

So I'm seeing unstable results on SKL for a 128b store-forward with only 3c of ALU latency between the load and the next store to the same address. Inserting more shuffles, so that fewer store-forwards need to be kept in flight, stabilizes things to the expected 5.0c store-forwarding latency. HSW doesn't have that problem.

If the first shuffle is xmm and the rest are ymm, then the xmm store/reload has only 5c latency on SKL. So there's no extra latency when an ALU instruction does the zero-extension, but there is for a load?
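That mixed-width case looks like this (sketched from memory, same structure as the FP loop above: the first shuffle writes xmm, which zero-extends the full ymm, the rest are ymm):

```nasm
.loop:
    vpunpckldq xmm0, xmm0, xmm0   ; xmm ALU op zero-extends into ymm0
    vpunpckldq ymm0, ymm0, ymm0
    vpunpckldq ymm0, ymm0, ymm0
    vmovdqa [rdi], xmm0           ; xmm store/reload: 5c SF on SKL here,
    vmovdqa xmm0, [rdi]           ; not the 6c (5+1c?) seen with an all-ymm chain
    dec     ecx
    jnz     .loop
```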