Agner's CPU blog


32B store-forwarding is slower than 16B
Author: Peter Cordes   Date: 2017-05-11 10:37
Your microarch manual says that store-forwarding latency is 5c on Skylake for operand sizes other than 32/64b. I can confirm 5c for 128b vectors, but I've found that 256b store-forwarding is 6c on Skylake. I see your instruction tables already reflect this, so it's just a wording error in the microarch guide.

Also, in your instruction tables, you say that splitting up the store-forwarding latency between stores and loads is arbitrary. I disagree: it would be nice if loads listed the L1 load-use latency (from address being ready to data being ready). I don't think this is the case currently (e.g. you list Merom/Wolfdale/NHM/SnB's mov r,m as 2c latency, which is unreasonably low).
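
The kind of test I mean for bare L1 load-use latency is a pointer-chasing loop where each load's address comes from the previous load's data, something like this (the buffer just holds its own address, so the pointer value never actually changes):

mov [rdi], rdi        ; the buffer holds its own address
.chase:
mov rdi, [rdi]        ; each load's address depends on the previous load's result
mov rdi, [rdi]
mov rdi, [rdi]
mov rdi, [rdi]
dec ecx
jnz .chase

Cycles per iteration divided by 4 gives the load-use latency (4c on SKL for a simple [reg] addressing mode, IIRC).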

If there are any CPUs where store-forwarding is faster than L1 load-use latency, that would mean negative latency for stores. But that's not the case on any x86 microarchitecture, I think.

----

While testing this on HSW and SKL, I found something weirder: an AVX 128b load into an xmm register (zero-extending to 256b) has an extra 1c of latency when read by a 256b instruction.


  • 12c for 3x dependent vmulps (xmm or ymm) on SKL; 15c on HSW.
  • 17c for 3x vmulps xmm + store/reload xmm; HSW: 21c. SF = 5c / 6c (SKL / HSW).
  • 18c for 3x vmulps ymm + store/reload xmm; HSW: 21c. SF = 6c / 6c (or is the SKL case 5+1c?).
  • 18c for 3x vmulps xmm + store/reload ymm; HSW: 22c. SF = 6c / 7c.
  • 18c for 3x vmulps ymm + store/reload ymm; HSW: 22c. SF = 6c / 7c.



vxorps xmm0,xmm0,xmm0      ; zero ymm0 (a 128b VEX op zero-extends to 256b)
.loop:
vmulps ymm0, ymm0,ymm0     ; 3x dependent multiplies: 12c on SKL, 15c on HSW
vmulps ymm0, ymm0,ymm0
vmulps ymm0, ymm0,ymm0
vmovaps [rdi], xmm0 ; This is the weird case for SKL: xmm store/reload with ymm FPU
vmovaps xmm0, [rdi]
dec ecx
jnz .loop
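
(The other rows in the list above only change the width of the multiplies and/or the store/reload; e.g. for the all-ymm store/reload case the two vmovaps lines become something like:)

vmovaps [rdi], ymm0   ; 256b store
vmovaps ymm0, [rdi]   ; 256b reload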

Also strange: with the vmulps instructions commented out, I'm seeing SKL run the loop at only ~6.2c to 6.9c per iteration for *just* ymm store->reload with no ALU, rather than the expected 6.0c. So is there a limit to how often a 256b store-forward can happen? With xmm store/reload (and just the dec/jnz), the loop runs at one per 5.0c best case, sometimes as high as 5.02c per iter.
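
(That is, just the store->reload round trip as the only dependency chain, something like:)

.sfloop:
vmovaps [rdi], ymm0   ; store-forwarding is the only dep chain here
vmovaps ymm0, [rdi]
dec ecx
jnz .sfloop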

Same pattern for integer vectors: SKL doesn't benefit from narrowing the store/reload to xmm when the ALU loop is using ymm. (Test loop sketched after the numbers below.)

9c for 3x vpermd ymm SKL and HSW
15c for that + store/reload xmm (SKL and HSW). SF latency = 6c. (or 5+1c / 6c?)
15c for that + store/reload ymm SKL, 16c HSW. (movaps or movdqa). SF lat = 6c SKL, 7c HSW.

3c for 3x vpunpckldq ymm or xmm (SKL/HSW)
8.08 to 8.23c for vpunpck xmm + store/reload xmm. 9c HSW. SF=5.15c / 6c. (stabilizes to 5c / 6c with a longer ALU dependency chain between store/reload)
9c for vpunpck ymm + store/reload xmm (SKL). 9c HSW. SF=5+1c? / 6c
9c for vpunpck xmm + store/reload ymm. 10c HSW. SF=6c / 7c
9c for vpunpck ymm + store/reload ymm (SKL). 10c HSW. SF=6c / 7c
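
The vpermd loop is roughly this, with ymm1 holding an arbitrary shuffle control so the dependency chain runs through ymm0; the store/reload width is what varies between the lines above:

.permloop:
vpermd ymm0, ymm1, ymm0   ; 3c lane-crossing shuffle, 3x = 9c
vpermd ymm0, ymm1, ymm0
vpermd ymm0, ymm1, ymm0
vmovdqa [rdi], xmm0       ; or ymm for the 256b store/reload cases
vmovdqa xmm0, [rdi]
dec ecx
jnz .permloop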

Using vmovaps vs. vmovdqa made no difference for either the ivec or the FP instructions. rdi is pointing to a 64B-aligned buffer in the BSS.

So I'm seeing unstable results on SKL for 128b store-forwarding when there's only 3c of ALU latency between the load and the next store to the same address. Inserting more shuffles, so fewer store-forwards need to be in flight at once, stabilizes things: the store-forwarding latency settles at the expected 5.0c (e.g. the loop below). HSW doesn't have that problem.
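
Something like this, with a longer chain of 1c in-lane shuffles between the reload and the next store (the exact shuffle count may have been different):

.longloop:
vpunpckldq xmm0, xmm0, xmm0   ; extra 1c shuffles lengthen the ALU part of the chain
vpunpckldq xmm0, xmm0, xmm0
vpunpckldq xmm0, xmm0, xmm0
vpunpckldq xmm0, xmm0, xmm0
vpunpckldq xmm0, xmm0, xmm0
vmovdqa [rdi], xmm0
vmovdqa xmm0, [rdi]
dec ecx
jnz .longloop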

If the first shuffle is xmm and the others are ymm, then the xmm store/reload only has 5c latency on SKL (see the loop below). So there's no extra latency when an ALU instruction does the zero-extension, but there is when a load does it?
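
That case looks roughly like this: the 128b vpunpckldq zero-extends into ymm0 the same way the 128b load does, but the ymm shuffles reading it (and the xmm store/reload) don't pay an extra cycle:

.mixloop:
vpunpckldq xmm0, xmm0, xmm0   ; 128b ALU op, zero-extends into ymm0
vpunpckldq ymm0, ymm0, ymm0   ; 256b shuffles reading it: no extra latency
vpunpckldq ymm0, ymm0, ymm0
vmovdqa [rdi], xmm0           ; xmm store/reload: 5c SF on SKL in this case
vmovdqa xmm0, [rdi]
dec ecx
jnz .mixloop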

 