Agner's CPU blog

32B store-forwarding is slower than 16B

Author: Fabian Giesen

Date: 2017-06-28 18:33

I believe this is ultimately an artifact of how 256-bit vectors are implemented internally. Namely, I believe that the lower and upper 128-bit datapaths are offset from each other by 1 cycle (the upper datapath issues its half of the instruction one cycle later than the lower datapath does). [This helps when switching between true 256-bit execution and execution with each 256-bit op cracked into two 128-bit ops, because the timings of load operands etc. are the same in both cases; this should simplify the load path and the bypass network.]

This is also my leading guess for the explanation of the 3-cycle latency of operations that cross the 128-bit halves: the lower and upper 128 bits are not only skewed in time, they are also separate bypass domains. So cross-128b operations like vextracti128 potentially have 1 extra cycle of latency purely from the upper half of the input being available 1 cycle later than the lower half, and another extra cycle of cross-domain bypass delay to shuttle the result from the upper bypass domain to the lower datapath.
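The arithmetic behind that guess can be written down as a toy model. This is only a sketch of the hypothesis, not a description of the real hardware: the 1-cycle base latency, the constant names and the function `modeled_latency` are assumptions made for illustration.

```python
# Toy latency model for the hypothesis above. All names and constants are
# illustrative assumptions, not documented hardware parameters.
BASE_LATENCY = 1         # assumed latency of a simple in-lane op, in cycles
UPPER_HALF_SKEW = 1      # hypothesized: upper 128 bits are ready one cycle late
CROSS_DOMAIN_BYPASS = 1  # hypothesized: hop from upper to lower bypass domain


def modeled_latency(crosses_upper_to_lower: bool) -> int:
    """Return the modeled latency in cycles for one vector operation.

    An op that stays within its own 128-bit half pays only the base latency.
    An op that moves data from the upper half to the lower half (the
    vextracti128 case) also pays the skew and the bypass hop.
    """
    if not crosses_upper_to_lower:
        return BASE_LATENCY
    return BASE_LATENCY + UPPER_HALF_SKEW + CROSS_DOMAIN_BYPASS


if __name__ == "__main__":
    print("in-lane op:", modeled_latency(False), "cycle(s)")
    print("upper-to-lower op:", modeled_latency(True), "cycle(s)")
```

Under these assumptions the upper-to-lower case comes out at the 3 cycles mentioned above, while an in-lane operation stays at the assumed base latency.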

Anyway, all of this is speculation, but if it is correct, then while 256-bit stores have full throughput (when running in 256-bit mode, anyway), the second half of their data arrives in the designated store buffer slot one cycle later than the first, and the store buffer entry is only marked as "data available" (with the values available for forwarding) once both halves have arrived. Hence the extra 1-cycle forwarding delay.
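The store-buffer part of the speculation can be sketched the same way. Again, this is a minimal model under stated assumptions rather than a claim about the actual design: the function names, the cycle numbering and the rule that the lower half arrives in the issue cycle are all hypothetical.

```python
# Toy model of the store-buffer hypothesis above. Names and constants are
# illustrative assumptions, not documented hardware behavior.
UPPER_HALF_SKEW = 1  # hypothesized delay of the upper 128-bit datapath, in cycles


def data_available_cycle(issue_cycle: int, width_bytes: int) -> int:
    """Cycle at which a store's data could be forwarded in this model."""
    if width_bytes not in (16, 32):
        raise ValueError("model only covers 16B and 32B stores")
    lower_arrival = issue_cycle  # assumed: lower half arrives in the issue cycle
    if width_bytes == 16:
        return lower_arrival
    upper_arrival = issue_cycle + UPPER_HALF_SKEW
    # The entry is marked "data available" only once both halves are present.
    return max(lower_arrival, upper_arrival)


def extra_forwarding_delay() -> int:
    """Modeled forwarding penalty of a 32B store relative to a 16B store."""
    return data_available_cycle(0, 32) - data_available_cycle(0, 16)


if __name__ == "__main__":
    print("extra delay for 32B:", extra_forwarding_delay(), "cycle(s)")
    # Back-to-back 32B stores issued on consecutive cycles.
    print([data_available_cycle(cycle, 32) for cycle in range(4)])
```

In this model, 32B stores issued on consecutive cycles still become available one per cycle, so throughput is unaffected; only the point at which each store's data can be forwarded moves out by one cycle.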

32B store-forwarding is slower than 16B - Fabian Giesen - 2017-06-28
