Agner's CPU blog


Sustained 64B loads per cycle on Haswell & Sky
Author: T  Date: 2016-06-18 20:32
The aligned vs unaligned results make intuitive sense. In recent processors, the penalty for unaligned access has been steadily reduced: it went to zero on Sandy Bridge (and perhaps earlier), at least for loads that didn't cross a 64B cache-line boundary. In Haswell, even the latency penalty for crossing a 64B boundary disappeared, although only for loads, not stores. You can see this all graphically here:

blog.stuffedcow.net/2014/01/x86-memory-disambiguation/

The 2D charts are trying to get at the penalty of store-to-load forwarding, but the cells off of the main diagonal do a great job of showing the unaligned load/store penalties as well.

So you are finding that unaligned loads *still* have a penalty, even on Skylake - right? The key is loads that cross a 64B boundary. Fundamentally that requires bringing in two different lines from the L1 and merging the results, so you get a word composed of some bytes from one line and some from the other. The improvements culminating in Haswell reduced the latency of this operation to the point where it fits inside the standard 4 cycle latency for ideal L1 access, but they can't avoid the double bandwidth usage of such loads. In many algorithms, the maximum bandwidth of the L1 isn't approached (i.e., the loads-per-cycle are 1 or less), so unaligned access ends up the same as aligned. In your loop, however, you do saturate the load bandwidth, so loads that cross a 64B boundary will cut your throughput in half, or worse.
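To make the "crosses a 64B boundary" condition concrete, here is a minimal sketch. The helper names `crosses_line` and `lines_touched` are made up for illustration, and the only hardware fact assumed is the 64-byte line size mentioned above.

```python
LINE_BYTES = 64  # cache-line size assumed throughout this post


def lines_touched(addr: int, size: int, line: int = LINE_BYTES) -> int:
    """Number of cache lines covered by a load of `size` bytes at `addr`."""
    first = addr // line               # line holding the first byte
    last = (addr + size - 1) // line   # line holding the last byte
    return last - first + 1


def crosses_line(addr: int, size: int, line: int = LINE_BYTES) -> bool:
    """True if the load needs bytes from two different cache lines."""
    return lines_touched(addr, size, line) > 1


if __name__ == "__main__":
    # A 32-byte load is unaligned at any offset that is not a multiple of 32,
    # but only some of those offsets actually split across two lines.
    splitting = [off for off in range(LINE_BYTES) if crosses_line(off, 32)]
    print(len(splitting), "of", LINE_BYTES, "offsets split a 32B load")
```

Under this reading, a 32B load at offset 16 is unaligned but stays inside one line, while the same load at offset 48 needs two lines and so costs double the L1 bandwidth.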

That alone doesn't explain the results you got by inverting the load order, but perhaps some of that can be explained by how the loads "pair up". That is, two aligned loads can pair up in the same cycle since each only needs 1 of the 2 "load paths" from L1. An unaligned load needs both, however. So if you have a load pattern like AAUUAAUU (where A is an aligned load and U is unaligned) you get:

cycle loads
0 AA
1 U
2 U
3 AA
4 U
5 U
...

So you get 4 loads every 3 cycles, because the aligned loads are always able to pair.

On the other hand, if you have a load pattern like AUAUAUAUA, you get the following:

cycle loads
0 A
1 U
2 A
3 U
....

I.e., only 3 loads every 3 cycles instead of 4, or a 25% penalty to throughput, because the aligned loads end up being singletons as well. You might ask why OoO wouldn't solve this - well, OoO is based on the scheduler, which understands instruction dependencies and has a few other special-case tricks to re-order things (e.g., to avoid port retirement conflicts), but otherwise still does stuff in-order. So it likely can't understand that it should try to reorder the loads to pair aligned loads. Furthermore, the memory model imposes restrictions on reordering loads (but I don't fully grok how this actually falls out in practice when you consider load buffers and the coherency protocol and so on).

All that to say that reordering the loads might easily swap the behavior from an AAUU behavior to an AUAU one.
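The pairing argument can be checked with a toy model. This is only a sketch of the guess made in this post, not a description of the real hardware: it assumes two L1 load paths per cycle, strictly in-order issue of loads, an aligned load (A) taking one path and a line-crossing load (U) taking both. The function name `cycles_for` is made up for illustration.

```python
def cycles_for(pattern: str, paths: int = 2) -> int:
    """Cycles needed to issue `pattern` in order under the toy model.

    'A' = aligned load (1 load path), 'U' = line-crossing load (2 paths).
    A load that does not fit in the current cycle starts a new one;
    nothing is ever reordered.
    """
    cost = {"A": 1, "U": 2}
    cycles = 0
    free = 0  # load paths still unused in the current cycle
    for load in pattern:
        need = cost[load]
        if need > free:      # does not fit: move on to the next cycle
            cycles += 1
            free = paths
        free -= need
    return cycles


if __name__ == "__main__":
    for name in ("AAUU", "AUAU"):
        pattern = name * 100
        cycles = cycles_for(pattern)
        print(name, len(pattern) / cycles, "loads per cycle")
```

In this model AAUU takes 3 cycles per 4 loads and AUAU takes 4 cycles per 4 loads, which is the same 25% throughput gap as in the tables above.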

 