Agner`s CPU blog

Software optimization resources | E-mail subscription to this blog | www.agner.org

Sustained 64B loads per cycle on Haswell & Sky
Author:  Date: 2015-12-27 18:59
Agner wrote:
You are always limited by cache ways, read/write buffers, faulty prefetching, suboptimal reordering, etc.
Yes, although in my example I'm considering the much simpler case where there are two reads but no writes, and all data is already in L1. So although problematic in the real world, these shouldn't be a factor here. In fact, I see the same maximum speed if I read the same 4 vectors over and over rather than striding over all the data. I've refined my example, though, and think I now understand what's happening. The problem isn't a bank conflict, rather it's a slowdown due to unaligned access. I don't think I've seen this discussed before.

Contrary to my previous understanding, alignment makes a big difference on the speed at which vectors are read from L1 to register. If your data is 16B aligned rather than 32B aligned, a sequential read from L1 is no faster with 256-bit YMM reads than it is with 128-bit XMM reads. VMOVAPS and VMOVUPS have the same speed, but you cannot achieve 2 32B loads per cycle if the underlying data is not 32B aligned. If the data is 32B aligned, you still can't quite sustain 64 B/cycle of load with either, but you can get to about 54 B/cycle with both.

I put up new test code here: https://gist.github.com/nkurz/439ca1044e11181c1089

Results at L1 sizes are essentially the same on Haswell and Skylake.

Loading 4096 floats with 64 byte raw alignment
Vector alignment 8:
load_xmm : 19.79 bytes/cycle
load_xmm_nonsequential : 23.41 bytes/cycle
load_ymm : 28.64 bytes/cycle
load_ymm_nonsequential : 36.57 bytes/cycle

Vector alignment 16:
load_xmm : 29.26 bytes/cycle
load_xmm_nonsequential : 29.05 bytes/cycle
load_ymm : 28.44 bytes/cycle
load_ymm_nonsequential : 36.90 bytes/cycle

Vector alignment 24:
load_xmm : 19.79 bytes/cycle
load_xmm_nonsequential : 23.54 bytes/cycle
load_ymm : 28.64 bytes/cycle
load_ymm_nonsequential : 36.57 bytes/cycle

Vector alignment 32:
load_xmm : 29.05 bytes/cycle
load_xmm_nonsequential : 28.85 bytes/cycle
load_ymm : 53.19 bytes/cycle
load_ymm_nonsequential : 52.51 bytes/cycle

What this says is that unless your loads are 32B aligned, regardless
of method you are limited to about 40B loaded per cycle. If you are
sequentially loading non-32B aligned data from L1, the speeds for 16B
loads and 32B loads are identical, and limited to less than 32B per
cycle. All alignments not shown were the same as 8B alignment.

Loading in a non-sequential order is about 20% faster for unaligned
XMM and unaligned YMM loads. It's possible there is a faster order
than I have found so far. Aligned loads are the same speed
regardless of order. Maximum speed for aligned XMM loads is about 30
B/cycle, and maximum speed for aligned YMM loads is about 54 B/cycle.

At L2 sizes, the effect still exists, but is less extreme. XMM loads
are limited to 13-15 B/cycle on both Haswell and Skylake. On Haswell,
YMM non-aligned loads are 18-20 B/cycle, and YMM aligned loads are
24-26 B/cycle. On Skylake, YMM aligned loads are slightly faster at
27 B/cycle. Interestingly, sequential unaligned L2 loads on Skylake
are almost the same as aligned loads (26 B/cycle), while non-sequential
loads are much slower (17 B/cycle).

At L3 sizes, alignment is barey a factor. On Haswell, all loads are
limited to 11-13 B/cycle. On Skylake, XMM loads are the same 11-13
B/cycle, while YMM loads are slightly faster at 14-17 B/cycle.

Coming from memory, XMM and YMM loads on Haswell are the same
regardless of alignment, at about 5 B/cycle. On Skylake, XMM loads
are about 6.25 B/cycle, and YMM loads are about 6.75 B/cycle, with
little dependence on alignment. It's possible that prefetch can
improve these speeds slightly.

The write operations may sometimes use port 2 or 3 for address calculation, where the maximum throughput requires that they use port 7.
I don't recall if you mention it in your manuals, but I presume you are aware that Port 7 on Haswell and Skylake is only capable of "simple" address calculations? Thus sustaining 2 loads and a store is only possible if the store address is [const + base] form rather than [const + index*scale + base]. And as you point out, even if you do this, it can still be difficult to force the processor to use only Port 7 for the store address.
 
thread Test results for Broadwell and Skylake new - Agner - 2015-12-26
replythread Sustained 64B loads per cycle on Haswell & Sky new - Nathan Kurz - 2015-12-26
last replythread Sustained 64B loads per cycle on Haswell & Sky new - Agner - 2015-12-27
last replythread Sustained 64B loads per cycle on Haswell & Sky - Nathan Kurz - 2015-12-27
reply Sustained 64B loads per cycle on Haswell & Sky new - John D. McCalpin - 2016-01-04
reply Sustained 64B loads per cycle on Haswell & Sky new - T - 2016-06-18
last reply Sustained 64B loads per cycle on Haswell & Sky new - Jens Nurmann - 2017-01-12
replythread Test results for Broadwell and Skylake new - Peter Cordes - 2015-12-28
last reply Test results for Broadwell and Skylake new - Agner - 2015-12-29
replythread Test results for Broadwell and Skylake new - Tacit Murky - 2016-01-04
last replythread Test results for Broadwell and Skylake new - Agner - 2016-01-05
last replythread Test results for Broadwell and Skylake new - Tacit Murky - 2016-03-09
last reply Test results for Broadwell and Skylake new - Tacit Murky - 2016-06-05
replythread Minor bug in the microarchitecture manual new - SHK - 2016-01-10
last reply Minor bug in the microarchitecture manual new - Agner - 2016-01-16
replythread Test results for Broadwell and Skylake new - John D. McCalpin - 2016-01-12
last replythread Test results for Broadwell and Skylake new - Jess - 2016-02-11
last reply Description of discrepancy new - Nathan Kurz - 2016-03-13
reply Test results for Broadwell and Skylake new - Russell Van Zandt - 2016-02-22
replythread Instruction Throughput on Skylake new - Nathan Kurz - 2016-04-23
last replythread Instruction Throughput on Skylake new - Agner - 2016-04-24
last replythread Instruction Throughput on Skylake new - Nathan Kurz - 2016-04-26
last replythread Instruction Throughput on Skylake new - Agner - 2016-04-27
last replythread Instruction Throughput on Skylake new - T - 2016-06-18
reply Instruction Throughput on Skylake new - Agner - 2016-06-19
last replythread Instruction Throughput on Skylake new - Nathan Kurz - 2016-07-08
last replythread Instruction Throughput on Skylake new - Nathan Kurz - 2016-07-11
replythread Instruction Throughput on Skylake new - Tacit Murky - 2016-07-17
last replythread Haswell register renaming / unfused limits new - Peter Cordes - 2017-05-11
reply Haswell register renaming / unfused limits new - Tacit Murky - 2017-05-11
last reply Haswell register renaming / unfused limits new - Peter Cordes - 2017-05-12
last reply Instruction Throughput on Skylake new - T - 2016-08-08
reply Unlamination of micro-fused ops in SKL and earlier new - Travis - 2016-09-09
replythread 32B store-forwarding is slower than 16B new - Peter Cordes - 2017-05-11
last replythread 32B store-forwarding is slower than 16B new - Fabian Giesen - 2017-06-28
last reply 32B store-forwarding is slower than 16B new - Agner - 2017-06-28
reply SHL/SHR r,cl latency is lower than throughput new - Peter Cordes - 2017-05-27
replythread Test results for Broadwell and Skylake new - Bulat Ziganshin - 2017-05-30
last replythread Test results for Broadwell and Skylake new - Agner - 2017-05-30
last replythread Test results for Broadwell and Skylake new - Bulat Ziganshin - 2017-05-30
last replythread Test results for Broadwell and Skylake new - - - 2017-06-19
replythread Test results for Broadwell and Skylake new - Jorcy Neto - 2017-06-20
last reply Test results for Broadwell and Skylake new - Jorcy Neto - 2017-06-20
replythread Test results for Broadwell and Skylake new - Bulat Ziganshin - 2017-06-21
reply Test results for Broadwell and Skylake new - Jorcy Neto - 2017-06-26
last replythread Test results for Broadwell and Skylake new - - - 2017-07-05
last replythread Test results for Broadwell and Skylake new - - - 2017-07-12
last reply Test results for Broadwell and Skylake new - Jorcy Neto - 2017-07-19
last replythread Test results for Broadwell and Skylake new - Xing Liu - 2017-06-28
last replythread Test results for Broadwell and Skylake new - Travis - 2017-06-29
last replythread Test results for Broadwell and Skylake new - Xing Liu - 2017-06-30
last reply Test results for Broadwell and Skylake new - Travis - 2017-07-13
reply Official information about uOps and latency SNB+ new - SEt - 2017-07-17
last replythread Test results for Broadwell and Skylake new - Armand Behroozi - 2020-10-07
last reply Test results for Broadwell and Skylake new - Agner - 2020-10-11