Agner wrote:
You are always limited by cache ways,
read/write buffers, faulty prefetching, suboptimal
reordering, etc.
Yes, although in my example I'm considering the much simpler case where there are two reads but no writes, and all data is already in L1. So although problematic in the real world, these shouldn't be a factor here. In fact, I see the same maximum speed if I read the same 4 vectors over and over rather than striding over all the data.

I've refined my example, though, and think I now understand what's happening. The problem isn't a bank conflict, rather it's a slowdown due to unaligned access. I don't think I've seen this discussed before.

Contrary to my previous understanding, alignment makes a big difference on the speed at which vectors are read from L1 to register. If your data is 16B aligned rather than 32B aligned, a sequential read from L1 is no faster with 256-bit YMM reads than it is with 128-bit XMM reads. VMOVAPS and VMOVUPS have the same speed, but you cannot achieve 2 32B loads per cycle if the underlying data is not 32B aligned. If the data is 32B aligned, you still can't quite sustain 64 B/cycle of load with either, but you can get to about 54 B/cycle with both.

I put up new test code here: https://gist.github.com/nkurz/439ca1044e11181c1089

Results at L1 sizes are essentially the same on Haswell and Skylake.

Loading 4096 floats with 64 byte raw alignment
Vector alignment 8:
load_xmm : 19.79 bytes/cycle
load_xmm_nonsequential : 23.41 bytes/cycle
load_ymm : 28.64 bytes/cycle
load_ymm_nonsequential : 36.57 bytes/cycle

Vector alignment 16:
load_xmm : 29.26 bytes/cycle
load_xmm_nonsequential : 29.05 bytes/cycle
load_ymm : 28.44 bytes/cycle
load_ymm_nonsequential : 36.90 bytes/cycle

Vector alignment 24:
load_xmm : 19.79 bytes/cycle
load_xmm_nonsequential : 23.54 bytes/cycle
load_ymm : 28.64 bytes/cycle
load_ymm_nonsequential : 36.57 bytes/cycle

Vector alignment 32:
load_xmm : 29.05 bytes/cycle
load_xmm_nonsequential : 28.85 bytes/cycle
load_ymm : 53.19 bytes/cycle
load_ymm_nonsequential : 52.51 bytes/cycle

What this says is that unless your loads are 32B aligned, regardless
of method you are limited to about 40B loaded per cycle. If you are
sequentially loading non-32B aligned data from L1, the speeds for 16B
loads and 32B loads are identical, and limited to less than 32B per
cycle. All alignments not shown were the same as 8B alignment.

Loading in a non-sequential order is about 20% faster for unaligned
XMM and unaligned YMM loads. It's possible there is a faster order
than I have found so far. Aligned loads are the same speed
regardless of order. Maximum speed for aligned XMM loads is about 30
B/cycle, and maximum speed for aligned YMM loads is about 54 B/cycle.

At L2 sizes, the effect still exists, but is less extreme. XMM loads
are limited to 13-15 B/cycle on both Haswell and Skylake. On Haswell,
YMM non-aligned loads are 18-20 B/cycle, and YMM aligned loads are
24-26 B/cycle. On Skylake, YMM aligned loads are slightly faster at
27 B/cycle. Interestingly, sequential unaligned L2 loads on Skylake
are almost the same as aligned loads (26 B/cycle), while non-sequential
loads are much slower (17 B/cycle).

At L3 sizes, alignment is barely a factor. On Haswell, all loads are
limited to 11-13 B/cycle. On Skylake, XMM loads are the same 11-13
B/cycle, while YMM loads are slightly faster at 14-17 B/cycle.

Coming from memory, XMM and YMM loads on Haswell are the same
regardless of alignment, at about 5 B/cycle. On Skylake, XMM loads
are about 6.25 B/cycle, and YMM loads are about 6.75 B/cycle, with
little dependence on alignment. It's possible that prefetch can
improve these speeds slightly.
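
The L1 table above does not say why the misaligned cases are slower. One way to look at it, which is an inference on my part and not something measured in the gist, is to count how many loads straddle a 64-byte cache line. The sketch below assumes that "vector alignment N" means the loads start N bytes past a 64-byte-aligned base, and that the sweep is sequential over 4096 floats. The names `raw`, `count_line_splits` and `print_split_table` are made up for this illustration. It only inspects addresses and performs no vector loads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE 64      /* cache line size on Haswell and Skylake */
#define N_FLOATS   4096    /* buffer size used in the results above */

/* Raw buffer aligned to a cache line, with slack for the start offset.
   (Hypothetical stand-in for the buffer in the test code.) */
static _Alignas(CACHE_LINE) unsigned char raw[N_FLOATS * sizeof(float) + CACHE_LINE];

/* Count how many `width`-byte loads straddle a cache-line boundary when
   sweeping N_FLOATS floats sequentially, starting at raw + offset.
   Only the addresses are examined; nothing is actually loaded. */
static size_t count_line_splits(size_t offset, size_t width)
{
    assert(width > 0 && offset < CACHE_LINE);
    const size_t total = N_FLOATS * sizeof(float);
    size_t splits = 0;
    for (size_t pos = 0; pos + width <= total; pos += width) {
        uintptr_t first = (uintptr_t)(raw + offset + pos);
        uintptr_t last  = first + width - 1;
        if (first / CACHE_LINE != last / CACHE_LINE)
            splits++;   /* first and last byte fall in different lines */
    }
    return splits;
}

/* Print split counts for the four alignments listed in the results. */
static void print_split_table(void)
{
    static const size_t offsets[] = {8, 16, 24, 32};
    for (size_t i = 0; i < sizeof offsets / sizeof offsets[0]; i++)
        printf("alignment %2zu: xmm splits %zu, ymm splits %zu\n",
               offsets[i],
               count_line_splits(offsets[i], 16),   /* 128-bit loads */
               count_line_splits(offsets[i], 32));  /* 256-bit loads */
}
```

Under these assumptions, calling `print_split_table()` from `main` shows that offsets 8 and 24 split every fourth XMM load, that offset 16 splits no XMM loads, and that every offset other than 32 splits every second YMM load. That lines up with which rows in the table are slow, but it is only a counting argument and does not prove that line splits are the cause.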
Agner wrote:
The write operations may sometimes
use port 2 or 3 for address calculation, where the
maximum throughput requires that they use port 7.
I don't recall if you mention it in your manuals, but I presume you are aware that Port 7 on Haswell and Skylake is only capable of "simple" address calculations? Thus sustaining 2 loads and a store is only possible if the store address is [const + base] form rather than [const + index*scale + base]. And as you point out, even if you do this, it can still be difficult to force the processor to use only Port 7 for the store address.
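
To make the two addressing forms concrete, here is a minimal C sketch of a loop with two loads and one store, written once with a shared index and once with a separate pointer per stream. The function names are hypothetical. Which addressing mode the compiler actually emits depends on the compiler and its flags, so this only shows the source-level shape that makes a [const + base] store possible. You would have to check the generated assembly to confirm it.

```c
#include <assert.h>
#include <stddef.h>

/* Indexed form: the compiler may address all three arrays as
   [base + index*4], which gives the store an index register. */
static void add_indexed(float *dst, const float *a, const float *b, size_t n)
{
    assert(n == 0 || (dst && a && b));
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* Pointer-bump form: each stream has its own pointer, so the store can
   be addressed as [base + const] without an index register. */
static void add_pointer_bump(float *dst, const float *a, const float *b, size_t n)
{
    assert(n == 0 || (dst && a && b));
    if (n == 0)
        return;
    for (const float *end = a + n; a != end; a++, b++, dst++)
        *dst = *a + *b;
}
```

Both functions compute the same result. The pointer-bump version costs extra pointer increments per iteration, so whether it is a net gain depends on the loop and on how far it is unrolled.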