Agner's CPU blog

Sustained 64B loads per cycle on Haswell & Sky

Author:

Date: 2015-12-26 18:03

Hi Agner --

Great to see the updates for Skylake! Thanks for putting all the effort into making these. Your guides are tremendous resources.

You mention in your guides that bank conflicts should no longer be a problem for Haswell or Skylake, and that "There are two identical memory read ports (port 2 and 3) and one write port (port 4). These ports all have the full 256 bits width. This makes it possible to make two memory reads and one memory write per clock cycle, with any register size up to 256 bits." You also say that "It is always possible to do two cache reads in the same clock cycle without causing a cache bank conflict."

Do you have code that demonstrates this? Even without writes, I'm currently unable to create code that can sustain two 256-bit loads per cycle from L1D. I started with code that used a fused multiply-add, but then realized that I was being slowed down by the loads rather than the math. I'm also seeing timing effects that make me suspect that some sort of bank conflict must be occurring, since some orderings of loads from L1 are consistently faster than others. I've put my current test code up here: https://gist.github.com/nkurz/9a0ed5a9a6e591019b8e

When compiled with "gcc -fno-inline -std=gnu99 -Wall -O3 -g -march=native l1d.c -o l1d", results look like this on Haswell:
Testing with SIZE=4096...
calc_simple(array1, array2, size): 5.01 cycles/input
calc_fma(array1, array2, size): 0.22 cycles/input
calc_fma_reordered(array1, array2, size): 0.20 cycles/input
calc_load_only(array1, array2, size): 0.21 cycles/input [ERROR]
calc_load_only_reordered(array1, array2, size): 0.18 cycles/input [ERROR]

And like this on Skylake:
Testing with SIZE=4096...
calc_simple(array1, array2, size): 4.02 cycles/input
calc_fma(array1, array2, size): 0.20 cycles/input
calc_fma_reordered(array1, array2, size): 0.17 cycles/input
calc_load_only(array1, array2, size): 0.20 cycles/input [ERROR]
calc_load_only_reordered(array1, array2, size): 0.17 cycles/input [ERROR]

calc_simple() shows that the latency of an FMA on Haswell is 5 cycles, while it's only 4 cycles on Skylake. It's a simple approach in that there is no unrolling, so we are latency limited. So far, so good.
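
To illustrate why a loop without unrolling ends up latency-bound, here is a minimal sketch of a single-accumulator dot product. It is not the code from the gist: `dot_single_chain` is a hypothetical name, and the sketch assumes calc_simple() handles one scalar input per FMA, which is what roughly 5 cycles/input on Haswell suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a latency-bound loop (not the gist's code).
 * Every iteration reads and then writes `sum`, so each multiply-add has to
 * wait for the previous one to complete: about one input per FMA latency. */
float dot_single_chain(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        /* A compiler targeting FMA hardware may contract this into one FMA. */
        sum += a[i] * b[i];
    }
    return sum;
}
```

Because there is only one dependency chain through `sum`, adding more execution units or load ports would not be expected to make this loop faster.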

calc_fma() shows a straightforward approach of loading 4 YMM vectors of floats, and then multiplying them by another 4 YMM vectors of floats, using 4 separate accumulators. Results are slightly slower on Haswell than on Skylake, presumably because 4-way unrolling is not enough to hide the 5 cycle latency of the FMA on Haswell.
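
The effect of four accumulators can be sketched in portable C as below. This is an illustration under stated assumptions, not the gist's code: `dot_four_chains` and `chain_bound_cycles_per_input` are hypothetical names, each scalar accumulator stands in for one YMM register of 8 floats, and the bound is a simple model that only considers FMA dependency chains.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 4-accumulator dot product. The four sums are independent,
 * so up to four multiply-adds can be in flight at once. */
float dot_four_chains(const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);  /* sketch only: no remainder loop */
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}

/* Simple model (an assumption, not a measurement): with `accumulators`
 * independent chains of `lanes` floats each, one round of FMAs retires
 * accumulators * lanes inputs every `fma_latency` cycles at best. */
double chain_bound_cycles_per_input(double fma_latency, int accumulators,
                                    int lanes)
{
    assert(accumulators > 0 && lanes > 0);
    return fma_latency / ((double)accumulators * (double)lanes);
}
```

Under this model, `chain_bound_cycles_per_input(5, 4, 8)` gives 0.15625 for Haswell and `chain_bound_cycles_per_input(4, 4, 8)` gives 0.125 for Skylake. That ordering matches Haswell trailing Skylake, but both measured figures (0.22 and 0.20) sit above the chain limit.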

calc_fma_reordered() is the first surprise. This is the same as calc_fma(), but loads the vectors in a different order: +96, +32, +64, +0 instead of the in-order byte offsets of +0, +32, +64, +96. I haven't seen any theory that would explain why there would be a difference in speed for these two orders.

calc_load_only() is the next surprise. I dropped the FMA altogether, and just did the loads. We get a slight speedup on Haswell (agreeing with the FMA latency), but no speedup on Skylake. Since there is nothing in the loop but the loads, if we can execute two 32B loads per cycle, I would have expected to see 0.125 cycles per input. The [ERROR] on the line is expected, and is because we are not actually calculating the sum.
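
The 0.125 figure can be reproduced with a small calculation. This sketch assumes each "input" is one float from each of the two arrays (8 bytes in total), which fits the description of 4 YMM vectors multiplied by another 4 YMM vectors; the function names are hypothetical.

```c
#include <assert.h>

/* Best-case cycles per input if the only limit is L1D load bandwidth. */
double load_bound_cycles_per_input(int loads_per_cycle, int bytes_per_load,
                                   int bytes_per_input)
{
    assert(loads_per_cycle > 0 && bytes_per_load > 0 && bytes_per_input > 0);
    return (double)bytes_per_input /
           ((double)loads_per_cycle * (double)bytes_per_load);
}

/* Converts a measured cycles/input figure back into bytes loaded per cycle. */
double achieved_bytes_per_cycle(int bytes_per_input, double cycles_per_input)
{
    assert(bytes_per_input > 0 && cycles_per_input > 0.0);
    return (double)bytes_per_input / cycles_per_input;
}
```

With two 32-byte loads per cycle and 8 bytes per input, `load_bound_cycles_per_input(2, 32, 8)` gives 0.125. Going the other way, a measured 0.20 cycles/input corresponds to about 40 bytes per cycle under the same assumption, well short of 64.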

calc_load_only_reordered() continues the surprise. Once again, reading the vectors in non-linear order improves the speed considerably. But the result is still well short of the 0.125 cycles per input that we would see at the theoretical two loads per cycle. Again, [ERROR] is expected because there is no math being done.

Do you have any idea what's happening here? Why would the ordering of the loads matter if all the data is in L1D? Why can't I get to 0.125 cycles per input? I've inspected the results with 'perf record -F 10000 ./l1d' / 'perf report' on both machines, and the assembly looks like I'd expect. I can make the loop logic slightly better, but this doesn't seem to be the limiting factor. What do I need to do differently to achieve sustained load speeds of 64B per cycle on Haswell and Skylake?
