Agner's CPU blog

Sustained 64B loads per cycle on Haswell & Sky
Author: Nathan Kurz  Date: 2015-12-26 18:03
Hi Agner --

Great to see the updates for Skylake! Thanks for putting all the effort into making these. Your guides are tremendous resources.

Your guides say that cache bank conflicts should no longer be a problem on Haswell or Skylake: "There are two identical memory read ports (port 2 and 3) and one write port (port 4). These ports all have the full 256 bits width. This makes it possible to make two memory reads and one memory write per clock cycle, with any register size up to 256 bits." You also say that "It is always possible to do two cache reads in the same clock cycle without causing a cache bank conflict."

Do you have code that demonstrates this? Even without writes, I'm currently unable to create code that can sustain two 256-bit loads per cycle from L1D. I started with code that used a fused multiply-add, but then realized that I was being slowed down by the loads rather than the math. I'm also seeing timing effects that make me suspect that some sort of bank conflict must be occurring, since some orderings of loads from L1 are consistently faster than others. I've put my current test code up here: https://gist.github.com/nkurz/9a0ed5a9a6e591019b8e

When compiled with "gcc -fno-inline -std=gnu99 -Wall -O3 -g -march=native l1d.c -o l1d", results look like this on Haswell:
Testing with SIZE=4096...
calc_simple(array1, array2, size): 5.01 cycles/input
calc_fma(array1, array2, size): 0.22 cycles/input
calc_fma_reordered(array1, array2, size): 0.20 cycles/input
calc_load_only(array1, array2, size): 0.21 cycles/input [ERROR]
calc_load_only_reordered(array1, array2, size): 0.18 cycles/input [ERROR]

And like this on Skylake:
Testing with SIZE=4096...
calc_simple(array1, array2, size): 4.02 cycles/input
calc_fma(array1, array2, size): 0.20 cycles/input
calc_fma_reordered(array1, array2, size): 0.17 cycles/input
calc_load_only(array1, array2, size): 0.20 cycles/input [ERROR]
calc_load_only_reordered(array1, array2, size): 0.17 cycles/input [ERROR]


calc_simple() shows that the latency of an FMA on Haswell is 5 cycles, while it's only 4 cycles on Skylake. It's a simple approach in that there is no unrolling, so we are latency limited. So far, so good.
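
For readers without the gist at hand, a latency-bound loop of this kind could be sketched as follows. This is only an illustration: the name dot_single_acc is hypothetical and the code is not copied from l1d.c. It assumes a scalar loop with one accumulator, which GCC normally leaves as a single dependency chain unless reassociation is allowed (for example with -ffast-math).

```c
#include <stddef.h>

/* Hypothetical stand-in for a "simple" kernel: one accumulator, no unrolling.
   Each multiply-add needs the previous value of sum, so the iterations form
   one dependency chain and cannot overlap. */
float dot_single_acc(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];   /* may become a single FMA when FMA is enabled */
    return sum;
}
```

With a loop like this, the time per iteration is set by the FMA latency rather than by load or FMA throughput, which is consistent with the roughly 5 and 4 cycles per input reported above.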

calc_fma() shows a straightforward approach of loading 4 YMM vectors of floats, and then multiplying them by another 4 YMM vectors of floats, using 4 separate accumulators. Results are slightly slower on Haswell than on Skylake, presumably because 4-way unrolling is not enough to hide the 5 cycle latency of the FMA on Haswell.
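
A rough sketch of the four-accumulator pattern, under some assumptions: it uses GCC/Clang vector extensions instead of the intrinsics the gist may use, the names v8f and dot_four_acc are hypothetical, and n is assumed to be a multiple of 32 floats. Whether the multiply and add fuse into an FMA depends on the compiler flags.

```c
#include <stddef.h>
#include <string.h>

/* 8 floats = 32 bytes, the width of one YMM register (GCC/Clang extension). */
typedef float v8f __attribute__((vector_size(32)));

/* Hypothetical 4-way unrolled dot product with 4 independent accumulators.
   Elements past the last full block of 32 floats are ignored. */
float dot_four_acc(const float *a, const float *b, size_t n)
{
    v8f acc[4];
    memset(acc, 0, sizeof acc);            /* all-zero bits are 0.0f */

    for (size_t i = 0; i + 32 <= n; i += 32) {
        for (int k = 0; k < 4; k++) {      /* byte offsets +0, +32, +64, +96 */
            v8f va, vb;
            memcpy(&va, a + i + 8 * k, sizeof va);   /* unaligned 32B load */
            memcpy(&vb, b + i + 8 * k, sizeof vb);
            acc[k] += va * vb;             /* chain k depends only on acc[k] */
        }
    }

    float sum = 0.0f;                      /* horizontal sum of all lanes */
    for (int k = 0; k < 4; k++)
        for (int j = 0; j < 8; j++)
            sum += acc[k][j];
    return sum;
}
```

In this sketch every multiply-add consumes two loaded vectors, so with two loads per cycle at most one multiply-add can start per cycle. Under that assumption, covering a 5-cycle latency would take at least five independent accumulators, while four would suffice for a 4-cycle latency.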

calc_fma_reordered() is the first surprise. This is the same as calc_fma(), but loads the vectors in a different order: +96, +32, +64, +0 instead of the in-order byte offsets of +0, +32, +64, +96. I haven't seen any theory that would explain why there would be a difference in speed for these two orders.
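
To make the reordering concrete, here is a sketch of the same kind of kernel with the loads issued in the +96, +32, +64, +0 order. The names load_order and dot_four_acc_reordered are hypothetical, the code uses GCC/Clang vector extensions rather than whatever the gist uses, and the compiler is free to reschedule the loads, so the emitted order should be checked in the disassembly.

```c
#include <stddef.h>
#include <string.h>

typedef float v8f __attribute__((vector_size(32)));   /* one YMM-sized vector */

/* Vector index within each 128-byte block, in the order the loads are issued.
   In bytes: +96, +32, +64, +0. The in-order version would be 0, 1, 2, 3. */
static const int load_order[4] = {3, 1, 2, 0};

/* Hypothetical reordered variant; n is assumed to be a multiple of 32. */
float dot_four_acc_reordered(const float *a, const float *b, size_t n)
{
    v8f acc[4];
    memset(acc, 0, sizeof acc);

    for (size_t i = 0; i + 32 <= n; i += 32) {
        for (int s = 0; s < 4; s++) {
            int k = load_order[s];         /* which 32-byte slice to touch */
            v8f va, vb;
            memcpy(&va, a + i + 8 * k, sizeof va);
            memcpy(&vb, b + i + 8 * k, sizeof vb);
            acc[k] += va * vb;
        }
    }

    float sum = 0.0f;
    for (int k = 0; k < 4; k++)
        for (int j = 0; j < 8; j++)
            sum += acc[k][j];
    return sum;
}
```

Each slice still feeds its own accumulator, so the arithmetic is the same as with in-order loads; only the sequence of addresses presented to the load ports changes.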

calc_load_only() is the next surprise. I dropped the FMA altogether, and just did the loads. We get a slight speedup on Haswell (agreeing with the FMA latency), but no speedup on Skylake. Since there is nothing in the loop but the loads, if we can execute two 32B loads per cycle, I would have expected to see 0.125 cycles per input. The [ERROR] on the line is expected, because we are not actually calculating the sum.
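
The expected figure can be checked with a little arithmetic. The sketch below assumes that one "input" means one 4-byte float from each of the two arrays, which is what two 32B loads per cycle at 0.125 cycles per input implies. The helper name l1d_bytes_per_cycle is hypothetical.

```c
/* Bytes loaded per input: one 4-byte float from each of the two arrays.
   This is an assumption about how the benchmark counts "inputs". */
#define BYTES_PER_INPUT 8.0

/* Convert a measured cycles/input figure into L1D bytes loaded per cycle. */
double l1d_bytes_per_cycle(double cycles_per_input)
{
    return BYTES_PER_INPUT / cycles_per_input;
}
```

Under that assumption, 0.125 cycles per input corresponds to 64 bytes per cycle, while the measured 0.21 (Haswell) and 0.20 (Skylake) correspond to roughly 38 and 40 bytes per cycle.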

calc_load_only_reordered() continues the surprise. Once again, reading the vectors in non-linear order improves the speed considerably. But the result is still well short of the 0.125 cycles per input that the theoretical 2 loads per cycle would give. Again, [ERROR] is expected because there is no math being done.

Do you have any idea what's happening here? Why would the ordering of the loads matter if all the results are in L1D? Why can't I get to 0.125 cycles per float? I've inspected the results with 'perf record -F 10000 ./l1d' / 'perf report' on both machines, and the assembly looks like I'd expect. I can make the loop logic slightly better, but this doesn't seem to be the limiting factor. What do I need to do differently to achieve sustained load speeds of 64B per cycle on Haswell and Skylake?

 
thread Test results for Broadwell and Skylake new - Agner - 2015-12-26
replythread Sustained 64B loads per cycle on Haswell & Sky - Nathan Kurz - 2015-12-26