Agner`s CPU blog

Software optimization resources | E-mail subscription to this blog | www.agner.org

List Messageboards

Sustained 64B loads per cycle on Haswell & Sky

Author:

Date: 2016-01-04 07:21

Thanks to Nathan Kurz for the interesting test code.

I was able to reproduce the results on a Xeon E5-2660 v3 system once I pinned the core frequency to match the nominal frequency (2.5 GHz on that system).

It looks like the results are actually a bit better than reported because the tests are short enough that the timer overhead is not negligible. I modified the code to print out the "cycle_diff" variable in each case and see that the fastest tests are only about 312 cycles. RDTSCP overhead on this system is 32 cycles (for my very similar inline assembly), which suggests that the loop is only taking about 280 cycles. This raises the estimate of the throughput from 52.5 Bytes/cycle to 52.5*312/280 = 58.5 Bytes/cycle. This is 91.4% of peak, which is almost as fast as the best results I have been able to obtain with a DDOT kernel.

For my DDOT measurements, I ran a variety of problem sizes and did a least-squares fit to estimate the slope and intercept of the cycle count as a function of problem size. This gave estimated slopes corresponding to up to ~95% of 64 Bytes/cycle. (I used this approach because I was reading not only the TSC, but up to 8 PMCs as well, and the total overhead became quite large -- well over 200 cycles.)

In my experience, it is exceedingly difficult to understand performance limiters once you have reached this level of performance -- even if you are on the hardware engineering team! As a rule of thumb, anything exceeding 8/9 (88.9%) of the simple theoretical peak is pretty close to asymptotic, and exceeding 16/17 (94.1%) of peak is extremely uncommon.

Reply To This Message

Previous Message

Test results for Broadwell and Skylake new - Agner - 2015-12-26

Sustained 64B loads per cycle on Haswell & Sky new - Nathan Kurz - 2015-12-26

Sustained 64B loads per cycle on Haswell & Sky new - Agner - 2015-12-27

Sustained 64B loads per cycle on Haswell & Sky new - Nathan Kurz - 2015-12-27

Sustained 64B loads per cycle on Haswell & Sky - John D. McCalpin - 2016-01-04

Sustained 64B loads per cycle on Haswell & Sky new - T - 2016-06-18

Sustained 64B loads per cycle on Haswell & Sky new - Jens Nurmann - 2017-01-12

Test results for Broadwell and Skylake new - Peter Cordes - 2015-12-28

Test results for Broadwell and Skylake new - Agner - 2015-12-29

Test results for Broadwell and Skylake new - Tacit Murky - 2016-01-04

Test results for Broadwell and Skylake new - Agner - 2016-01-05

Test results for Broadwell and Skylake new - Tacit Murky - 2016-03-09

Test results for Broadwell and Skylake new - Tacit Murky - 2016-06-05

Minor bug in the microarchitecture manual new - SHK - 2016-01-10

Minor bug in the microarchitecture manual new - Agner - 2016-01-16

Test results for Broadwell and Skylake new - John D. McCalpin - 2016-01-12

Test results for Broadwell and Skylake new - Jess - 2016-02-11

Description of discrepancy new - Nathan Kurz - 2016-03-13

Test results for Broadwell and Skylake new - Russell Van Zandt - 2016-02-22

Instruction Throughput on Skylake new - Nathan Kurz - 2016-04-23

Instruction Throughput on Skylake new - Agner - 2016-04-24

Instruction Throughput on Skylake new - Nathan Kurz - 2016-04-26

Instruction Throughput on Skylake new - Agner - 2016-04-27

Instruction Throughput on Skylake new - T - 2016-06-18

Instruction Throughput on Skylake new - Agner - 2016-06-19

Instruction Throughput on Skylake new - Nathan Kurz - 2016-07-08

Instruction Throughput on Skylake new - Nathan Kurz - 2016-07-11

Instruction Throughput on Skylake new - Tacit Murky - 2016-07-17

Haswell register renaming / unfused limits new - Peter Cordes - 2017-05-11

Haswell register renaming / unfused limits new - Tacit Murky - 2017-05-11

Haswell register renaming / unfused limits new - Peter Cordes - 2017-05-12

Instruction Throughput on Skylake new - T - 2016-08-08

Unlamination of micro-fused ops in SKL and earlier new - Travis - 2016-09-09

32B store-forwarding is slower than 16B new - Peter Cordes - 2017-05-11

32B store-forwarding is slower than 16B new - Fabian Giesen - 2017-06-28

32B store-forwarding is slower than 16B new - Agner - 2017-06-28

SHL/SHR r,cl latency is lower than throughput new - Peter Cordes - 2017-05-27

Test results for Broadwell and Skylake new - Bulat Ziganshin - 2017-05-30

Test results for Broadwell and Skylake new - Agner - 2017-05-30

Test results for Broadwell and Skylake new - Bulat Ziganshin - 2017-05-30

Test results for Broadwell and Skylake new - - - 2017-06-19

Test results for Broadwell and Skylake new - Jorcy Neto - 2017-06-20

Test results for Broadwell and Skylake new - Jorcy Neto - 2017-06-20

Test results for Broadwell and Skylake new - Bulat Ziganshin - 2017-06-21

Test results for Broadwell and Skylake new - Jorcy Neto - 2017-06-26

Test results for Broadwell and Skylake new - - - 2017-07-05

Test results for Broadwell and Skylake new - - - 2017-07-12

Test results for Broadwell and Skylake new - Jorcy Neto - 2017-07-19

Test results for Broadwell and Skylake new - Xing Liu - 2017-06-28

Test results for Broadwell and Skylake new - Travis - 2017-06-29

Test results for Broadwell and Skylake new - Xing Liu - 2017-06-30

Test results for Broadwell and Skylake new - Travis - 2017-07-13

Official information about uOps and latency SNB+ new - SEt - 2017-07-17

Test results for Broadwell and Skylake new - Armand Behroozi - 2020-10-07

Test results for Broadwell and Skylake new - Agner - 2020-10-11

List Messageboards