Agner's CPU blog


Sustained 64B loads per cycle on Haswell & Sky

Author: Agner

Date: 2015-12-27 01:48

Nathan Kurz wrote:

reading the vectors in non-linear order improves the speed considerably. But the speed is still much less than .125 that we would see for the theoretical 2 loads per cycle.

It is possible to make two reads and one write in the same clock cycle, but it is not possible to sustain this theoretical maximum as a continuous throughput. You are always limited by cache ways, read/write buffers, faulty prefetching, suboptimal reordering, etc. The write operations may sometimes use port 2 or 3 for address calculation, whereas maximum throughput requires that they use port 7. It is quite likely that there are other effects that I am not aware of. The execution times that I have measured for 2 reads and 1 write fluctuate a lot, and are typically 40-60% longer than the theoretical minimum.
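
To make the arithmetic concrete, here is a rough sketch of what "40-60% longer than the theoretical minimum" means for a loop that does 2 reads and 1 write per iteration. The peak rates and the overhead range are the figures quoted above; the function names and the iteration count are hypothetical and chosen only for illustration. This is plain arithmetic, not a measurement.

```python
# Illustrative arithmetic only. The peak rates are the theoretical figures
# from the post (2 reads and 1 write per clock cycle); all names here are
# hypothetical and not taken from any measurement tool.
LOADS_PER_CYCLE = 2   # theoretical peak: two reads per clock cycle
STORES_PER_CYCLE = 1  # theoretical peak: one write per clock cycle


def min_cycles(loads, stores):
    """Theoretical minimum cycles if reads and writes overlap perfectly."""
    return max(loads / LOADS_PER_CYCLE, stores / STORES_PER_CYCLE)


def expected_range(loads, stores, overhead=(0.40, 0.60)):
    """Cycle range for a run that is 40-60% slower than the minimum."""
    base = min_cycles(loads, stores)
    return base * (1 + overhead[0]), base * (1 + overhead[1])


# Assumed workload: 1000 iterations, each with 2 vector reads and 1 vector write.
iterations = 1000
base = min_cycles(2 * iterations, 1 * iterations)
low, high = expected_range(2 * iterations, 1 * iterations)
print(f"theoretical minimum: {base:.0f} cycles")
print(f"typical measurement: {low:.0f} to {high:.0f} cycles")
```

The sketch treats the read and write ports as fully independent, which is exactly the idealization that the effects listed above (cache ways, buffers, store address calculation on port 2 or 3 instead of port 7) break in practice.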

