Agner's CPU blog


Sustained 64B loads per cycle on Haswell & Sky
Author: Agner Date: 2015-12-27 01:48
Nathan Kurz wrote:
reading the vectors in non-linear order improves the speed considerably. But the speed is still much less than .125 that we would see for the theoretical 2 loads per cycle.
It is possible to make two reads and one write in the same clock cycle, but it is not possible to sustain a continuous throughput at this theoretical maximum. You are always limited by cache ways, read/write buffers, faulty prefetching, suboptimal reordering, etc. The write operations may sometimes use port 2 or 3 for address calculation, where they compete with the two reads, whereas the maximum throughput requires that they use port 7. It is quite likely that there are other effects that I am not aware of. The execution times that I have measured for 2 reads and 1 write fluctuate a lot and are typically 40-60% longer than the theoretical minimum.
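To put numbers on this, here is a minimal sketch in C of a "2 reads and 1 write" kernel, together with the arithmetic for the theoretical minimum. It is not the test code used for the measurements above. The names add_arrays, min_cycles and expected_cycle_range are hypothetical, and the 32-byte vector width and the one-vector-per-cycle model are assumptions. Whether the compiler actually emits two 32-byte loads and one 32-byte store per iteration depends on the compiler and its options.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: one vector is 32 bytes (a 256-bit register). */
#define VEC_BYTES 32

/* Two reads and one write per element: dst[i] = a[i] + b[i].
   With optimization and AVX enabled, a compiler may vectorize this
   into two vector loads and one vector store per iteration. */
void add_arrays(float *restrict dst, const float *restrict a,
                const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* Theoretical minimum number of cycles, assuming the core sustains
   two vector loads and one vector store every cycle, i.e. one output
   vector per cycle. This is an idealized model, not a measurement. */
double min_cycles(size_t n_floats)
{
    size_t bytes = n_floats * sizeof(float);
    return (double)((bytes + VEC_BYTES - 1) / VEC_BYTES);
}

/* Cycle range if the measured time is 40-60% longer than the
   theoretical minimum, as reported in the post. */
void expected_cycle_range(size_t n_floats, double *lo, double *hi)
{
    assert(lo != NULL && hi != NULL);
    double m = min_cycles(n_floats);
    *lo = m * 1.4;
    *hi = m * 1.6;
}
```

Under these assumptions, 1024 floats (4096 bytes per array) need at least 128 cycles, so a measurement 40-60% above the minimum would land at roughly 179 to 205 cycles.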