Nathan Kurz wrote:
reading the vectors in non-linear order improves the speed
considerably. But the speed is still much less than
.125 that we would see for the theoretical 2 loads per cycle.
It is possible to make two reads and one write in the same clock cycle, but it is not possible to sustain a continuous throughput at this theoretical maximum. You are always limited by cache ways, read/write buffers, faulty prefetching, suboptimal reordering, etc. The write operations may sometimes use port 2 or 3 for address calculation, whereas maximum throughput requires that they use port 7 so that ports 2 and 3 stay free for the two reads. It is quite likely that there are other effects that I am not aware of. The execution times that I have measured for 2 reads and 1 write fluctuate a lot and are typically 40-60% longer than the theoretical minimum.
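
To put numbers on this, here is a rough back-of-the-envelope sketch in C. It only models the per-cycle limits mentioned above (2 reads and 1 write per clock) and then scales the result by the 40-60% overhead I measured. The function names and the constants are my own illustration, not anything taken from a manual or a profiler, and the model ignores every other bottleneck.

```c
/* Assumed peak limits taken from the discussion above:
   at most 2 reads and 1 write per clock cycle.
   These names are hypothetical, chosen for illustration only. */
enum { LOADS_PER_CYCLE = 2, STORES_PER_CYCLE = 1 };

/* Lower bound on clock cycles per loop iteration when only the
   read and write limits are considered. The slower of the two
   resources sets the bound. */
double min_cycles_per_iter(unsigned loads, unsigned stores)
{
    double load_cycles  = (double)loads  / LOADS_PER_CYCLE;
    double store_cycles = (double)stores / STORES_PER_CYCLE;
    return load_cycles > store_cycles ? load_cycles : store_cycles;
}

/* Scale a theoretical minimum by an overhead given in percent,
   e.g. 40.0 or 60.0 for the range quoted above. */
double with_overhead(double min_cycles, double overhead_percent)
{
    return min_cycles * (1.0 + overhead_percent / 100.0);
}
```

For a loop body with 2 reads and 1 write, `min_cycles_per_iter(2, 1)` gives 1 cycle per iteration, and `with_overhead` applied with 40 and 60 gives roughly 1.4 to 1.6 cycles per iteration, which is the range you should expect to see in practice rather than the ideal figure.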