There is an interesting effect that changed in Skylake (or at least in some architecture after Sandy Bridge, up to and including Skylake) but isn't covered in your manual. It concerns the behavior of micro-fused instructions with *complex* memory source or destination operands. Here "complex" means using both a base and an index register, so something like:

    add rax, [rbx + rcx]

On Sandy Bridge, this doesn't seem to micro-fuse in the same way as simpler addressing modes such as:

    add rax, [rbx + 16]

While it seems that the complex addressing modes fuse in the uop cache, the constituent uops are later "unlaminated" and consume rename and retirement resources as separate uops. This means that you cannot achieve a throughput of 4 micro-fused uops/cycle with these addressing modes.

The Intel optimization doc does touch on it briefly in 2.3.2.4 Micro-op Queue and the Loop Stream Detector (LSD):

In particular, loads combined with computational operations and all stores, when used
with indexed addressing, are represented as a single micro-op in the decoder or Decoded ICache.
In the micro-op queue they are fragmented into two micro-ops through a process called un-lamination,
one does the load and the other does the operation. A typical example is the following "load plus operation"
instruction:

    ADD RAX, [RBP+RSI] ; rax := rax + LD( RBP+RSI )

The Intel section is a bit unclear because it doesn't make explicit that this applies only to indexed addressing modes, and that avoiding indexed addressing can potentially achieve higher throughput.

This issue could be pretty critical for optimizing high-IPC loops, on a par with many similar issues covered in your doc. It means that jumping through a few hoops to use a simpler addressing mode could be worth it, beyond the latency benefits already documented in your guide (and beyond the ability to use the port 7 AGU for store-address calculation as well). It might be nice to add it to your doc!

There is an extensive investigation in this Stack Overflow question, which is what prompted me to post here. See in particular the answer from Peter Cordes, who shows the issue on Sandy Bridge. In another answer I have some tests that show the limitation is removed on Skylake, but we don't know exactly in which architecture it was removed. The Intel doc is mostly silent on that topic (unlamination is only discussed in the one SB-specific section I linked above).

If you have some other machines at your disposal, I have some code here that makes it easy to test the behavior (on Linux).
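
To make the throughput arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. It is only a toy model and not the test code mentioned above: it counts issue/rename slots and nothing else, so it ignores execution ports, load/store ports, latency, and every other bottleneck a real loop would hit. The `Instr` class, the helper names, and both loop bodies are hypothetical, made up purely for illustration, and the compare-and-branch pair is assumed to macro-fuse into one uop.

```python
from dataclasses import dataclass

ISSUE_WIDTH = 4  # fused-domain uops issued per cycle (the figure used above)


@dataclass(frozen=True)
class Instr:
    text: str
    micro_fused: bool = False  # load+op or store: two uops carried as one
    indexed: bool = False      # memory operand uses base + index


def issue_slots(instr, unlaminates):
    """Issue/rename slots one instruction takes in this toy model."""
    if unlaminates and instr.micro_fused and instr.indexed:
        return 2  # split back into two uops before rename
    return 1


def issue_bound(loop, unlaminates):
    """Lower bound on cycles per iteration from issue width alone."""
    return sum(issue_slots(i, unlaminates) for i in loop) / ISSUE_WIDTH


# Hypothetical loop summing two arrays through a shared index register.
indexed_loop = [
    Instr("add rax, [rbx + rcx]", micro_fused=True, indexed=True),
    Instr("add rdx, [rsi + rcx]", micro_fused=True, indexed=True),
    Instr("add rcx, 8"),
    Instr("cmp rcx, rdi / jne (assumed macro-fused)"),
]

# Same work with one pointer per array: one extra instruction.
simple_loop = [
    Instr("add rax, [rbx]", micro_fused=True),
    Instr("add rdx, [rsi]", micro_fused=True),
    Instr("add rbx, 8"),
    Instr("add rsi, 8"),
    Instr("cmp rbx, rdi / jne (assumed macro-fused)"),
]

print(issue_bound(indexed_loop, unlaminates=True))   # 1.5
print(issue_bound(simple_loop, unlaminates=True))    # 1.25
print(issue_bound(indexed_loop, unlaminates=False))  # 1.0
```

Under these assumptions, the pointer-increment version costs an extra instruction yet still has a lower issue bound than the indexed version on a core that unlaminates, while on a core that keeps indexed operands fused the indexed version comes out ahead. Real measurements are what matter here; the model only shows why the rewrite can pay off.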