uop micro-fusion on Intel SnB seems to be possible only when it doesn't create uops with more than 2 input dependencies. Intel's code analyzer (IACA, from https://software.intel.com/en-us/articles/intel-architecture-code-analyzer) knows about this, and experiments on Sandybridge hardware confirm that it's real: see my answer to stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes. I didn't see any mention of this in your optimization manual or microarchitecture docs.

I tested again with store instructions, as that's an example used in your microarch doc, and it seems they can only fuse when 1-reg addressing modes are used. For example:

    mov [rsi + 0 + rdi], eax   ; produces as many fused as unfused uops
    mov [rsi + 0], eax         ; produces 1 fused-domain, 2 unfused-domain uops

I'm doing all my testing on 64bit Linux, on an i5 2500k (SnB). Since your example used 32bit registers, I also tested a 32bit binary and got the same result: mov [esi+edi], eax can't micro-fuse. Assembled/linked with:
yasm -f elf32 uop-test.s && ld -m elf_i386 uop-test.o -o 32.uop-test
file(1) says it's a 32bit elf statically linked binary, so I'm pretty sure I did this right. :P

In the Core2/Nehalem section of your architecture guide, you say:
"A fused μop can have three input dependencies, while an unfused μop can have only two."

I think this is wrong. I haven't tested Core2 or Nehalem, just SnB, but the SnB/IvB section simply refers back to the Nehalem section without mentioning any caveats, and I'm sure it's wrong for SnB. IACA with -arch NHM doesn't show micro-fusion for stores or ALU ops that use 2-reg addresses, so this needs testing on Nehalem hardware, too. (IACA can't analyse for pre-Nehalem arches.)
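
For anyone repeating the test on other hardware, a loop along the following lines should be enough. This is only a sketch of the general approach, not the original uop-test.s: the buffer name, the iteration count, and the perf event names in the comments are assumptions, and the right events differ between microarchitectures and perf versions.

```nasm
; Hypothetical test loop (64-bit Linux, YASM/NASM syntax); not the original uop-test.s.
; Assemble/link:  yasm -f elf64 uop-test.s && ld uop-test.o -o uop-test
; Measure (event names are assumptions; check `perf list` on the target CPU):
;   perf stat -e uops_issued.any,uops_dispatched.thread ./uop-test

section .bss
buf:    resb 64                     ; scratch space that the stores write to

section .text
global _start
_start:
    lea     rsi, [rel buf]          ; base register points at the scratch buffer
    xor     edi, edi                ; index register = 0, so both forms hit the same address
    mov     ecx, 100000000          ; iteration count (arbitrary)

.loop:
    ; Enable exactly one of the two stores, rebuild, and compare the counters.
    mov     [rsi + 0 + rdi], eax    ; 2-register addressing mode
    ;mov    [rsi + 0], eax          ; 1-register addressing mode
    dec     ecx
    jnz     .loop

    xor     edi, edi                ; exit status 0
    mov     eax, 60                 ; exit system call number on x86-64 Linux
    syscall
```

Divide each counter by the iteration count to get per-iteration figures. The dec/jnz pair adds the same loop overhead to both variants, so any difference in the fused-domain count between the two builds comes from the store.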
off-topic: It'd be nice if the microarch doc didn't refer back to how things were somewhat different on older architectures quite as much. It gets to be a problem for micro-op fusion, where SnB refers you back to the Nehalem section AND the P-M section. At least the SnB section doesn't have anything new to add. I think it might be a good idea to have the Nehalem section not refer back to P-M, or at least summarize anything it doesn't say itself, though, since two levels of recursion is pushing it.
That's about the only bad thing I can say about your work, though! Overall, it's an amazing resource. Making each section stand alone would bloat things, and make it less obvious when things were the same for multiple CPUs, so that wouldn't be good, either.