Thank you very much for the good analysis. There is one restriction that your document does not mention. In Sandy Bridge and later processors, the instructions eligible for macro-op fusion (add, sub, and, cmp, test, inc, dec) seem to be decoded only by the simple decoders (3 of the 4). This restriction does not exist in Nehalem or earlier processors. Because there is a decoded uop cache, and the OoO backend executes these instructions at a throughput of 3 per cycle anyway, the restriction should have little impact on real-world performance. It might be a somewhat different story on Haswell, though, which has more execution ports.
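To make the effect concrete, here is a toy model of the restriction as I understand it. It is only a sketch of the hypothesis above, not a description of the real decoders: the function name `decode_cycles`, the in-order slot assignment, and the assumption that decoder 0 is the one that cannot take fusible instructions are all mine. It ignores the uop cache, instruction lengths, and fusion itself.

```python
# Toy model (hypothetical): 4 decoders per cycle, instructions assigned in order.
# Assumption: fusible-class instructions cannot use decoder 0, only decoders 1-3.
FUSIBLE = {"add", "sub", "and", "cmp", "test", "inc", "dec"}
DECODERS = 4


def decode_cycles(stream, restrict_fusible=True):
    """Count decode cycles for a list of mnemonics under the toy model."""
    cycles = 0
    i = 0
    while i < len(stream):
        slot = 0
        while i < len(stream) and slot < DECODERS:
            if restrict_fusible and slot == 0 and stream[i] in FUSIBLE:
                slot = 1  # decoder 0 stays idle this cycle
                continue
            i += 1
            slot += 1
        cycles += 1
    return cycles


all_fusible = ["add"] * 12
mixed = ["mov", "add", "add", "add"] * 3

print(decode_cycles(all_fusible, restrict_fusible=True))   # restricted case
print(decode_cycles(all_fusible, restrict_fusible=False))  # unrestricted case
print(decode_cycles(mixed, restrict_fusible=True))         # non-fusible leader fills decoder 0
```

In this model a stream made only of fusible-class instructions decodes at 3 per cycle instead of 4, while a stream with one other simple instruction in every group of four is not slowed down at all.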