Agner wrote:
The results show that instructions with three input dependencies are fusing alright and use only a single entry in the micro-operation cache
I found official confirmation in Intel's optimization manual that we're both right (see Section 2.3.2.4: "Micro-op Queue and the Loop Stream Detector (LSD)"). We were just measuring different things.
SnB-family still micro-fuses such instructions in the decoders and uop-cache, but "un-laminates" uops with an indexed addressing mode before the issue/rename stage. The uop format used in the ROB must be different from the format in the uop cache. The unfused-domain scheduler (RS) must still handle uops with indexed addressing modes, because pure loads with complex addressing modes are still a single p23 uop.
Tacit Murky's earlier post says that the un-lamination happens as uops are written to the IDQ, so the loop-buffer size is measured in un-laminated uops.
For the purposes of pipeline width and tight loops, indexed addressing modes don't micro-fuse. The 4-wide issue limit applies after un-lamination, so an un-laminated instruction takes two issue slots instead of one.
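To make the counting concrete, here's a rough Python sketch of the bookkeeping described above. It is only a toy model, not a simulator: the `Insn` class, the helper names, the example loop body and the per-instruction uop counts are all my own assumptions for illustration, and it ignores macro-fusion, port pressure and everything else that limits real throughput.

```python
from dataclasses import dataclass
from math import ceil

ISSUE_WIDTH = 4  # uops issued per cycle, counted after un-lamination


@dataclass(frozen=True)
class Insn:
    text: str
    unfused_uops: int  # assumed unfused-domain uop count (illustrative)
    indexed: bool      # True if the memory operand uses base+index addressing


def uop_cache_entries(insn: Insn) -> int:
    # Assumption: every instruction here is either a single uop or a
    # micro-fusable pair, so it takes one entry in the uop cache.
    return 1


def issue_slots(insn: Insn) -> int:
    # A micro-fused uop with an indexed addressing mode is un-laminated
    # before issue, so it occupies one slot per unfused uop.
    if insn.indexed and insn.unfused_uops > 1:
        return insn.unfused_uops
    return 1


# Hypothetical loop body, chosen only to exercise each case.
LOOP = [
    Insn("add eax, [rsi+rdx*4]", 2, True),   # load + ALU, indexed: un-laminated
    Insn("add ebx, [rsi]",       2, False),  # load + ALU, not indexed: stays fused
    Insn("mov ecx, [rsi+rdx*4]", 1, True),   # pure load: a single uop either way
    Insn("mov [rdi+rdx*4], eax", 2, True),   # store-address + store-data, indexed
]

cache_entries = sum(uop_cache_entries(i) for i in LOOP)
slots = sum(issue_slots(i) for i in LOOP)
min_cycles = ceil(slots / ISSUE_WIDTH)  # front-end lower bound per iteration

print(f"uop cache entries: {cache_entries}")
print(f"issue slots:       {slots}")
print(f"min cycles/iter:   {min_cycles}")
```

With these assumed counts the loop takes 4 uop-cache entries but 6 issue slots, so the front-end bound is 2 cycles per iteration rather than the 1 you'd expect from counting fused uops. That's the gap between the two sets of measurements.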
For the record, un-laminate is not a normal English word, but delaminate is. I want to put quotes around it every time I type it. >.<
BTW, I updated my answer on StackOverflow with this info.