Agner wrote:
Peter Cordes wrote:
uop micro-fusion on Intel SnB seems to be possible
only when it doesn't create uops with more than 2
input dependencies. I have now tested
this on Sandy Bridge, Ivy Bridge, Haswell and
Broadwell. I have not had access to test on a Skylake
yet. The results show that instructions with three
input dependencies are fusing alright and use only a
single entry in the micro-operation cache ...
I guess you didn't see my response on Stack Overflow to your 2nd answer there.
Our test results disagree. I see a change in the uop perf counters, and an increase in the clock cycles taken, when changing from "or eax, [rsi]" to "or eax, [rsi+rdi]". I didn't try to measure uop-cache slots, just a total cycle count and the fused/unfused uop counts. My full test code, and the Linux perf command I used to get data from the performance counters, is posted on Stack Overflow.
Based on Tacit Murky's information that SnB's internal uop format doesn't have room for a micro-fused index register, maybe 2-register addressing modes can still micro-fuse in the uop cache, but not in the pipeline where the ROB tracks them.
Did your test results make an assumption about uops in the uop cache being the same as fused-domain uops in the pipeline?
I re-ran my test after seeing your response, and I'm still sure I'm seeing 2-register source operands NOT micro-fusing. If I'm wrong, can you please have a look and help me figure out what's wrong with my test procedure? I've been using this command:
ocperf.py stat -r4 -e task-clock,cycles,instructions,uops_issued.any,uops_dispatched.thread,uops_retired.all,uops_retired.retire_slots,stalled-cycles-frontend,stalled-cycles-backend ./uop-test
I'm essentially testing fused-domain uops against the 4-wide limit of the pipeline for issuing / retiring 4 fused-domain uops per clock. Some of my fused-domain uops are NOPs, to avoid execution port unfused-domain bottlenecks on SnB.
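
To make the reasoning behind that test concrete, here is a rough Python sketch of the arithmetic. It is a simplified model, not my actual test code: it assumes the loop is limited only by the 4-wide issue limit and that an issue group never spans two loop iterations. The loop contents and the names min_cycles_per_iteration and loop_fused_uops are made up for illustration.

```python
import math

ISSUE_WIDTH = 4  # fused-domain uops issued per clock, as stated in the post


def min_cycles_per_iteration(fused_uops, width=ISSUE_WIDTH):
    """Lower bound on cycles per loop iteration from issue width alone.

    Assumption: issue groups do not span loop iterations, so a partial
    group still costs a full cycle.
    """
    if fused_uops <= 0:
        raise ValueError("fused_uops must be positive")
    return math.ceil(fused_uops / width)


def loop_fused_uops(memory_op_micro_fuses):
    """Fused-domain uop count for a hypothetical 4-instruction test loop.

    Hypothetical loop body: one ALU instruction with a memory source,
    two NOPs as filler, and a loop branch counted as one uop.
    """
    memory_op = 1 if memory_op_micro_fuses else 2  # load + ALU stay separate
    nops = 2
    loop_branch = 1
    return memory_op + nops + loop_branch


if __name__ == "__main__":
    for fuses in (True, False):
        uops = loop_fused_uops(fuses)
        print(fuses, uops, min_cycles_per_iteration(uops))
```

Under these assumptions, a loop that fits in exactly 4 fused-domain uops can run at one iteration per clock, while the same loop with one instruction that fails to micro-fuse needs 5 uops and at least two clocks. That is why a failure to micro-fuse should show up both in the fused-domain uop counters and in the cycle count.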