Agner's CPU blog


Haswell register renaming / unfused limits
Author: Peter Cordes Date: 2017-05-12 20:22
Agner, your insn table says cmovcc r,m and adc r,m don't micro-fuse at all on HSW/SKL, but that doesn't match my experiments. They do micro-fuse on both HSW and SKL. (I didn't check SBB r,m).

I assume indexed addressing modes for cmov/adc are still fused in the decoder and un-laminated later, but I didn't check that. All I can see is that with an indexed addressing mode they're not micro-fused when they issue/retire.

I just made a major update to stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes, after testing things on HSW and SKL.

Peter Cordes wrote:

Interesting things that could be tested:
  • micro-fused FMA with a base+index addressing mode should be a 4-input fused-domain uop. (or maybe this will be unlaminated)
  • On Skylake, ADCX / ADOX if they micro-fuse. (ADC doesn't, according to the instruction tables). Or even just ADC r,r might be interesting.
Answer: FMA/ADC/CMOV on HSW and SKL are un-laminated with indexed addressing modes, so we can't have 4-input fused-domain uops.

This applies even to ADC/CMOV on Haswell, where they decode to 2 uops. So that's weird. I'm guessing they simply left those instructions alone from IvyBridge; maybe they ran into deadlines and didn't have time to change them until Broadwell. I.e., maybe they decided not to invest time in getting 3-input micro-fused uop support right when they knew they really wanted to make the register-source version a single uop (which would behave like FMA and un-laminate indexed addressing modes).

Unanswered questions: does un-lamination happen before the IDQ, or only at issue?

---------------

Re: Tacit Murky's suggestion to use a store to achieve 7 unfused-domain uops per clock: Good idea, this worked. Surprisingly, it even got it to run at 1.0 iterations per clock on SKL, with none of the stores stealing p23 from the loads.

;HTML pre is double-spacing this, so I'm just going to leave it flat :/
.loop: ; HSW: 1.12c / iter. SKL: 1.0001c
add edx, [rsp]
mov [rax], edi
blsi ebx, [rdi]
dec ecx
jnz .loop

SKL: 7 unfused uops per clock. HSW: 6.25. Register-reads per clock: 6 (not counting flags) total on SKL.
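
The uops-per-clock figures follow from simple arithmetic. Here is a minimal Python sketch of that calculation, assuming the usual unfused-domain breakdown (load + ALU for a memory-source op, store-address + store-data for a store, one uop for the macro-fused dec/jnz). The per-instruction counts are that assumption, not something measured here, and the names are only illustrative.

```python
# Unfused-domain uops per loop iteration (assumed breakdown, not measured here).
UNFUSED_UOPS = {
    "add edx, [rsp]": 2,   # load + ALU
    "mov [rax], edi": 2,   # store-address + store-data
    "blsi ebx, [rdi]": 2,  # load + ALU
    "dec ecx / jnz": 1,    # macro-fused into a single uop
}

def unfused_uops_per_clock(cycles_per_iter):
    """Unfused-domain uops sustained per clock at the given loop speed."""
    return sum(UNFUSED_UOPS.values()) / cycles_per_iter

print(f"SKL: {unfused_uops_per_clock(1.00):.2f} unfused uops/clock")
print(f"HSW: {unfused_uops_per_clock(1.12):.2f} unfused uops/clock")
```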

In my previous testing, I had assumed 32 vs. 64b operand-size didn't matter. But this loop runs at 1 iter per 1.12c with a 64b add, vs. 1.000c with a 32b add, on SKL. Totally bizarre. All three memory ops are in separate cache lines. I forget if that mattered.

The store has to be a simple addressing mode to run on port7, which is of course essential. IDK why HSW only runs this at 1.12c per iter, not nearly as close to 1.00 as SKL.

blsi r, [r+r] is 2 fused-domain uops, which is unexpected. (Changing it to an add is also a slowdown, I think because of reading the destination register).


With maximum register-reads:

.loop: ; HSW: 1.75c SKL: 1.42c.
add edx, [rsp+rsi]
mov [rax], edi ; An indexed store brings us up to HSW: 1.90c SKL: 1.55c
add ebx, [rdi+r8]
sub ecx, r9d ; r9d = 1
jnz .loop

Register reads per clock: HSW: 10/1.75 = 5.71/c total. SKL: 10/1.42 = 7.04/c total. Or with an indexed store: HSW: 11/1.90 = 5.79/c total GPRs read, SKL: 11/1.55 = 7.08/c.
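
The register-read totals can be tallied the same way. This is a rough Python sketch that lists the GPRs each instruction reads (flags are not counted, matching the numbers above) and divides by the quoted cycle counts. The `extra_reads` parameter is a hypothetical stand-in for the one additional index register an indexed store would read; the post does not say which register was used.

```python
# GPRs read by each instruction of the "maximum register-reads" loop.
# jnz reads only flags, which are not counted here.
GPR_READS = {
    "add edx, [rsp+rsi]": ("edx", "rsp", "rsi"),
    "mov [rax], edi": ("rax", "edi"),
    "add ebx, [rdi+r8]": ("ebx", "rdi", "r8"),
    "sub ecx, r9d": ("ecx", "r9d"),
}

def reads_per_clock(cycles_per_iter, extra_reads=0):
    """GPR reads per clock; extra_reads models an added index register."""
    total = sum(len(regs) for regs in GPR_READS.values()) + extra_reads
    return total / cycles_per_iter

for label, cycles, extra in [
    ("HSW", 1.75, 0),
    ("SKL", 1.42, 0),
    ("HSW, indexed store", 1.90, 1),
    ("SKL, indexed store", 1.55, 1),
]:
    print(f"{label}: {reads_per_clock(cycles, extra):.2f} reads/clock")
```

The cycle counts fed in here are the rounded ones quoted in the loop comments, so the last digit can differ slightly from a figure computed from the unrounded measurement.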

-------------

To test for issue/rename bottlenecks vs. execution bottlenecks, I could make the loop longer and have a section of all-micro-fused instructions, and then a section of "easy" instructions. That way the OOO core can easily keep up on average if the front-end issues 4 fused-domain uops per clock. But to do that, the front-end would have to issue 8 unfused uops in a single cycle without stalling whenever there are at least 7 micro-fused uops in a row. I'll try that later, when I have time to get back to this.
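
One way to see where the "at least 7 in a row" threshold comes from is a toy model. It assumes (my simplification, not something the hardware is known to do) that the front-end issues fixed, aligned groups of 4 fused-domain uops and that every instruction around the micro-fused run is a single uop. The Python sketch below slides the run across all four alignments and reports the peak unfused-uop count that is unavoidable in some issue group. Function names are illustrative.

```python
def max_unfused_per_issue_group(run_length, offset, width=4):
    """Largest unfused-uop count in any issue group.

    The stream is `offset` single-uop instructions, then `run_length`
    micro-fused instructions (2 unfused uops each), then single-uop padding.
    Groups are aligned chunks of `width` fused-domain uops (toy assumption).
    """
    stream = [1] * offset + [2] * run_length + [1] * width
    groups = [stream[i:i + width] for i in range(0, len(stream), width)]
    return max(sum(group) for group in groups)

def worst_case(run_length, width=4):
    """Minimum over all alignments of the peak unfused uops per group."""
    return min(max_unfused_per_issue_group(run_length, off, width)
               for off in range(width))

for n in range(4, 9):
    print(f"{n} micro-fused in a row -> at least {worst_case(n)} unfused in some group")
```

In this model a run of 6 can always be aligned so that no group exceeds 7 unfused uops, while a run of 7 forces one group to be entirely micro-fused, i.e. 8 unfused uops in one cycle.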

 