Agner's CPU blog


 
thread Merging uops on Skylake - Travis - 2018-04-21
last replythread Merging uops on Skylake - Agner - 2018-04-22
last replythread Merging uops on Skylake - Travis - 2018-04-28
last replythread Merging uops on Skylake - Agner - 2018-04-28
replythread Merging uops on Skylake - Tacit Murky - 2018-05-20
reply Merging uops on Skylake - Agner - 2018-05-20
last replythread Merging uops on Skylake - Travis - 2018-05-22
last reply Merging uops on Skylake - Agner - 2018-05-23
last reply Merging uops on Skylake - Travis Downs - 2020-01-26
 
Merging uops on Skylake
Author: Travis Date: 2018-04-21 22:56
I have always assumed that the flag-merging behavior described in the uarch manual, where an extra uop is inserted to merge flags, also applies to Skylake (the manual mentions that it does).

Recently, however, during a discussion about exactly which situations cause the merging uop to be inserted [1], I tried measuring the presence of the merging uop and didn't find any at all, even in cases where one should definitely occur regardless of the precise behavior being debated in [1].

For example, I ran this test:


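xor eax, eax        ; rax = 0, so inc does not set ZF in practice
.top:
%rep 128
add rcx, 5          ; writes all arithmetic flags, including CF
inc rax             ; writes all arithmetic flags except CF
jna .never          ; reads CF (from add) and ZF (from inc)
%endrep
dec rdi             ; rdi holds the iteration count
jnz .top
ret
.never:
ud2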

The jna instruction reads ZF and CF which are set by inc and add respectively, so this should certainly require a "merge" uop to be inserted.

However, none of the performance counters I checked, including the uops-executed counters for all the ports, showed any evidence of the merging uop. For the 3-instruction sequence of add, inc, jna I always saw a total of 3 uops (note that there is no macro-fusion). This was true both for the test above and for ones that should not need any merging uop (e.g., with the positions of the add and inc instructions reversed).

All tests ran in 1.25 cycles for those 3 instructions, which I guess is the result of occasional port conflicts.

In your tests did you observe the flag merging uop via performance counter, or indirectly in some other way?

If it was via the performance counters, should the above test show evidence of the merging uop? Is it possible it has been eliminated in Skylake?


[1] In particular, whether the condition is (1) that a flag-reading instruction reads any flag set by an instruction which is not the last flag-setting instruction, or instead (2) that a flag-reading instruction reads a set of flags coming from two different instructions.
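The two candidate conditions in [1] can be made concrete with a small Python model. This is only a sketch: the flag sets follow the architectural definitions of add, inc, jna and jc, while the names flag_sources, condition_1 and condition_2 are invented here for illustration. The model describes the two readings of the rule, not what any particular CPU actually does.

```python
# Architectural status flags written and read by a few instructions.
WRITES = {
    "add": set("CPAZSO"),  # add writes all six status flags
    "inc": set("PAZSO"),   # inc writes everything except CF
}
READS = {
    "jna": set("CZ"),      # jna/jbe: taken if CF=1 or ZF=1
    "jc": set("C"),        # jc/jb: taken if CF=1
}


def flag_sources(writers, reader):
    """Map each flag the reader needs to the index of the most recent writer."""
    sources = {}
    for flag in READS[reader]:
        for idx in range(len(writers) - 1, -1, -1):
            if flag in WRITES[writers[idx]]:
                sources[flag] = idx
                break
    return sources


def condition_1(writers, reader):
    """Reading (1): some needed flag comes from a writer that is not the last one."""
    last = len(writers) - 1
    return any(idx != last for idx in flag_sources(writers, reader).values())


def condition_2(writers, reader):
    """Reading (2): the needed flags come from two different instructions."""
    return len(set(flag_sources(writers, reader).values())) > 1
```

In this model, add; inc; jna satisfies both readings, inc; add; jna satisfies neither, and add; inc; jc satisfies only reading (1), so that last sequence is one that separates the two readings.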

   
Merging uops on Skylake
Author: Agner Date: 2018-04-22 00:21
My tests show an extra µop retired when INC is followed by JBE. I have not tested which port it goes to. Strangely, I saw no extra µop when INC is followed by CMOVBE.
   
Merging uops on Skylake
Author: Travis Date: 2018-04-28 01:21
Isn't the extra uop you saw simply because inc doesn't macro-fuse with jbe? That is, I also see an "extra" uop with inc -> jbe, compared to say inc -> jne, but that is simply because jne macro-fuses and jbe doesn't. So you see 1 uop total for inc + jne and 2 for inc + jbe. If you separate the inc and the jump with a nop, macro-fusion is prevented and then both jbe and jne take the same number of uops (3 in total: one for the inc, one for the nop, and one for the jump). This also explains the apparently different behavior of cmov: no fusion is involved, so cmov would behave identically in all scenarios if there truly is no merging uop.
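The uop accounting above can be sketched in a few lines of Python. The sketch assumes that every instruction is a single uop unless an inc is immediately followed by a jump it can macro-fuse with, and that no merging uop exists. The set INC_FUSIBLE_JCC and the function name issued_uops are illustrative; the set is an assumption taken from the macro-fusion rules as commonly documented for recent Intel cores (inc/dec fuse with the equality and signed conditions, not with conditions that read the carry flag).

```python
# Assumption: conditional jumps that can macro-fuse with a preceding inc.
INC_FUSIBLE_JCC = {"je", "jne", "jl", "jge", "jle", "jg"}


def issued_uops(instrs):
    """Count uops for a sequence, merging each fusible inc+jcc pair into one uop."""
    uops = 0
    i = 0
    while i < len(instrs):
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if instrs[i] == "inc" and nxt in INC_FUSIBLE_JCC:
            i += 2  # the pair issues as a single macro-fused uop
        else:
            i += 1  # every other instruction counts as one uop
        uops += 1
    return uops


for seq in (["inc", "jne"], ["inc", "jbe"],
            ["inc", "nop", "jne"], ["inc", "nop", "jbe"]):
    print(" ; ".join(seq), "->", issued_uops(seq), "uops")
```

Under these assumptions the counts are 1 and 2 for the adjacent pairs and 3 for both nop-separated sequences, which is the pattern described above without needing any merging uop.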

Can you share a bit more of your test methodology specifically with respect to this so we can get to the bottom of it? I can run tests on Skylake in Windows or Linux as necessary.

   
Merging uops on Skylake
Author: Agner Date: 2018-04-28 05:35
You are right. There is no extra µop in your case. But there is an extra µop when BT (bit test, modifying carry flag only) is followed by CMOVBE (conditional move reading carry flag and zero flag).

I wonder if they have made special-case hardware for cases like INC / JBE even though this combination is likely to occur as a programming bug rather than a deliberate combination of conditions?

   
Merging uops on Skylake
Author: Tacit Murky Date: 2018-05-20 15:12
It's worth noting that (in Intel CPUs at least) the carry flag is renamed separately, and the others (called the SPAZO cluster) may require a merging uop all the time. So jg(e) and jl(e) should be tested as well.
   
Merging uops on Skylake
Author: Agner Date: 2018-05-20 23:25
Tacit Murky wrote:
It's worth noting that (in Intel CPUs at least) the carry flag is renamed separately, and the others (called the SPAZO cluster) may require a merging uop all the time. So jg(e) and jl(e) should be tested as well.
A search for "SPAZO cluster" gives this Intel patent:
www.freepatentsonline.com/y2014/0310504.html
   
Merging uops on Skylake
Author: Travis Date: 2018-05-22 17:35
It makes sense that SPAZO (love the name!) is renamed separately from C, since those are exactly the flags that are updated by inc/dec, and the primary "problem" for flag merging and renaming is that the set of flags updated by inc/dec doesn't include C, which is updated by most other arithmetic instructions.

I'm not quite following how jg(e) would be more likely to cause a merging uop though: it takes both flag inputs from the SPAZO cluster, unlike say jna which takes one from each cluster, so this case should be easier (or at least no harder). If the jna case doesn't cause a merging uop as my test indicates, it seems certain that jge won't.

In any case, I tested all of jl, jle, jg and jge with the same type of loop as in my first post (both with and without a nop between the inc and the branch to disable/enable macro-fusion), and the results were as expected: 3 uops dispatched for the not-fused case and 2 uops dispatched for the fused case, so there is no evidence of a merging uop, at least in the dispatched performance counters.

There is one kind-of anomaly in all this though: the loops don't perform as fast as I'd expect. With only 3 uops you'd expect the add/inc/jcc loops (without macro-fusion) to execute in 1.0 cycles per iteration, but it is consistently 1.25 cycles. My assumption is that this is because the add or inc occasionally steals port 6, which the jcc needs, delaying it by a cycle. With macro-fusion, we get to 1.0 cycles. In the case of disabling macro-fusion with a nop (rather than choosing something like jo that doesn't fuse), I get 1.12 cycles. I don't think this is related to merging uops though, because the behavior is the same for cases like inc; add that shouldn't need to merge.

   
Merging uops on Skylake
Author: Agner Date: 2018-05-23 12:44
The bit scan instructions BSF and BSR modify the zero flag but leave all other flags unchanged. You may try JLE or CMOVLE after these instructions.
   
Merging uops on Skylake
Author: Travis Downs Date: 2020-01-26 13:41
I don't find any merging uop for cmovbe. Rather I find that cmovbe seems to take 2 uops in *any* scenario, regardless of whether the flags were manipulated in a partial or full way beforehand.

uops.info also reports cmovbe as two uops.

Basically, all the conditional moves that use *both* the carry flag and any of the SPAZO flags take 2 uops (be, nbe).

I think what happened in Haswell or Skylake is that uops got widespread support for 3 inputs. Hence, merging uops are not necessary in most cases, because the operation can just take separate inputs for the register holding the separately renamed C and the one holding the SPAZO group. This works for all (non-macro-fused) jcc instructions because they have no other register inputs. It even works for most cmov instructions, which have 2 register inputs, as long as only *either* C or a SPAZO flag is needed. It fails for cmovbe, because that would need 2 flag inputs, hence 4 inputs. So those instructions take 2 uops, but it seems that this is "always" the case, not dependent on the state of the flags.
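The input-counting hypothesis above can be written down as a short Python sketch. Everything in it is an assumption for illustration rather than documented hardware behavior: the 3-input limit, the treatment of C and SPAZO as two separately renamed flag registers, and the names MAX_INPUTS, flag_inputs and estimated_uops.

```python
MAX_INPUTS = 3  # assumed per-uop input limit from the hypothesis

# Assumed renaming groups: C on its own, the other five flags together.
FLAG_GROUP = {"C": "C", "S": "SPAZO", "P": "SPAZO",
              "A": "SPAZO", "Z": "SPAZO", "O": "SPAZO"}

# Flags read by a few condition codes (architectural definitions).
CONDITION_FLAGS = {
    "b": "C", "e": "Z", "l": "SO", "le": "ZSO",
    "be": "CZ", "nbe": "CZ",
}

REGISTER_INPUTS = {"jcc": 0, "cmov": 2}  # cmov reads both of its operands


def flag_inputs(cc):
    """Number of separately renamed flag registers a condition has to read."""
    return len({FLAG_GROUP[f] for f in CONDITION_FLAGS[cc]})


def estimated_uops(kind, cc):
    """1 uop if all inputs fit within MAX_INPUTS, otherwise 2."""
    total = REGISTER_INPUTS[kind] + flag_inputs(cc)
    return 1 if total <= MAX_INPUTS else 2


for kind, cc in (("jcc", "be"), ("cmov", "b"), ("cmov", "le"),
                 ("cmov", "be"), ("cmov", "nbe")):
    print(kind, cc, "->", estimated_uops(kind, cc), "uop(s)")
```

Under these assumptions only cmovbe and cmovnbe exceed the limit, with 2 register inputs plus 2 flag inputs, which matches the observation that they always take 2 uops regardless of how the flags were produced.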

This may also explain some of the macro-fusion behavior. inc and dec couldn't fuse with any jump that needs the C flag, because that would create a uop with 4 inputs, if you assume the immediate 1 counts.

So in short, I still can't find any evidence of merging uops on Skylake.