Agner's CPU blog


Unlamination of micro-fused ops in SKL and earlier

Author: Travis

Date: 2016-09-09 19:36

There is an interesting effect that changed in Skylake (or at least in some architecture after Sandy Bridge, up to and including Skylake) but isn't covered in your manual. It concerns the behavior of micro-fused instructions with *complex* memory source or destination operands. Here "complex" means an addressing mode that uses both a base and an index register, for example:

add rax, [rbx + rcx]

In Sandy Bridge, this doesn't seem to micro-fuse in the same way as simpler addressing modes such as:

add rax, [rbx + 16]

While the complex addressing modes do seem to fuse in the uop cache, the constituent uops are later "unlaminated" and then consume rename and retirement resources separately. In particular, this means that you cannot achieve a throughput of 4 micro-fused uops/cycle with these addressing modes. The Intel optimization manual does touch on it briefly in section 2.3.2.4, "Micro-op Queue and the Loop Stream Detector (LSD)":

In particular, loads combined with computational operations and all stores, when used
with indexed addressing, are represented as a single micro-op in the decoder or Decoded ICache.
In the micro-op queue they are fragmented into two micro-ops through a process called un-lamination,
one does the load and the other does the operation. A typical example is the following "load plus operation"
instruction:
ADD RAX, [RBP+RSI]; rax := rax + LD( RBP+RSI )

The Intel section is a bit unclear because it doesn't make it very obvious that this only applies to indexed addressing modes, and that if you avoid indexed addressing you can potentially achieve higher throughput.
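To make the distinction concrete, here is a rough sketch (NASM syntax) of the instruction forms the quoted passage is talking about. The uop counts in the comments are only a reading of that passage and of the Sandy Bridge behavior described above, not new measurements. The label name and register choices are arbitrary.

```nasm
; Illustration only: assembles with `nasm -f elf64`, not meant to be called.
; The label and register choices are hypothetical.
section .text
addressing_examples:
    ; Non-indexed (base, or base + displacement): stays micro-fused,
    ; i.e. one fused-domain uop per instruction through rename/retirement.
    add  rax, [rbx]
    add  rax, [rbx + 16]
    mov  [rbx + 16], rax

    ; Indexed (base + index, with or without scale/displacement):
    ; per the quoted passage, a single uop in the decoders / Decoded ICache
    ; that is split into two in the micro-op queue on Sandy Bridge.
    add  rax, [rbx + rcx]
    add  rax, [rbx + rcx*8 + 16]
    mov  [rbx + rcx], rax
    ret
```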

This issue could be pretty critical for optimization of high-IPC loops, on a par with many similar issues covered in your doc. In particular, it means that jumping through a few hoops to use a simpler addressing mode could be worth it, beyond the latency benefits already documented in your guide (and beyond the ability to use the port 7 AGU for store-address calculation).
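One way of "jumping through hoops" is to replace the index register with a pointer that is incremented each iteration. Below is a minimal sketch, assuming NASM syntax and the x86-64 System V calling convention. The function names `sum_indexed` and `sum_pointer` are made up for this example, and whether the rewrite actually pays off depends on the rest of the loop and on the microarchitecture.

```nasm
; uint64_t sum_indexed(const uint64_t *p, uint64_t n);
; uint64_t sum_pointer(const uint64_t *p, uint64_t n);
; x86-64 System V: rdi = p, rsi = n, result in rax.
; Both function names are hypothetical.
global sum_indexed
global sum_pointer

section .text

; Indexed version: the load+add uses base + scaled index.
sum_indexed:
    xor  eax, eax               ; sum = 0
    test rsi, rsi
    jz   .done                  ; n == 0: nothing to add
    xor  ecx, ecx               ; i = 0
.top:
    add  rax, [rdi + rcx*8]     ; indexed addressing mode
    inc  rcx
    cmp  rcx, rsi
    jne  .top
.done:
    ret

; Pointer-increment version: same result, but the load+add
; uses a base register only.
sum_pointer:
    xor  eax, eax               ; sum = 0
    test rsi, rsi
    jz   .done                  ; n == 0: nothing to add
    lea  rdx, [rdi + rsi*8]     ; end = p + n (computed once, outside the loop)
.top:
    add  rax, [rdi]             ; non-indexed addressing mode
    add  rdi, 8                 ; advance the pointer by one qword
    cmp  rdi, rdx
    jne  .top
.done:
    ret
```

The pointer version costs one extra `lea` outside the loop and keeps the same instruction count inside it. When several arrays are walked in the same loop, each one needs its own pointer increment, so the extra uops have to be weighed against the ones saved.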

It might be nice to add it to your doc! There is an extensive investigation on this Stack Overflow question, which is what prompted me to post here. See in particular the answer from Peter Cordes, who shows the issue on Sandy Bridge. In another answer I have some tests that show the limitation is removed on Skylake, but we don't know exactly in which architecture it was removed. The Intel doc is mostly silent on that topic (unlamination is only discussed in the one SB-specific section I linked above). If you have some other machines at your disposal, I have some code here that makes it easy to test the behavior (on Linux).

Unlamination of micro-fused ops in SKL and earlier - Travis - 2016-09-09
