Agner's CPU blog

Instruction Throughput on Skylake

Author:

Date: 2016-04-26 13:50

Agner wrote:

It is possible that the decoders have a higher throughput, but then there must be a bottleneck somewhere else. This will be hard to verify.

I'm starting to understand this better. Using Likwid and defining some custom events, I've determined that Skylake can sustain execution and retirement of 5 or 6 µops per cycle. This is ignoring jump/cc "macro-fusion", which would presumably boost us up to 7 or 8. The bottleneck appears to be the "renamer", which can only "issue" 4 µops per cycle.
The question is "What constitutes a µop for this stage?"

In 2.3.3.1 of the Intel Optimization Guide, when discussing Sandy Bridge it says: "The Renamer is the bridge between the in-order part in Figure 2-5, and the dataflow world of the Scheduler. It moves up to four micro-ops every cycle from the micro-op queue to the out-of-order engine. Although the renamer can send up to 4 micro-ops (unfused, micro-fused, or macro-fused) per cycle, this is equivalent to the issue port can dispatch six micro-ops per cycle."

The grammar is atrocious, but I think it means that while the Renamer can only move 4 µops, these can be micro-fused µops that will be "unlaminated" to a load µop and an action µop. From what I can tell, Skylake can move 6 fused µops per cycle from the DSB to the IDQ, but can only "issue" 4 fused µops per cycle from the IDQ. But since the scheduler only handles unfused µops, this means that we can "dispatch" up to twice that many depending on fusion.
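To put numbers on the issue/dispatch distinction, here is a back-of-the-envelope sketch in C. It assumes, following the reading above, that the renamer issues at most 4 fused-domain µops per cycle and that every micro-fused µop becomes 2 unfused µops at the scheduler. The names RENAMER_WIDTH, unfused_uops and dispatch_per_cycle are made up for illustration, and the model ignores ports, latencies and macro-fusion.

```c
/* Assumption from the discussion above: at most 4 fused-domain uops
   are issued by the renamer per cycle. */
#define RENAMER_WIDTH 4

/* Unfused uops seen by the scheduler. Each micro-fused uop counts twice:
   one load uop plus one action uop. */
static int unfused_uops(int fused_uops, int micro_fused)
{
    return fused_uops + micro_fused;
}

/* Average unfused uops dispatched per cycle if the renamer is the only
   limit. fused_uops is the total per loop iteration, micro_fused is how
   many of those are micro-fused load+op pairs. */
static double dispatch_per_cycle(int fused_uops, int micro_fused)
{
    double cycles = (double)fused_uops / RENAMER_WIDTH;
    return unfused_uops(fused_uops, micro_fused) / cycles;
}
```

Under these assumptions, dispatch_per_cycle(4, 0) gives 4, dispatch_per_cycle(4, 2) gives 6 (the figure in the Intel text), and dispatch_per_cycle(4, 4) gives the "twice that many" upper bound of 8.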

The result of this is that while it is probably true to say that Skylake is "designed for a throughput of four instructions per clock cycle", instructions per clock cycle can be a poor metric to use when comparing fused and unfused instructions. Previously, I'd naively thought that once the instructions were decoded to the DSB, it didn't matter whether one expressed LOAD-OP as a single instruction or as a separate LOAD then OP.

But if one is being constrained by the Renamer, it turns out that it can make a big difference in total execution time. For example, I'm finding that in a tight loop, this (two combined load-adds):

#define ASM_ADD_ADD_INDEX(in, sum1, sum2, index) \
__asm volatile ("add 0x0(%[IN], %[INDEX]), %[SUM1]\n" \
"add 0x8(%[IN], %[INDEX]), %[SUM2]\n" \
"add $0x10, %[INDEX]\n" : \
[IN] "+&r" (in), \
[SUM1] "+&r" (sum1), \
[SUM2] "+&r" (sum2), \
[INDEX] "+&r" (index))

Is about 20% faster than this (two separate loads and adds):

#define ASM_LOAD_LOAD_INDEX(in, sum1, sum2, index, tmp) \
__asm volatile ("mov 0x0(%[IN], %[INDEX]), %[TMP]\n" \
"add %[TMP], %[SUM1]\n" \
"mov 0x8(%[IN], %[INDEX]), %[TMP]\n" \
"add %[TMP], %[SUM2]\n" \
"add $0x10, %[INDEX]\n" : \
[IN] "+&r" (in), \
[SUM1] "+&r" (sum1), \
[SUM2] "+&r" (sum2), \
[INDEX] "+&r" (index), \
[TMP] "=r" (tmp))

While the hybrid (one and one) is the same speed as the fast version:

#define ASM_LOAD_ADD_INDEX(in, sum1, sum2, index, tmp) \
__asm volatile ("mov 0x0(%[IN], %[INDEX]), %[TMP]\n" \
"add %[TMP], %[SUM1]\n" \
"add 0x8(%[IN], %[INDEX]), %[SUM2]\n" \
"add $0x10, %[INDEX]\n" : \
[IN] "+&r" (in), \
[SUM1] "+&r" (sum1), \
[SUM2] "+&r" (sum2), \
[INDEX] "+&r" (index), \
[TMP] "=r" (tmp))
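As a rough illustration of why the three variants look different to the renamer, the sketch below counts fused-domain µops for the instructions shown in each macro. It assumes that "add mem, reg" stays a single micro-fused µop on Skylake even with the indexed addressing mode, and that the renamer issues 4 fused µops per cycle. The struct and function names are hypothetical. Loop-control instructions are not shown in the macros and are left out, so the numbers are illustrative only and are not a prediction of the measured timings.

```c
/* Instruction mix of one loop body (hypothetical struct for this sketch). */
struct loop_body {
    const char *name;
    int load_op; /* "add mem, reg": assumed 1 micro-fused uop each */
    int plain;   /* mov loads, reg-reg adds, add-immediate: 1 uop each */
};

/* Counts taken from the instructions listed in the three macros above. */
static const struct loop_body variants[] = {
    { "ASM_ADD_ADD_INDEX",   2, 1 }, /* add-mem, add-mem, add-imm     */
    { "ASM_LOAD_LOAD_INDEX", 0, 5 }, /* mov, add, mov, add, add-imm   */
    { "ASM_LOAD_ADD_INDEX",  1, 3 }, /* mov, add, add-mem, add-imm    */
};

/* Fused-domain uops the renamer has to issue per iteration. */
static int fused_uops(const struct loop_body *b)
{
    return b->load_op + b->plain;
}

/* Cycles per iteration if a 4-wide renamer were the only limit. */
static double issue_cycles(const struct loop_body *b)
{
    return fused_uops(b) / 4.0;
}
```

With these counts the three bodies need 3, 5 and 4 fused µops, or 0.75, 1.25 and 1.0 issue cycles per iteration. All three perform the same two loads and two adds, so only the fused-domain count differs.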

What I don't understand yet is why all variations that directly increment %[IN] are almost twice as slow as the versions that use and increment %[INDEX]:

#define ASM_ADD_ADD_DIRECT(in, sum1, sum2) \
__asm volatile ("add 0x0(%[IN]), %[SUM1]\n" \
"add 0x8(%[IN]), %[SUM2]\n" \
"add $0x10, %[IN]\n" : \
[IN] "+&r" (in), \
[SUM1] "+&r" (sum1), \
[SUM2] "+&r" (sum2))

I also don't understand yet why I get 30% faster speeds for loops small enough to fit in the LSD than for loops unrolled so far that the number of µops requires the DSB. Apparently the Loop Stream Detector still plays a performance role in some cases.
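For anyone who wants to try this, here is a minimal sketch of how the indexed load-add body might sit inside a loop. The function sum_pairs, its loop condition and the byte-count parameter are assumptions of this sketch, not the harness used for the measurements above. The "memory" and "cc" clobbers are also additions here, so that the compiler knows the asm reads the array. Timing code is left out, and there is a plain C fallback for non-x86-64 builds.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical harness: adds the 64-bit words at even positions into *out1
   and those at odd positions into *out2. bytes must be a multiple of 16. */
static void sum_pairs(const uint64_t *in, size_t bytes,
                      uint64_t *out1, uint64_t *out2)
{
    uint64_t sum1 = 0, sum2 = 0;
    size_t index = 0; /* byte offset into the array */

    while (index < bytes) {
#if defined(__GNUC__) && defined(__x86_64__)
        /* Same three instructions as ASM_ADD_ADD_INDEX; the clobber list
           is an addition of this sketch. */
        __asm__ volatile ("add 0x0(%[IN], %[INDEX]), %[SUM1]\n"
                          "add 0x8(%[IN], %[INDEX]), %[SUM2]\n"
                          "add $0x10, %[INDEX]\n"
                          : [IN] "+&r" (in),
                            [SUM1] "+&r" (sum1),
                            [SUM2] "+&r" (sum2),
                            [INDEX] "+&r" (index)
                          :
                          : "memory", "cc");
#else
        /* Portable fallback with the same effect. */
        sum1 += in[index / 8];
        sum2 += in[index / 8 + 1];
        index += 16;
#endif
    }
    *out1 = sum1;
    *out2 = sum2;
}
```

Calling sum_pairs on an array holding 1 through 8 with bytes set to 64 should leave 16 in the first sum and 20 in the second. The other macro bodies could be swapped in the same way, with a temporary added for the versions that need one.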
