Agner wrote:
It is possible that the decoders have a higher
throughput, but then there must be a bottleneck
somewhere else. This will be hard to verify.
I'm starting to understand this better. Using Likwid and defining some custom events, I've determined that Skylake can sustain execution and retirement of 5 or 6 µops per cycle. This is ignoring jump/cc "macro-fusion", which would presumably boost us up to 7 or 8. The bottleneck appears to be the "renamer", which can only "issue" 4 µops per cycle.
The question is "What constitutes a µop for this stage?" In 2.3.3.1 of the Intel Optimization Guide, when discussing Sandy Bridge, it says: "The Renamer is the bridge between the in-order part in Figure 2-5, and the dataflow world of the Scheduler. It moves up to four micro-ops every cycle from the micro-op queue to the out-of-order engine. Although the renamer can send up to 4 micro-ops (unfused, micro-fused, or macro-fused) per cycle, this is equivalent to the issue port can dispatch six micro-ops per cycle."
The grammar is atrocious, but I think it means that while the Renamer can only move 4 µops, these can be micro-fused µops that will be "unlaminated" to a load µop and an action µop. From what I can tell, Skylake can move 6 fused µops per cycle from the DSB to the IDQ, but can only "issue" 4 fused µops per cycle from the IDQ. But since the scheduler only handles unfused µops, this means that we can "dispatch" up to twice that many, depending on fusion.
The result of this is that while it is probably true to say that Skylake is "designed for a throughput of four instructions per clock cycle", instructions per clock cycle can be a poor metric to use when comparing fused and unfused instructions.
Previously, I'd naively thought that once the instructions were decoded to the DSB, it didn't matter whether one expressed LOAD-OP as a single instruction or as a separate LOAD then OP. But if one is being constrained by the Renamer, it turns out that it can make a big difference in total execution time.
For example, I'm finding that in a tight loop, this (two combined load-adds):
#define ASM_ADD_ADD_INDEX(in, sum1, sum2, index) \
__asm volatile ("add 0x0(%[IN], %[INDEX]), %[SUM1]\n" \
"add 0x8(%[IN], %[INDEX]), %[SUM2]\n" \
"add $0x10, %[INDEX]\n" : \
[IN] "+&r" (in), \
[SUM1] "+&r" (sum1), \
[SUM2] "+&r" (sum2), \
[INDEX] "+&r" (index))
Is about 20% faster than this (two separate loads and adds):
#define ASM_LOAD_LOAD_INDEX(in, sum1, sum2, index, tmp) \
__asm volatile ("mov 0x0(%[IN], %[INDEX]), %[TMP]\n" \
"add %[TMP], %[SUM1]\n" \
"mov 0x8(%[IN], %[INDEX]), %[TMP]\n" \
"add %[TMP], %[SUM2]\n" \
"add $0x10, %[INDEX]\n" : \
[IN] "+&r" (in), \
[SUM1] "+&r" (sum1), \
[SUM2] "+&r" (sum2), \
[INDEX] "+&r" (index), \
[TMP] "=r" (tmp))
While the hybrid (one and one) is the same speed as the fast version:
#define ASM_LOAD_ADD_INDEX(in, sum1, sum2, index, tmp) \
__asm volatile ("mov 0x0(%[IN], %[INDEX]), %[TMP]\n" \
"add %[TMP], %[SUM1]\n" \
"add 0x8(%[IN], %[INDEX]), %[SUM2]\n" \
"add $0x10, %[INDEX]\n" : \
[IN] "+&r" (in), \
[SUM1] "+&r" (sum1), \
[SUM2] "+&r" (sum2), \
[INDEX] "+&r" (index), \
[TMP] "=r" (tmp))
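To make the fused versus unfused bookkeeping for the three variants above concrete, here is a rough tabulation sketch in C. It is only a model, not a measurement: the struct and function names are hypothetical, it assumes each load-add stays micro-fused (1 fused µop, 2 unfused) and every other instruction is a single µop, and it leaves out the loop counter and branch, which are not shown in the macros.

```c
#include <assert.h>
#include <stdio.h>

/* Illustrative model only; all names here are hypothetical. */
enum { ISSUE_WIDTH = 4 }; /* fused uops the renamer issues per cycle (per the post) */

struct loop_body {
    const char *name;
    int load_op; /* e.g. add mem, reg: ASSUMED 1 fused uop, 2 unfused */
    int simple;  /* plain load or register-only op: 1 fused, 1 unfused */
};

static int fused_uops(struct loop_body b)   { return b.load_op + b.simple; }
static int unfused_uops(struct loop_body b) { return 2 * b.load_op + b.simple; }

/* Fewest cycles the renamer needs to issue `iters` copies of the body. */
static int min_issue_cycles(struct loop_body b, int iters)
{
    return (fused_uops(b) * iters + ISSUE_WIDTH - 1) / ISSUE_WIDTH;
}

/* Instruction mix of each macro body (loop overhead excluded). */
static const struct loop_body ADD_ADD   = { "ASM_ADD_ADD_INDEX",   2, 1 };
static const struct loop_body LOAD_LOAD = { "ASM_LOAD_LOAD_INDEX", 0, 5 };
static const struct loop_body LOAD_ADD  = { "ASM_LOAD_ADD_INDEX",  1, 3 };

static void print_row(struct loop_body b)
{
    printf("%-20s fused=%d unfused=%d issue cycles per 4 iterations >= %d\n",
           b.name, fused_uops(b), unfused_uops(b), min_issue_cycles(b, 4));
}

static void print_model(void)
{
    /* All three bodies do the same work once unfused. */
    assert(unfused_uops(ADD_ADD) == unfused_uops(LOAD_LOAD));
    assert(unfused_uops(ADD_ADD) == unfused_uops(LOAD_ADD));
    print_row(ADD_ADD);
    print_row(LOAD_LOAD);
    print_row(LOAD_ADD);
}
```

Calling print_model() from a test program lists fused counts of 3, 5 and 4 against the same 5 unfused µops for each variant. Issue width alone would place the hybrid between the other two, which is not what was measured above (the hybrid ties the fast version), so the renamer limit is evidently not the whole story.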
What I don't understand yet is why all variations that directly increment %[IN] are almost twice as slow as the versions that use and increment %[INDEX]:
#define ASM_ADD_ADD_DIRECT(in, sum1, sum2) \
__asm volatile ("add 0x0(%[IN]), %[SUM1]\n" \
"add 0x8(%[IN]), %[SUM2]\n" \
"add $0x10, %[IN]\n" : \
[IN] "+&r" (in), \
[SUM1] "+&r" (sum1), \
[SUM2] "+&r" (sum2))
I also don't understand yet why loops small enough to fit in the LSD run 30% faster than the same loops unrolled to the point that the number of µops requires the DSB. Apparently the Loop Stream Detector still plays a performance role in some cases.