Your table lists variable-count SHL and SHR as 2c throughput, 2c latency. It appears that the 2c latency is only for flags: my results match yours for consecutive SHL instructions, but SHL is faster if surrounded by instructions that write all flags without reading them. (This is one case where ADD reg,1 is preferable to INC, since INC leaves CF unmodified.) In that case it can achieve 1.5c throughput. For SHL r,cl the latency from r to r, and from cl to r, is much less than 2c. (I measure more than 1c, but maybe only because of resource conflicts.)

I think only one of the three p06 uops is the actual shift that writes the destination register (probably the same internally as SHLX/SHRX), while the other two are purely for flag handling. We know it's 2c from input flags to output flags, but I didn't measure the latency from r or cl to flags.

I think the instruction table should say lat=1, tput=1.5, with a note saying "EFLAGS dependency limits throughput to 2c for consecutive shifts, and resource conflicts raise the average latency for the register operands". That's a lot to stick in a note, but 2c/2c does not reflect the performance in real use-cases well at all. It's still a lot worse than SHLX, but not that bad.
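For reference, a minimal sketch of the two forms being compared (NASM syntax); the semantics are from the ISA, not from new measurements:

```asm
; legacy form: 3 uops on SKL, apparently 2 of them just for flag handling
shl  edx, cl           ; count implicitly in CL, writes EFLAGS

; BMI2 form: single uop, no flag output, count can be in any register
shlx edx, edx, ecx     ; edx = edx << (ecx & 31), EFLAGS untouched
```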
mov eax, 1000000000
mov ecx, 3
align 32
.loop:
add edx,1
add edx,1
shl edx, cl
add edx,1
add edx,1
sub rax, 1
jnz .loop

Perf counters from an otherwise-idle i7-6700k, using ocperf.py:
5,228,964,721 cycles:u # 3.841 GHz
7,000,000,418 instructions:u # 1.34 insn per cycle
1,000,000,412 branches:u # 734.565 M/sec
8,000,128,015 uops_issued_any:u # 5876.614 M/sec
8,000,101,258 uops_executed_thread:u # 5876.594 M/sec
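As a sanity check, the counter totals above reduce to per-iteration numbers like this (assuming the 1e9-iteration loop dominates the run):

```python
iterations = 1_000_000_000

# totals from the perf-counter output above
cycles = 5_228_964_721
instructions = 7_000_000_418

print(round(cycles / iterations, 3))        # -> 5.229 cycles per iteration
print(round(instructions / iterations, 3))  # -> 7.0 instructions per iteration
```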
Without the SHL, the loop of course runs at the expected 4c per iteration; the SHL slows it down by 1.229 cycles, not 2. Haswell goes from 4c to 5.296c, so the slowdown is higher there (~1.30 instead of ~1.23). With 13 dependent ADD instructions and one SHL in the loop, Skylake goes from 13c to 14.35c, but Haswell goes from 13c to 14.19c. So it's very weird and inconsistent: Haswell sees lower SHL latency the more infrequent the shifts are, while SKL does better when they're more frequent.

Results are fairly similar for SHL ecx, cl, so the shift-count input doesn't need to be ready early. I was also able to hit 1.5c throughput for independent shifts with the same count by breaking SHL's flag input-dependency:
.loop:
shl r8d, cl
add ebx,1 ; xor edx,edx also works here
shl r9d, cl
add esi,1
shl r10d, cl
sub eax, 1 ; not DEC
jnz .loop

5,000,450,873 cycles:u # 3.898 GHz
7,000,000,393 instructions:u # 1.40 insn per cycle
1,000,000,387 branches:u # 779.520 M/sec
12,000,132,094 uops_issued_any:u # 9354.338 M/sec
12,000,102,844 uops_executed_thread:u # 9354.315 M/sec
Results are the same on HSW and SKL to within measurement error. 5c per iteration with 3 SHL instructions in the loop is 1.666c throughput, bottlenecked on p06 throughput (including the loop-branch, which has to run on p6): 3*3 + 1 = 10 p06 uops, which take at least 5 cycles to execute.

Be careful of uop-cache issues when testing: making the loop longer can create a situation where it bottlenecks on the front-end, because it's too dense to fit in the uop cache. e.g. adding another xor/shl pair makes a loop of 16 fused-domain uops, which works as expected on SKL: 6.5 cycles per iteration to execute 13 p06 uops, even though they're coming from the legacy decoders. But HSW only manages 8c throughput, apparently bottlenecked on the front-end. Using long instructions like ADD rsi, 12345 (7 bytes) and putting redundant REP prefixes on the add and shift instructions restores performance on HSW, as soon as the loop fits in the uop cache and can issue from the LSD.
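The p06-bottleneck arithmetic above works out as a simple lower bound. A sketch (port_bound_cycles is a made-up helper name; this assumes the uops schedule perfectly across the two shift-capable ports, p0 and p6):

```python
def port_bound_cycles(p06_uops, ports=2):
    """Lower bound on cycles/iteration when p0/p6 uops are the bottleneck,
    assuming perfect scheduling across the available ports."""
    return p06_uops / ports

# 3-shift loop: 3 SHL * 3 uops each + 1 macro-fused sub/jnz uop = 10 p06 uops
print(port_bound_cycles(3 * 3 + 1))   # -> 5.0 cycles/iter, i.e. ~1.666c per shift

# 4-shift loop: 4 * 3 + 1 = 13 p06 uops
print(port_bound_cycles(4 * 3 + 1))   # -> 6.5 cycles/iter, as measured on SKL
```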