Agner's CPU blog


SHL/SHR r,cl latency is lower than throughput
Author: Peter Cordes Date: 2017-05-27 17:00

Your table lists variable-count SHL and SHR as 2c throughput, 2c latency. It appears that the 2c latency is only for flags. My results match yours for consecutive SHL instructions, but SHL is faster if surrounded by instructions that write all flags without reading them. (This is one case where ADD 1 is preferable to INC, because INC leaves CF unmodified.) In that case, it can achieve 1.5c throughput.

For SHL r,cl the latency from r to r, and from cl to r, is much less than 2c. (I measure more than 1c, but maybe only because of resource conflicts). I think only one of the three p06 uops is the actual shift that writes the dest reg (probably the same internally as SHLX/SHRX), while the other two are purely for flag-handling. We know it's 2c from input-flags -> output-flags, but I didn't measure the latency from r or cl to flags.

I think the instruction table should say: lat=1 tput=1.5 with a note saying "EFLAGS dependency limits throughput to 2c for consecutive shifts, and resource conflicts raise the average latency for the register operands". That's a lot to stick in a note, but 2c/2c does not reflect the performance in real use-cases very well at all. SHL r,cl is still a lot worse than SHLX, but not as bad as 2c/2c suggests.


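    mov   eax, 1000000000    ; loop counter (writing eax zero-extends into rax)
    mov   ecx, 3             ; shift count in cl
align 32
.loop:
    add   edx, 1             ; dependency chain through edx; ADD writes all flags
    add   edx, 1
    shl   edx, cl            ; variable-count shift in the middle of the chain
    add   edx, 1
    add   edx, 1

    sub   rax, 1             ; SUB, not DEC, so all flags are written
    jnz   .loop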

perf counters from an otherwise-idle i7-6700k, using ocperf.py:
     5,228,964,721  cycles:u                 #    3.841 GHz
     7,000,000,418  instructions:u           #    1.34  insn per cycle
     1,000,000,412  branches:u               #  734.565 M/sec
     8,000,128,015  uops_issued_any:u        # 5876.614 M/sec
     8,000,101,258  uops_executed_thread:u   # 5876.594 M/sec

Without the SHL, the loop of course runs at the expected 4c per iter. The SHL slows it down by 1.229 cycles, not 2. Haswell goes from 4c to 5.296c, so the slowdown is higher (~1.30 instead of ~1.23).
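As a quick check of that arithmetic, the raw counter values can be converted to per-iteration figures. This is only a sketch in Python, not part of the original measurement; the names `ITERS`, `BASELINE` and `per_iter` are made up for illustration, and the 4c baseline is the figure quoted above for the loop without SHL.

```python
# Per-iteration figures from the Skylake (i7-6700k) perf counters quoted above.
ITERS = 1_000_000_000           # loop count loaded into eax

counters = {
    "cycles": 5_228_964_721,
    "instructions": 7_000_000_418,
    "uops_issued_any": 8_000_128_015,
}

def per_iter(count, iters=ITERS):
    """Average number of events per loop iteration."""
    return count / iters

cycles = per_iter(counters["cycles"])
uops = per_iter(counters["uops_issued_any"])
BASELINE = 4.0                  # 4 dependent ADDs at 1c each (figure from the text)
shl_cost = cycles - BASELINE    # extra cycles attributable to the one SHL

print(f"cycles/iter: {cycles:.3f}")      # 5.229
print(f"uops/iter:   {uops:.3f}")        # 8.000
print(f"SHL cost:    {shl_cost:.3f}c")   # 1.229c
```

The 8 uops per iteration are consistent with SHL r,cl being 3 uops: 4 ADD + 3 for the SHL + 1 macro-fused SUB/JNZ.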

With 13 dependent ADD instructions and one SHL in the loop, Skylake goes from 13c to 14.35c, but Haswell goes from 13c to 14.19c. So it's very weird and inconsistent: Haswell sees lower SHL latency the less frequent the shifts are, while SKL does better when they're more frequent.
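To make the inconsistency easier to see, the four data points can be put side by side as the extra cycles per iteration attributable to the single SHL. A small Python sketch; the dictionary layout and names are mine, the numbers are the ones quoted in this post.

```python
# (uarch, number of dependent ADDs) -> (cycles/iter without SHL, cycles/iter with one SHL)
measured = {
    ("SKL", 4):  (4.0, 5.229),
    ("HSW", 4):  (4.0, 5.296),
    ("SKL", 13): (13.0, 14.35),
    ("HSW", 13): (13.0, 14.19),
}

# Extra cycles per iteration caused by inserting the SHL into the ADD chain.
extra = {key: with_shl - base for key, (base, with_shl) in measured.items()}

for (uarch, n_adds), cost in sorted(extra.items()):
    print(f"{uarch}, {n_adds:2d} ADDs + 1 SHL: +{cost:.2f}c")
```

On these numbers the added cost moves in opposite directions: it shrinks on HSW and grows on SKL as the shifts become less frequent.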

Results are fairly similar for SHL ecx, cl (so the shift-count input doesn't need to be ready early).

I was also able to hit 1.5c throughput for independent shifts with the same count by breaking SHL's flag input-dependency:


.loop:
    shl   r8d, cl            ; three independent shifts with the same count
    add   ebx, 1             ; xor edx,edx also works here
    shl   r9d, cl
    add   esi, 1
    shl   r10d, cl

    sub   eax, 1             ; not DEC
    jnz   .loop

     5,000,450,873  cycles:u                 #    3.898 GHz
     7,000,000,393  instructions:u           #    1.40  insn per cycle
     1,000,000,387  branches:u               #  779.520 M/sec
    12,000,132,094  uops_issued_any:u        # 9354.338 M/sec
    12,000,102,844  uops_executed_thread:u   # 9354.315 M/sec

Results are the same on HSW and SKL to within measurement error. 5c per iteration with 3 SHL instructions in the loop is 1.667c per SHL, bottlenecked on p06 throughput (including the loop branch, which has to run on p6): 3*3 + 1 = 10 p06 uops, which take at least 5 cycles to execute. The shifts alone account for 9 of those uops, i.e. 1.5c each; the rest is loop overhead.
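The port-pressure argument can be written out as a tiny calculation. This is a sketch under the assumptions stated in this post (3 p06 uops per SHL r,cl, loop branch on p6) plus one of my own: that the two ADDs are scheduled to other ALU ports and do not compete for p0/p6. The names are illustrative only.

```python
# Lower bound on cycles/iter from pressure on the two ports p0 and p6.
P06_PORTS = 2
SHL_CL_UOPS = 3         # SHL r,cl: 3 uops, all on p06 (per the discussion above)
LOOP_BRANCH_UOPS = 1    # macro-fused SUB/JNZ; a taken branch runs on p6

def p06_bound(n_shifts):
    """Minimum cycles per iteration if p06 is the only bottleneck.

    Assumes the ADDs in the loop run on other ports (my assumption).
    """
    p06_uops = n_shifts * SHL_CL_UOPS + LOOP_BRANCH_UOPS
    return p06_uops / P06_PORTS

bound = p06_bound(3)
print(bound)                        # 5.0 cycles per iteration
print(round(bound / 3, 3))          # 1.667c per SHL, loop overhead included
print(SHL_CL_UOPS / P06_PORTS)      # 1.5c per SHL with no loop overhead
```

The measured 5,000,450,873 cycles for 10^9 iterations sits right at this 5c bound.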

Be careful of uop-cache issues when testing: making the loop longer can make it bottleneck on the front-end, because it's too dense to fit in the uop cache. For example, adding another xor/shl pair makes a loop of 16 fused-domain uops, which works as expected on SKL: 6.5 cycles per iter to execute 13 p06 uops, even though they're coming from the legacy decoders. But HSW only manages 8c throughput, apparently bottlenecked on the front-end. Using long instructions like ADD rsi, 12345 (7 bytes), and putting redundant REP prefixes on the add and shift instructions, restores performance on HSW as soon as the loop fits in the uop cache and can issue from the LSD.
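For the longer loop, the uop counts and the implied front-end delivery rate work out as follows. Again only a Python sketch with made-up names, using the counts and timings quoted in this post.

```python
# Uop counts for the 3-SHL loop with one more XOR/SHL pair added.
N_SHL = 4
SHL_UOPS = 3                                # per SHL r,cl, all on p06
fused_uops = N_SHL * SHL_UOPS + 3 + 1       # + 3 ADD/XOR + 1 macro-fused SUB/JNZ
p06_uops = N_SHL * SHL_UOPS + 1             # shifts + taken loop branch

p06_bound = p06_uops / 2                    # two ports: p0 and p6

def uops_per_cycle(cycles_per_iter):
    """Fused-domain uops delivered per cycle at a measured loop speed."""
    return fused_uops / cycles_per_iter

print(fused_uops, p06_uops, p06_bound)      # 16 13 6.5
print(round(uops_per_cycle(6.5), 2))        # SKL: 2.46
print(round(uops_per_cycle(8.0), 2))        # HSW: 2.0
```

HSW's 8c corresponds to only 2 fused-domain uops per cycle, below the 4 per cycle the issue stage can sustain on these cores, which fits the front-end explanation rather than a p06 limit.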

 
reply SHL/SHR r,cl latency is lower than throughput - Peter Cordes - 2017-05-27