Intel Floating Point Executing 3 to 4 Times Faster Than it Should. MAKES NO SENSE

News and research about CPU microarchitecture and software optimization
Elhardt
Posts: 2
Joined: 2021-01-20, 1:33:49

Intel Floating Point Executing 3 to 4 Times Faster Than it Should. MAKES NO SENSE

Post by Elhardt » 2021-10-04, 5:09:12

Hello Agner. I would sure like an explanation (if you have one) for the insanely fast floating point speeds I'm getting on Ivy Bridge and Haswell processors that seem to defy the laws of physics.

Last year I was benchmarking code I was writing to calculate gangs of sine waves for additive synthesis using SSE in assembly language. All I did was change one instruction at the start of the loop from something like:

movss xmm0,[esi]
movss xmm1,xmm0
to:
movss xmm0,[esi]
movss xmm1,[esi]

I wanted to get rid of the dependency. That caused the entire loop of about a dozen instructions to execute about 4 times faster. The only explanation I could think of was that Intel was so clever that they were using otherwise unused ALUs in the vector processor to run scalar code. But when I vectorized the code, it ran just as fast.

I then moved some old Pentium x87 floating point code for doing 3D transforms to my Ivy Bridge processor. The loop contains 9 multiplies, 9 adds, moves of data in and out of registers, loop overhead, and a few dependencies. It takes about 35 clock cycles on a Pentium but only about 12 on an Ivy Bridge (an average of 0.66 cycles per floating point instruction). Ivy Bridge multiply latencies are even longer than those on the Pentium (5 clocks vs. 3), and still it's faster. I had no idea Intel was working on speeding up the obsolete x87, and yet it's insanely fast.
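(For reference, the scalar and packed forms of the multiply are shown below; as far as I can tell they have the same latency and use the same multiplier, the packed one simply works on four single-precision lanes at once, which fits with the vectorized code running at the same speed.)

mulss xmm0,xmm1      ; scalar: multiplies only the low 32-bit lane
mulps xmm0,xmm1      ; packed: multiplies all four 32-bit lanes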

I just spent a day slowly deleting instructions from loops and trying different arrangements to try to understand what's going on. I encountered a number of things that didn't make much sense, but one big one really affects code speed. Here's an example:

This loop executes in 5 clock cycles (registers filled with 1.0 so as not to overflow):

loop:
mulss xmm0,xmm1
dec rcx
jnz loop

This loop executes in 1 clock cycle:

loop:
movss xmm0,[rsi]
mulss xmm0,xmm1
dec rcx
jnz loop

Pre-loading the destination register of the multiply makes this loop five times faster; loading the source register does not. There's no way that loop should be able to run faster than the 5-clock-cycle latency of the multiply instruction, and yet it does. This should be impossible. Even more bizarre is that I've added an additional dependency with the movss instruction. This seems to be one of the main reasons I'm getting the insane speeds I'm seeing. I would really like some insight into what might be going on, if you have any idea.
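(A per-iteration count of this kind boils down to a rdtsc pair around a long run of the loop, for example as sketched below; this is a simplified illustration with no serialization, and rdtsc counts reference cycles, so the core clock has to be fixed or the result scaled.)

mov rcx,1000000000
rdtsc                     ; timestamp in edx:eax
shl rdx,32
or rax,rdx
mov r8,rax                ; start time
timed_loop:
movss xmm0,[rsi]
mulss xmm0,xmm1
dec rcx
jnz timed_loop
rdtsc
shl rdx,32
or rax,rdx
sub rax,r8                ; elapsed counts; divide by the iteration count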

Thanks,
Elhardt

agner
Site Admin
Posts: 75
Joined: 2019-12-27, 18:56:25

Re: Intel Floating Point Executing 3 to 4 Times Faster Than it Should. MAKES NO SENSE

Post by agner » 2021-10-04, 10:51:08

I suggest that you ask this question at stackoverflow.com

...
Posts: 4
Joined: 2021-10-04, 11:30:57

Re: Intel Floating Point Executing 3 to 4 Times Faster Than it Should. MAKES NO SENSE

Post by ... » 2021-10-04, 11:35:55

Elhardt wrote:
2021-10-04, 5:09:12
Pre-loading the destination register of the multiply speeds up this loop by 5 times. Loading the source register does not speed it up. There's no way that loop should be able to run faster than the 5 clock cycle latency of the multiply instruction, and yet it does. This should be impossible. Even more bizarre is the fact that I've added an additional dependency with the movss instruction.
Quite the opposite - adding the movss breaks the dependency chain on xmm0 allowing multiple loop iterations to be executed in parallel. The first example, on the other hand, enforces a dependency chain on xmm0 across loop cycles, inhibiting parallel execution across loop iterations.
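A minimal sketch of the difference with explicit chains (register choices are arbitrary; the exact numbers depend on the core, but with a 5-cycle multiply latency one would expect roughly the figures in the comments):

; one chain: each mulss needs the result of the previous one,
; so the loop runs at about 5 cycles per iteration
chain1:
mulss xmm0,xmm7
dec rcx
jnz chain1

; four independent chains: multiplies from different chains overlap,
; giving roughly four multiplies every 5 cycles instead of one
chain4:
mulss xmm0,xmm7
mulss xmm1,xmm7
mulss xmm2,xmm7
mulss xmm3,xmm7
dec rcx
jnz chain4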

Elhardt
Posts: 2
Joined: 2021-01-20, 1:33:49

Re: Intel Floating Point Executing 3 to 4 Times Faster Than it Should. MAKES NO SENSE

Post by Elhardt » 2021-11-01, 4:18:56

... wrote:
2021-10-04, 11:35:55
Quite the opposite - adding the movss breaks the dependency chain on xmm0 allowing multiple loop iterations to be executed in parallel. The first example, on the other hand, enforces a dependency chain on xmm0 across loop cycles, inhibiting parallel execution across loop iterations.
I've been experimenting some more with this. I guess the reason I got it backwards is that I didn't think a movss into a register could happen until the multiply was finished using that same register. I believe what I'm experiencing is internal register renaming allowing loop iterations to run concurrently, but the processor is very particular about when it does this. Moving from another register instead of from memory doesn't get me the incredible speed gains, and multiplying directly from memory doesn't either. It specifically has to be a move from memory into a register before the multiply. One would think Intel would speed everything up using this method, but nope.

For those interested who might be reading this thread, I'll drop a few more complete examples below with cycle times to show the oddities. I also tried to force the full latency of the loop by moving the result of the multiply back out to memory, but that didn't slow it down. That means the stores must be gathering up and waiting around until their associated multiplies are completed. Very odd.

[ 5 cycles ]
loop2:
mulss xmm0,xmm1
movss [rdi],xmm0
dec rcx
jnz loop2

[ 5 cycles ]
loop2:
mulss xmm0,[rsi]
movss [rdi],xmm0
dec rcx
jnz loop2

[ 6 cycles ]
loop2:
movss xmm0,xmm2
mulss xmm0,xmm1
movss [rdi],xmm0
dec rcx
jnz loop2

[ 1.2 cycles ]
loop2:
movss xmm0,[rsi]
mulss xmm0,xmm1
movss [rdi],xmm0
dec rcx
jnz loop2

Below is another example of two loops that accomplish the same goal, but one takes 4 times longer.

[ 8.1 cycles ]
loop2:
movaps xmm0,xmm1
dpps xmm0,[rbp],0xFF
dec rcx
jnz loop2

[ 2.1 cycles ]
loop2:
movaps xmm0,[rbp]
dpps xmm0,xmm1,0xFF
dec rcx
jnz loop2

agner
Site Admin
Posts: 75
Joined: 2019-12-27, 18:56:25

Re: Intel Floating Point Executing 3 to 4 Times Faster Than it Should. MAKES NO SENSE

Post by agner » 2021-11-01, 12:09:04

The explanation is that mulss xmm0,xmm1 uses only 32 bits of a 128-bit register. The remaining 96 bits of xmm0 are left unchanged. This gives you a false dependence that delays the next iteration. Possible solutions: use the whole register (mulps xmm0, xmm1), clear the register, or reload it from memory between iterations.
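Illustrative versions of those options, taking the 6-cycle movss-from-register loop above as the starting point (operands are arbitrary and the store is omitted for brevity):

; 1. work on the whole register, so there are no stale upper bits to merge
loop2:
movaps xmm0,xmm2          ; full 128-bit copy, depends only on xmm2
mulps xmm0,xmm1
dec rcx
jnz loop2

; 2. clear the register before the partial write
loop2:
xorps xmm0,xmm0           ; zeroing idiom, independent of the old value of xmm0
movss xmm0,xmm2
mulss xmm0,xmm1
dec rcx
jnz loop2

; 3. reload from memory
loop2:
movss xmm0,[rsi]          ; the load form zeroes bits 32-127, so it also starts a fresh chain
mulss xmm0,xmm1
dec rcx
jnz loop2

In each case there is no longer a loop-carried dependence through xmm0, so the loop is limited by throughput rather than by the 5-cycle multiply latency.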

Please note that this blog is not a forum for programming help. Your discussion belongs on stackoverflow.com
