Vector Class Discussion

 
thread Intrinsics vs Assembly - Nathan Kurz - 2013-04-05
last reply Intrinsics vs Assembly - Agner - 2013-04-06
 
Intrinsics vs Assembly
Author: Nathan Kurz Date: 2013-04-05 15:46
Hi Agner --

The library looks great, and is very clear to read. I have a general question about the choice of intrinsics versus assembly versus template libraries such as yours. With intrinsics, I'm having trouble convincing the compilers I've tried (gcc, clang, icc) to do things in the order I want, or to actually use the instructions I want. My particular case involves codecs for bit-packing integers with SIMD. The general code looks like this:

vec4 = _mm_load_si128(in);
vec5 = _mm_load_si128(in); // second load to avoid data dependency

... (enough cycles for L1 latency)

vec0 = _mm_load_si128(in);
_mm_store_si128(out + 1, vec1);
vec2 = _mm_add_epi32(vec2, vec1);
vec3 = _mm_and_si128(vec3, mask);
vec4 = _mm_srli_epi32(vec4, 8);

vec1 = _mm_load_si128(in);
_mm_store_si128(out + 2, vec2);
vec3 = _mm_add_epi32(vec3, vec2);
vec4 = _mm_and_si128(vec4, mask);
vec5 = _mm_srli_epi32(vec5, 10);

...
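
(For context, the snippet assumes surrounding declarations roughly like the following; the names and the mask value here are placeholders rather than my exact code.)

#include <emmintrin.h>   // SSE2 intrinsics

const __m128i *in;       // packed input, 16-byte aligned
__m128i *out;            // unpacked output
__m128i mask;            // e.g. _mm_set1_epi32((1 << b) - 1) for b-bit fields
__m128i vec0, vec1, vec2, vec3, vec4, vec5;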

This arrangement fits the execution ports of Sandy Bridge quite well. But the compilers all want to "optimize" it to something like this:

movdqu (%rdi), %xmm11 // read 'in' once from memory
movdqa %xmm11, %xmm0 // wait until %xmm11 is loaded
movdqa %xmm11, %xmm1 // copy 'in' reg to reg
...
movdqa %xmm11, %xmm4 // copy 'in' reg to reg
paddd %xmm0, %xmm2 // add previous
pand %xmm12, %xmm3 // mask
psrld $8, %xmm4

movdqa %xmm11, %xmm5 // copy 'in' reg to reg
paddd %xmm1, %xmm3 // add previous
pand %xmm12, %xmm4 // mask
psrld $10, %xmm5
...
movdqu %xmm1, 400(%rsi)
movdqu %xmm2, 416(%rsi)

The moves from memory on P23 have been replaced by register copies on P015, and the stores have all been pushed to the end. The instructions have been reordered so that we spend a few extra cycles waiting for data. Instead of spreading the work across P0, P1, P2, P4, and P5, we now have contention on P015.

Does your approach make it easier to keep things in the desired order and to execute exactly the operations specified? Have you found any compiler tricks to make this more likely to happen? I haven't figured out how to do so without disabling all optimizations globally.
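
For reference, the closest I've gotten is an empty inline-asm barrier between the groups (gcc/clang syntax). This is only a sketch: it keeps the repeated loads as real loads, but it still doesn't control scheduling or port assignment:

#include <emmintrin.h>

// Compiler barrier: tells gcc/clang not to cache or reorder memory
// accesses across this point. It forces a fresh load, but it does
// not pin instructions to particular execution ports.
#define BARRIER() __asm__ __volatile__("" ::: "memory")

void unpack_two(const __m128i *in, __m128i *out, __m128i mask)
{
    __m128i vec1 = _mm_load_si128(in);
    BARRIER();                          // keep the second load as a real load
    __m128i vec4 = _mm_load_si128(in);  // re-load instead of copying a register
    vec4 = _mm_srli_epi32(vec4, 8);
    vec4 = _mm_and_si128(vec4, mask);
    _mm_store_si128(out, vec1);
    _mm_store_si128(out + 1, vec4);
}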

   
Intrinsics vs Assembly
Author: Agner Date: 2013-04-06 02:27
It's a matter of how the compiler optimizes. The compiler will often reuse a memory load to minimize cache contention. But you are right, there are situations where it is better to re-load the same value in order to reduce the load on other execution ports or to make dependency chains shorter. In many cases, the best compilers (Gnu, Intel) optimize better than a decent assembly programmer does. But there are also many cases where compilers do incredibly silly things. If you have a very critical hotspot and the compiler is not optimizing it well enough, then the only alternative is to use assembly.

The tradeoff between copying a previously loaded value and loading it again disappears with the AVX instruction set, where you have non-destructive three-operand instructions.
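
For example, taking your shift-by-10 case where the loaded value v is still needed afterwards, the compiler must emit a copy plus a shift with SSE2, but can use a single instruction with AVX (register numbers here are only for illustration):

__m128i t = _mm_srli_epi32(v, 10);  // v must survive the operation

// SSE2 encoding (destructive, two operands):
//     movdqa %xmm1, %xmm2          // copy v first, psrld overwrites its operand
//     psrld  $10, %xmm2
// AVX encoding (non-destructive, three operands):
//     vpsrld $10, %xmm1, %xmm2     // shift straight into a new register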