Vector Class Discussion

Intrinsics vs Assembly
Author: Nathan Kurz  Date: 2013-04-05 15:46
Hi Agner --

The library looks great and is very clear to read. I have a general question about the choice between intrinsics, assembly, and template libraries such as yours. With intrinsics, I'm having trouble convincing the compilers I've tried (gcc, clang, icc) to do things in the order I want, or to actually use the instructions I want. My particular case involves SIMD codecs for bit-packing integers. The general code looks like this:

vec4 = _mm_load_si128(in);
vec5 = _mm_load_si128(in); // second load to avoid data dependency

... (enough cycles for L1 latency)

vec0 = _mm_load_si128(in);
_mm_store_si128(out + 1, vec1);
vec2 = _mm_add_epi32(vec2, vec1);
vec3 = _mm_and_si128(vec3, mask);
vec4 = _mm_srli_epi32(vec4, 8);

vec1 = _mm_load_si128(in);
_mm_store_si128(out + 2, vec2);
vec3 = _mm_add_epi32(vec3, vec2);
vec4 = _mm_and_si128(vec4, mask);
vec5 = _mm_srli_epi32(vec5, 10);
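
To make the data flow concrete, here is a minimal self-contained sketch of what happens to a single register in the pattern above: load, shift, mask, add the previous vector, store. The function name unpack_step, the shift count of 8, and the 8-bit mask are assumptions chosen for illustration and are not taken from the real codec. The sketch deliberately leaves out the interleaving across vec0..vec5, which is the part the compilers rearrange.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: one un-interleaved step of the pattern.
   Shift count and mask width are illustrative assumptions. */
__m128i unpack_step(const __m128i *in, __m128i *out, __m128i prev)
{
    /* _mm_load_si128 and _mm_store_si128 require 16-byte alignment. */
    assert(((uintptr_t)in & 15) == 0);
    assert(((uintptr_t)out & 15) == 0);

    const __m128i mask = _mm_set1_epi32(0xFF);  /* keep low 8 bits per lane */

    __m128i v = _mm_load_si128(in);   /* load four packed 32-bit words */
    v = _mm_srli_epi32(v, 8);         /* shift the wanted field down */
    v = _mm_and_si128(v, mask);       /* mask off the neighbouring fields */
    v = _mm_add_epi32(v, prev);       /* add previous vector */
    _mm_store_si128(out, v);          /* write the decoded values */
    return v;                         /* becomes 'prev' for the next step */
}
```

In the real loop, several of these steps are in flight at once, each on its own register, so that every group of five instructions contains one load, one store, one add, one mask, and one shift.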


This arrangement fits the execution ports of Sandy Bridge quite well. But the compilers all want to "optimize" it to something like this:

movdqu (%rdi), %xmm11 // read 'in' once from memory
movdqa %xmm11, %xmm0 // wait until %xmm11 is loaded
movdqa %xmm11, %xmm1 // copy 'in' reg to reg
movdqa %xmm11, %xmm4 // copy 'in' reg to reg
paddd %xmm0, %xmm2 // add previous
pand %xmm12, %xmm3 // mask
psrld $8, %xmm4

movdqa %xmm11, %xmm5 // copy 'in' reg to reg
paddd %xmm1, %xmm3 // add previous
pand %xmm12, %xmm4 // mask
psrld $10, %xmm5
movdqu %xmm1, 400(%rsi)
movdqu %xmm2, 416(%rsi)

The loads from memory on ports 2 and 3 (P23) have been replaced by register-to-register copies on ports 0, 1, and 5 (P015), and the stores have all been moved to the end. The instructions have also been reordered so that we spend a few extra cycles waiting for data. Instead of spreading the work across P0, P1, P2, P4, and P5, we now have contention on the ports that the copies share with the arithmetic.

Does your approach make it easier to keep the desired ordering and get exactly the instructions I wrote? Have you found any compiler tricks that make this more likely? I haven't figured out how to do it without disabling all optimizations globally.

Thread: Intrinsics vs Assembly - Nathan Kurz - 2013-04-05
Last reply: Intrinsics vs Assembly - Agner - 2013-04-06