Vector Class Discussion

permute8 performance loss in VC++ (VS 2017)
Author: Neil Date: 2019-10-15 22:38

I think the VC++ optimiser may have a problem with permute8 when generating the _mm256_permutevar8x32_epi32 path.

template <int i0, int i1, int i2, int i3, int i4, int i5, int i6, int i7>
Vec8i permute8(Vec8i const a);

I was converting some AVX2 intrinsic code to use VCL and noticed a performance loss, so I investigated the assembly output (VS 2017, release build, x64). The generated code for permute8 is not optimal on the _mm256_permutevar8x32_epi32 path because of the way it creates permmask.

This is my calling code:

const Vec32uc src = ...;
pair1_i = permute8<0, 4, V_DC, V_DC, V_DC, V_DC, V_DC, V_DC>(Vec8i(src));
pair2_i = permute8<1, 5, V_DC, V_DC, V_DC, V_DC, V_DC, V_DC>(Vec8i(src));
pair3_i = permute8<2, 6, V_DC, V_DC, V_DC, V_DC, V_DC, V_DC>(Vec8i(src));
pair4_i = permute8<3, 7, V_DC, V_DC, V_DC, V_DC, V_DC, V_DC>(Vec8i(src));
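
For readers who have not used the template, here is a minimal scalar model of what these calls compute. It is not VCL code: permute8_ref and kDontCare are hypothetical names, kDontCare stands in for V_DC, and its value of -256 is an assumption. Don't-care lanes are simply zeroed here, whereas the library is free to leave anything in them.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Assumed value standing in for VCL's V_DC ("don't care") index.
constexpr int kDontCare = -256;

// Scalar model of an 8-lane permute: lane k of the result is lane idx[k] of a.
// Don't-care lanes are zeroed here; a real implementation may leave any value.
inline std::array<std::int32_t, 8> permute8_ref(const std::array<std::int32_t, 8>& a,
                                                const std::array<int, 8>& idx) {
    std::array<std::int32_t, 8> r{};
    for (std::size_t k = 0; k < 8; ++k) {
        if (idx[k] != kDontCare) {
            r[k] = a[static_cast<std::size_t>(idx[k] & 7)];
        }
    }
    return r;
}
```

In this model the first call above (indexes 0 and 4) gathers source lanes 0 and 4 into result lanes 0 and 1, the second gathers lanes 1 and 5, and so on.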

The constant8ui<...>() call that permute8 uses to create permmask causes the VC++ optimiser to produce something like this:

mov DWORD PTR u$765[rbp], r8d
mov QWORD PTR u$765[rbp+4], 4
mov QWORD PTR u$765[rbp+12], 0
mov QWORD PTR u$765[rbp+20], 0
mov DWORD PTR u$765[rbp+28], r8d

vmovdqu ymm0, YMMWORD PTR u$765[rbp]
vpermd ymm1, ymm0, ymm2

So when I call permute8 four times in a row, each call gets its own group of movs before the vpermd instruction.

I commented out the constant8ui call and replaced it with this:

const __m256i permmask = _mm256_set_epi32(i7 & 7, i6 & 7, i5 & 7, i4 & 7, i3 & 7, i2 & 7, i1 & 7, i0 & 7);

That produces assembly like this:

vmovdqu ymm0, YMMWORD PTR __ymm@0000000000000000000000000000000000000000000000000000000400000000
vpermd ymm0, ymm0, ymm3

And I get much better performance, equal to my original intrinsic code.
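
The constant in the listing above can be cross-checked by hand. The symbol name appears to spell out the 256-bit value in hex, most significant digit first, so its last 16 digits should be lanes 1 and 0 of the mask. The following is a small portable sketch of that check, assuming V_DC has its low three bits clear (the -256 used here is an assumption); make_permmask, low_qword and kDontCare are hypothetical names, not part of VCL.

```cpp
#include <array>
#include <cstdint>

// Lane k of the mask is (index & 7), the same masking as in the
// _mm256_set_epi32 replacement.
template <int i0, int i1, int i2, int i3, int i4, int i5, int i6, int i7>
constexpr std::array<std::uint32_t, 8> make_permmask() {
    return {{static_cast<std::uint32_t>(i0 & 7), static_cast<std::uint32_t>(i1 & 7),
             static_cast<std::uint32_t>(i2 & 7), static_cast<std::uint32_t>(i3 & 7),
             static_cast<std::uint32_t>(i4 & 7), static_cast<std::uint32_t>(i5 & 7),
             static_cast<std::uint32_t>(i6 & 7), static_cast<std::uint32_t>(i7 & 7)}};
}

// Lowest 64 bits of the 256-bit constant: lane 1 in the upper half, lane 0 in the lower half.
constexpr std::uint64_t low_qword(const std::array<std::uint32_t, 8>& m) {
    return (static_cast<std::uint64_t>(m[1]) << 32) | m[0];
}

constexpr int kDontCare = -256;  // assumed value standing in for V_DC

// The mask for <0, 4, V_DC, ...> matches the tail of the constant in the listing.
static_assert(low_qword(make_permmask<0, 4, kDontCare, kDontCare, kDontCare,
                                      kDontCare, kDontCare, kDontCare>()) ==
                  0x0000000400000000ULL,
              "low 64 bits should hold lane 1 = 4 and lane 0 = 0");
```

Under these assumptions every don't-care lane masks to zero, which would explain why the rest of the constant is all zeros.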

So it might be worth introducing an ifdef for _MSC_VER.
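
A rough sketch of what such a guard could look like, under stated assumptions: make_permmask is a hypothetical helper rather than VCL's actual code, the non-MSVC branch is only a stand-in for the existing constant8ui path, and excluding clang-cl via __clang__ is a choice made here, not something tested in this thread. The whole block is wrapped in an __AVX2__ check so that it compiles only when AVX2 code generation is enabled.

```cpp
#if defined(__AVX2__)  // compile the sketch only when AVX2 code generation is enabled
#include <immintrin.h>

// Hypothetical helper, not VCL's actual code: builds the vpermd index vector.
template <int i0, int i1, int i2, int i3, int i4, int i5, int i6, int i7>
static inline __m256i make_permmask() {
#if defined(_MSC_VER) && !defined(__clang__)
    // Form reported above as faster with VS 2017.
    // _mm256_set_epi32 takes its arguments from the highest lane down to the lowest.
    return _mm256_set_epi32(i7 & 7, i6 & 7, i5 & 7, i4 & 7, i3 & 7, i2 & 7, i1 & 7, i0 & 7);
#else
    // Stand-in for the library's existing constant8ui path on other compilers.
    static const int lanes[8] = {i0 & 7, i1 & 7, i2 & 7, i3 & 7, i4 & 7, i5 & 7, i6 & 7, i7 & 7};
    return _mm256_loadu_si256(reinterpret_cast<const __m256i*>(lanes));
#endif
}
#endif  // __AVX2__
```

Both branches are meant to produce the same index vector; only the way the constant is expressed to the compiler differs.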


permute8 performance loss in VC++ (VS 2017)
Author: Agner Date: 2019-10-15 23:54
The Microsoft compiler is not as good at optimizing as Gcc and Clang. MS have promised that there will soon be a Clang plugin for MS Visual Studio with full integration into the IDE. This will be useful.

A replacement of constant8ui by _mm256_set_epi32 would not hurt the other compilers, so I might do this in the next update.