Vector Class Discussion

permute8 performance loss in VC++ (VS 2017)
Author: Neil  Date: 2019-10-15 22:38
Hi,

I think the VC++ optimiser may have a problem with permute8 when generating the _mm256_permutevar8x32_epi32 path.

template <int i0, int i1, int i2, int i3, int i4, int i5, int i6, int i7>
Vec8i permute8(Vec8i const a);

I was converting some AVX2 intrinsic code to use VCL and noticed a performance loss, so I investigated the assembly output (VS 2017, release build, x64) and found that the code generated for permute8 is not optimal on the _mm256_permutevar8x32_epi32 path because of the way it creates the permmask.

This is my calling code:


const Vec32uc src = ...;
pair1_i = permute8<0, 4, V_DC, V_DC, V_DC, V_DC, V_DC, V_DC>(Vec8i(src));
pair2_i = permute8<1, 5, V_DC, V_DC, V_DC, V_DC, V_DC, V_DC>(Vec8i(src));
pair3_i = permute8<2, 6, V_DC, V_DC, V_DC, V_DC, V_DC, V_DC>(Vec8i(src));
pair4_i = permute8<3, 7, V_DC, V_DC, V_DC, V_DC, V_DC, V_DC>(Vec8i(src));
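
For context, this is roughly what the pre-VCL intrinsic version looked like (an illustrative reconstruction, not my exact original code; the helper name is just for this sketch):


#include <immintrin.h>

// Each pair is extracted with one vpermd; the index vectors are loaded as
// ready-made constants, so there is no per-call mask construction.
static inline void extract_pairs(__m256i src,
                                 __m256i& pair1, __m256i& pair2,
                                 __m256i& pair3, __m256i& pair4)
{
    const __m256i m1 = _mm256_setr_epi32(0, 4, 0, 0, 0, 0, 0, 0);
    const __m256i m2 = _mm256_setr_epi32(1, 5, 0, 0, 0, 0, 0, 0);
    const __m256i m3 = _mm256_setr_epi32(2, 6, 0, 0, 0, 0, 0, 0);
    const __m256i m4 = _mm256_setr_epi32(3, 7, 0, 0, 0, 0, 0, 0);
    pair1 = _mm256_permutevar8x32_epi32(src, m1);
    pair2 = _mm256_permutevar8x32_epi32(src, m2);
    pair3 = _mm256_permutevar8x32_epi32(src, m3);
    pair4 = _mm256_permutevar8x32_epi32(src, m4);
}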

The constant8ui<...>() call that permute8 uses to create the permmask makes the VC++ optimiser produce something like this:


mov DWORD PTR u$765[rbp], r8d
mov QWORD PTR u$765[rbp+4], 4
mov QWORD PTR u$765[rbp+12], 0
mov QWORD PTR u$765[rbp+20], 0
mov DWORD PTR u$765[rbp+28], r8d

vmovdqu ymm0, YMMWORD PTR u$765[rbp]
vpermd ymm1, ymm0, ymm2

So when I call permute8 four times in a row, each one has a group of movs before the vpermd call.
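
As far as I can tell, the problem is how the constant gets assembled. A union-style construction along these lines (a simplified sketch, with a hypothetical name, not the exact library code) is the kind of thing MSVC compiles into the scalar stores above instead of a single load of a ready-made constant:


#include <immintrin.h>
#include <cstdint>

// Simplified sketch of a union-based constant builder. Even with constant
// template arguments, MSVC writes the eight elements to the stack one by
// one and then loads the ymm register from there, instead of emitting a
// single constant in .rdata.
template <uint32_t i0, uint32_t i1, uint32_t i2, uint32_t i3,
          uint32_t i4, uint32_t i5, uint32_t i6, uint32_t i7>
static inline __m256i make_const8ui() {
    union {
        uint32_t u[8];
        __m256i  ymm;
    } c = { { i0, i1, i2, i3, i4, i5, i6, i7 } };
    return c.ymm;
}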

I commented out the constant8ui call and replaced it with this:


const __m256i permmask = _mm256_set_epi32(i7 & 7, i6 & 7, i5 & 7, i4 & 7, i3 & 7, i2 & 7, i1 & 7, i0 & 7);

That produces assembly like this:


vmovdqu ymm0, YMMWORD PTR __ymm@0000000000000000000000000000000000000000000000000000000400000000
vpermd ymm0, ymm0, ymm3

And I get much better performance, equal to my original intrinsic code.

So it might be worth introducing an #ifdef for _MSC_VER.
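
Something along these lines is what I have in mind (a rough sketch only, with a hypothetical helper name; in the library the #else branch would simply keep the existing constant8ui<...>() path):


#include <immintrin.h>

// Rough sketch of an MSVC-specific mask builder; template parameters
// i0..i7 as in permute8.
template <int i0, int i1, int i2, int i3, int i4, int i5, int i6, int i7>
static inline __m256i make_permmask() {
#if defined(_MSC_VER) && !defined(__clang__)
    // MSVC folds a constant _mm256_set_epi32 into one ymm load from .rdata.
    // Note _mm256_set_epi32 takes the highest element first.
    return _mm256_set_epi32(i7 & 7, i6 & 7, i5 & 7, i4 & 7,
                            i3 & 7, i2 & 7, i1 & 7, i0 & 7);
#else
    // Other compilers: the same constant with lowest element first
    // (in the library this would stay the constant8ui<...>() call).
    return _mm256_setr_epi32(i0 & 7, i1 & 7, i2 & 7, i3 & 7,
                             i4 & 7, i5 & 7, i6 & 7, i7 & 7);
#endif
}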

Thanks,
Neil

 
thread  permute8 performance loss in VC++ (VS 2017) - Neil - 2019-10-15
last reply  permute8 performance loss in VC++ (VS 2017) - Agner - 2019-10-15