Vector Class Discussion

permute8 performance loss in VC++ (VS 2017)
Author: Neil  Date: 2019-10-15 22:38
Hi,

I think the VC++ optimiser may have a problem with permute8 when generating the _mm256_permutevar8x32_epi32 path.

template <int i0, int i1, int i2, int i3, int i4, int i5, int i6, int i7>
Vec8i permute8(Vec8i const a);

I was converting some AVX2 intrinsic code to use VCL and noticed a performance loss, so I investigated the assembly output (VS 2017, release build, x64) and found that the code generated for permute8 is not optimal on the _mm256_permutevar8x32_epi32 path because of the way it creates the permmask.

This is my calling code:


const Vec32uc src = ...;
pair1_i = permute8<0, 4, V_DC, V_DC, V_DC, V_DC, V_DC, V_DC>(Vec8i(src));
pair2_i = permute8<1, 5, V_DC, V_DC, V_DC, V_DC, V_DC, V_DC>(Vec8i(src));
pair3_i = permute8<2, 6, V_DC, V_DC, V_DC, V_DC, V_DC, V_DC>(Vec8i(src));
pair4_i = permute8<3, 7, V_DC, V_DC, V_DC, V_DC, V_DC, V_DC>(Vec8i(src));
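
For context, this is roughly what the pre-VCL intrinsic version looked like (an illustrative reconstruction, not my exact original code; the helper name is just for this sketch):


#include <immintrin.h>

// Each pair is extracted with one vpermd; the index vectors are loaded as
// ready-made constants, so there is no per-call mask construction.
static inline void extract_pairs(__m256i src,
                                 __m256i& pair1, __m256i& pair2,
                                 __m256i& pair3, __m256i& pair4)
{
    const __m256i m1 = _mm256_setr_epi32(0, 4, 0, 0, 0, 0, 0, 0);
    const __m256i m2 = _mm256_setr_epi32(1, 5, 0, 0, 0, 0, 0, 0);
    const __m256i m3 = _mm256_setr_epi32(2, 6, 0, 0, 0, 0, 0, 0);
    const __m256i m4 = _mm256_setr_epi32(3, 7, 0, 0, 0, 0, 0, 0);
    pair1 = _mm256_permutevar8x32_epi32(src, m1);
    pair2 = _mm256_permutevar8x32_epi32(src, m2);
    pair3 = _mm256_permutevar8x32_epi32(src, m3);
    pair4 = _mm256_permutevar8x32_epi32(src, m4);
}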

The constant8ui<...>() call that permute8 uses to create the permmask makes the VC++ optimiser produce something like this:


mov DWORD PTR u$765[rbp], r8d
mov QWORD PTR u$765[rbp+4], 4
mov QWORD PTR u$765[rbp+12], 0
mov QWORD PTR u$765[rbp+20], 0
mov DWORD PTR u$765[rbp+28], r8d

vmovdqu ymm0, YMMWORD PTR u$765[rbp]
vpermd ymm1, ymm0, ymm2

So when I call permute8 four times in a row, each one has a group of movs before the vpermd call.
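
As far as I can tell, the problem is how the constant gets assembled. A union-style construction along these lines (a simplified sketch, with a hypothetical name, not the exact library code) is the kind of thing MSVC compiles into the scalar stores above instead of a single load of a ready-made constant:


#include <immintrin.h>
#include <cstdint>

// Simplified sketch of a union-based constant builder. Even with constant
// template arguments, MSVC writes the eight elements to the stack one by
// one and then loads the ymm register from there, instead of emitting a
// single constant in .rdata.
template <uint32_t i0, uint32_t i1, uint32_t i2, uint32_t i3,
          uint32_t i4, uint32_t i5, uint32_t i6, uint32_t i7>
static inline __m256i make_const8ui() {
    union {
        uint32_t u[8];
        __m256i  ymm;
    } c = { { i0, i1, i2, i3, i4, i5, i6, i7 } };
    return c.ymm;
}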

I commented out the constant8ui call and replaced it with this:


const __m256i permmask = _mm256_set_epi32(i7 & 7, i6 & 7, i5 & 7, i4 & 7, i3 & 7, i2 & 7, i1 & 7, i0 & 7);

That produces assembly like this:


vmovdqu ymm0, YMMWORD PTR __ymm@0000000000000000000000000000000000000000000000000000000400000000
vpermd ymm0, ymm0, ymm3

And I get much better performance, equal to my original intrinsic code.

So it might be worth introducing an #ifdef for _MSC_VER.
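
Something along these lines is what I have in mind (a rough sketch only, with a hypothetical helper name; in the library the #else branch would simply keep the existing constant8ui<...>() path):


#include <immintrin.h>

// Rough sketch of an MSVC-specific mask builder; template parameters
// i0..i7 as in permute8.
template <int i0, int i1, int i2, int i3, int i4, int i5, int i6, int i7>
static inline __m256i make_permmask() {
#if defined(_MSC_VER) && !defined(__clang__)
    // MSVC folds a constant _mm256_set_epi32 into one ymm load from .rdata.
    // Note _mm256_set_epi32 takes the highest element first.
    return _mm256_set_epi32(i7 & 7, i6 & 7, i5 & 7, i4 & 7,
                            i3 & 7, i2 & 7, i1 & 7, i0 & 7);
#else
    // Other compilers: the same constant with lowest element first
    // (in the library this would stay the constant8ui<...>() call).
    return _mm256_setr_epi32(i0 & 7, i1 & 7, i2 & 7, i3 & 7,
                             i4 & 7, i5 & 7, i6 & 7, i7 & 7);
#endif
}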

Thanks,
Neil

 
thread  permute8 performance loss in VC++ (VS 2017) - Neil - 2019-10-15
last reply  permute8 performance loss in VC++ (VS 2017) - Agner - 2019-10-15