VZEROUPPER issue with Zen4 in 32-bit mode?

News and research about CPU microarchitecture and software optimization
jkivilin
Posts: 3
Joined: 2020-08-09, 19:16:49

VZEROUPPER issue with Zen4 in 32-bit mode?

Post by jkivilin » 2023-06-03, 16:10:59

I've run into interesting behaviour on Zen4 (7900X) with mixed VEX and non-VEX code. It seems that, in 32-bit mode, VZEROUPPER does not fully clear some internal state for the YMM/XMM registers, which causes non-VEX code to run at reduced speed after YMM usage.

I'm running the following code in both 32-bit and 64-bit mode. The outer loop (1k iterations) performs one operation using YMM registers followed by VZEROUPPER. The inner loop (1M iterations) then runs SSE2 instructions:

Code: Select all

  /* Test loop with VZEROUPPER & YMM. */
  asm volatile (
    ".align 64\n"
    "pcmpeqd %%xmm0, %%xmm0\n"
    :
    :
    : "memory"
  );
  j = 0; i = 0;
  start_cycles = rdtsc();
    asm volatile (
      ".align 64\n1:\n"
      "vmovdqa %%ymm0, %%ymm1\n"
      "vzeroupper\n"

      "xor %0, %0\n"
      ".align 64\n2:\n"
      "movdqa %%xmm0, %%xmm1\n"
      "paddd %%xmm1, %%xmm0\n"
      "lea 1(%0),%0\n"
      "cmp %2,%0\n"
      "jb 2b\n"

      "lea 1(%1),%1\n"
      "cmp %3,%1\n"
      "jb 1b\n"
      : "+r" (j), "+r" (i)
      : "r" (NUM_INNER_ITER), "r" (NUM_ITER)
      : "memory", "cc"
    );
  end_cycles = rdtsc();
  printf("%6s-test3: %ld*%ld iterations, %lld cycles\n", arch,
	 (long)i, (long)j, (long long int)(end_cycles - start_cycles));
In 64-bit mode, the loops run in 1013160670 cycles. In 32-bit mode, the run time doubles to 2004987382 cycles.

Removing `vmovdqa %%ymm0, %%ymm1` from the outer loop makes the problem go away. Likewise, if I use VEX instructions in the inner loop, no slowdown is seen. Switching VZEROUPPER to VZEROALL does not appear to make a difference... in fact, replacing `vmovdqa %%ymm0, %%ymm1` + VZEROUPPER with just VZEROALL triggers the same problem.
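
For reference, the VZEROALL-only variant of the outer loop looks roughly like this (a minimal sketch; the operands and constraints are the same as in the code above):

Code: Select all

      ".align 64\n1:\n"
      "vzeroall\n"                /* replaces vmovdqa %%ymm0, %%ymm1 + vzeroupper;
                                     also zeroes xmm0/xmm1, but the instruction mix
                                     in the inner loop is unchanged */

      "xor %0, %0\n"
      ".align 64\n2:\n"
      "movdqa %%xmm0, %%xmm1\n"
      "paddd %%xmm1, %%xmm0\n"
      "lea 1(%0),%0\n"
      "cmp %2,%0\n"
      "jb 2b\n"

      "lea 1(%1),%1\n"
      "cmp %3,%1\n"
      "jb 1b\n"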

The questions I now have are:
1. Has anyone else seen the same?
2. Am I doing something wrong?
3. How can I avoid this slowdown? I have 32-bit x86 code using mixed VEX & non-VEX portions with proper VZEROUPPER/VZEROALL after YMM usage, but the non-VEX parts now run slower than expected on Zen4... I'd like to find a way to avoid that (going 64-bit is the obvious one, but this is in an open-source library and 32-bit builds are still a thing).

I've attached the C source code I've been using to test this; it compiles with GCC:

Code: Select all

$ gcc -m64 -O2 -Wall zen4_ymm_32bitmode.c -o zen4_ymm_32bitmode
$ ./zen4_ymm_32bitmode
x86-64-test1: 1000*1000000 iterations, 1020975689 cycles
x86-64-test2: 1000*1000000 iterations, 1017973047 cycles
x86-64-test3: 1000*1000000 iterations, 1006376455 cycles
x86-64-test4: 1000*1000000 iterations, 1009606154 cycles
x86-64-test5: 1000*1000000 iterations, 1002171365 cycles
x86-64-test6: 1000*1000000 iterations, 1002110782 cycles
$ gcc -m32 -O2 -Wall zen4_ymm_32bitmode.c -o zen4_ymm_32bitmode
$ ./zen4_ymm_32bitmode
  i386-test1: 1000*1000000 iterations, 1003244375 cycles
  i386-test2: 1000*1000000 iterations, 1002098374 cycles
  i386-test3: 1000*1000000 iterations, 2003160445 cycles
  i386-test4: 1000*1000000 iterations, 2003677727 cycles
  i386-test5: 1000*1000000 iterations, 2003784511 cycles
  i386-test6: 1000*1000000 iterations, 1002279888 cycles
Attachments
zen4_ymm_32bitmode.zip
(875 Bytes)

agner
Site Admin
Posts: 75
Joined: 2019-12-27, 18:56:25

Re: VZEROUPPER issue with Zen4 in 32-bit mode?

Post by agner » 2023-06-04, 4:54:03

It appears that register-to-register moves are eliminated in 64-bit mode, but not in 32-bit mode. Perhaps this explains your results. Try with some other instructions.

jkivilin
Posts: 3
Joined: 2020-08-09, 19:16:49

Re: VZEROUPPER issue with Zen4 in 32-bit mode?

Post by jkivilin » 2023-06-04, 15:29:08

I tried replacing all "movdqa reg1, reg2" instructions with "pshufd $0xE4, reg1, reg2" and I'm still getting the same results, so move elimination alone does not seem to explain it.
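
The modified inner loop then looks like this (a minimal sketch; PSHUFD with immediate 0xE4 is an identity shuffle, i.e. a register copy that cannot be move-eliminated):

Code: Select all

      ".align 64\n2:\n"
      "pshufd $0xE4, %%xmm0, %%xmm1\n"  /* identity shuffle instead of MOVDQA */
      "paddd %%xmm1, %%xmm0\n"
      "lea 1(%0),%0\n"
      "cmp %2,%0\n"
      "jb 2b\n"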

I also tried to see whether a forced context switch would let performance recover. I modified the program to execute the YMM instruction + VZEROUPPER just once at the beginning, before the test loops. I start the test program with CPU affinity pinned to one core, then change its affinity to another core. This forced context switch appears to restore performance (tests run on 64-bit Linux).

Code: Select all

$ gcc -m32 -O2 -Wall zen4_ymm_32bitmode.c -o zen4_ymm_32bitmode -DTEST=3 # TEST=3: YMM instr & VZEROUPPER before test loops, PSHUFD & PADDD in test loop
$ taskset -c 4 ./zen4_ymm_32bitmode & (sleep 0.01;taskset -p -c 4 $!) ; wait # start program on core 4, then switch to core 4 after 10msec (no context switch).
[1] 355807
pid 355807's current affinity list: 4
pid 355807's new affinity list: 4
  i386-test3: 1000*1000000 iterations, 2003642806 cycles
[1]+  Done                    taskset -c 4 ./zen4_ymm_32bitmode
$ taskset -c 4 ./zen4_ymm_32bitmode & (sleep 0.01;taskset -p -c 3 $!) ; wait # start program on core 4, then switch to core 3 after 10msec (context switch and jump to other core).
[1] 355811
pid 355811's current affinity list: 4
pid 355811's new affinity list: 3
  i386-test3: 1000*1000000 iterations, 1036464821 cycles
[1]+  Done                    taskset -c 4 ./zen4_ymm_32bitmode
In fact, just adding "usleep(10)" between the YMM/VZEROUPPER part and the non-VEX test loop restores performance. A plain syscall into the 64-bit kernel does not seem to be enough, however; I tried adding an "fopen" call between the VEX and non-VEX code, but that did not help.
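
In code, the change is roughly this (a sketch, assuming <unistd.h> is included for usleep):

Code: Select all

  /* One-time YMM usage + VZEROUPPER, before the test loops. */
  asm volatile (
    "vmovdqa %%ymm0, %%ymm1\n"
    "vzeroupper\n"
    :
    :
    : "memory"
  );

  usleep(10);  /* 10 us sleep; after the kernel reschedules,
                  the non-VEX loop below runs at full speed */

  /* ... non-VEX SSE2 test loop as before ... */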

slowjuggler
Posts: 1
Joined: 2023-07-14, 6:27:24

Re: VZEROUPPER issue with Zen4 in 32-bit mode?

Post by slowjuggler » 2023-07-14, 16:55:11

Hi!
In this matrix multiplication code there is a similar situation: when execution enters the SSE loop through VZEROUPPER, performance is significantly reduced.

Code: Select all

.globl mul
mul:
        push %rbp
        push %rbx
        push %r12
        push %r13
        push %r14
        push %r15
        movq %rsp, %rbp
        movq %rdi, %r10              # r10 = A (array of row pointers)
        movq %rsi, %r11              # r11 = B
        movq %rdx, %r12              # r12 = C (result)
        xorq %r13, %r13              # r13 = row index i
        movq $4, %rax                # rax = 4 doubles per YMM vector
l1:                                  # loop over rows of A/C
        cmpq %r13, %rcx
        je l6
        xorq %r14, %r14              # r14 = inner index k
        movq (%r10, %r13, 8), %rdi   # rdi = A[i]
        movq (%r12, %r13, 8), %rdx   # rdx = C[i]
        incq %r13
l2:                                  # loop over k
        cmpq %r14, %r9
        je l1
        vpbroadcastq (%rdi, %r14, 8), %ymm0   # broadcast A[i][k]
        movq (%r11, %r14, 8), %rsi   # rsi = B[k]
        xorq %r15, %r15              # r15 = column index j
        incq %r14
        cmpq %rax, %r8
        jb l5                        # fewer than 4 columns: scalar only
l3:                                  # vectorized loop, 4 doubles per step
        cmpq %r15, %r8
        je l2
        movq %r15, %rbx
        subq %r8, %rbx
        incq %rbx
        cmpq %rax, %rbx
        jb l4                        # less than a full vector left: tail
        vmovups (%rsi, %r15, 8), %ymm1
        vmovups (%rdx, %r15, 8), %ymm3
        vfmadd231pd %ymm0, %ymm1, %ymm3       # C[i][j..j+3] += A[i][k]*B[k][j..j+3]
        vmovups %ymm3, (%rdx, %r15, 8)
        addq %rax, %r15
        jmp l3
l4:
        vzeroupper                   # clear upper YMM state before non-VEX tail
        subq $3, %r15                # step back to cover remaining columns
l5:                                  # scalar SSE2 tail loop
        cmpq %r15, %r8
        je l2
        movsd (%rsi, %r15, 8), %xmm1
        mulsd %xmm0, %xmm1
        movsd (%rdx, %r15, 8), %xmm2
        addsd %xmm1, %xmm2
        movsd %xmm2, (%rdx, %r15, 8) # C[i][j] += A[i][k]*B[k][j]
        incq %r15
        jmp l5
l6:
        pop %r15
        pop %r14
        pop %r13
        pop %r12
        pop %rbx
        pop %rbp
        ret
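
For reference, the vectorized body at l3 computes the following (a sketch in C intrinsics; the names here are just for illustration, and it assumes compilation with -mavx2 -mfma):

Code: Select all

#include <immintrin.h>

/* One step of the l3 loop: C[i][j..j+3] += A[i][k] * B[k][j..j+3]. */
static inline void fma_step(double a_ik, const double *b_row,
                            double *c_row, long j)
{
    __m256d a = _mm256_set1_pd(a_ik);        /* vpbroadcastq */
    __m256d b = _mm256_loadu_pd(&b_row[j]);  /* vmovups      */
    __m256d c = _mm256_loadu_pd(&c_row[j]);  /* vmovups      */
    c = _mm256_fmadd_pd(a, b, c);            /* vfmadd231pd  */
    _mm256_storeu_pd(&c_row[j], c);
}

The scalar tail at l5 does the same one element at a time with non-VEX SSE2, so each pass through l4's VZEROUPPER into l5 is exactly the VEX-to-non-VEX transition discussed in this thread.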
