Agner's CPU blog


Multithreads load-store throughput for bulldozer
Author: A-11  Date: 2014-06-27 23:59
I wondered why load/store throughput drops when multiple threads are active, as you describe in section "14.17 Cache and memory access" of microarchitecture.pdf, so I wrote the simple benchmark below. It counts how many load-store iterations complete in 1 second.

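/* bench.c */
#include <stdio.h>
#include <stdlib.h>   /* exit */
#include <unistd.h>   /* alarm */
#include <signal.h>

/* volatile forces a real load of b and a real store to a on every iteration */
volatile size_t a, b;

/* SIGALRM handler: print the number of completed iterations, then quit */
static void ringonger(int signum)
{
    (void)signum;
    /* printf is not async-signal-safe; accepted here for a throwaway test */
    printf("sum=%zu\n", a);
    exit(0);
}

int main(void)
{
    size_t sum = 0;
    b = 1;
    if (SIG_ERR == signal(SIGALRM, ringonger))
        perror("set signal");
    alarm(1);   /* deliver SIGALRM after 1 second */
    /* the load-store loop */
    while (1) {
        sum += b;   /* load from b */
        a = sum;    /* store to a */
    }
    return 0;
}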

I used the gcc-4.7.3 compiler. The load-store loop compiled to so few instructions that Bulldozer's instruction prefetcher becomes the bottleneck.

.L5:
movq b(%rip), %rdx
addq %rdx, %rax
movq %rax, a(%rip)
jmp .L5

So, I unrolled (and rewrote) the loop.

.L5:
addq b(%rip), %rax
movq %rax, a(%rip)
addq b(%rip), %rax
movq %rax, a(%rip)
addq b(%rip), %rax
movq %rax, a(%rip)
addq b(%rip), %rax
movq %rax, a(%rip)
addq b(%rip), %rax
movq %rax, a(%rip)
jmp .L5
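For readers who prefer to stay in C, the unrolling above can be approximated at source level. This is only a sketch: `PAIR` and `run_unrolled` are hypothetical names, not part of the original bench.c, the loop is bounded so that it terminates, and the compiler may still emit a separate `movq` plus `addq` instead of the fused `addq b(%rip), %rax` form, so the generated assembly should be checked.

```c
#include <stddef.h>

/* Same globals as bench.c: volatile forces a real load of b and a real
   store to a for every pair. */
volatile size_t a, b;

/* One load-store pair: load b, add it to the running sum, store to a. */
#define PAIR(sum) do { (sum) += b; a = (sum); } while (0)

/* Hypothetical helper: runs `iterations` passes of a 5x unrolled body
   (matching the 5 pairs in the assembly above) and returns the sum. */
size_t run_unrolled(size_t iterations)
{
    size_t sum = 0;
    b = 1;
    for (size_t i = 0; i < iterations; i++) {
        PAIR(sum);
        PAIR(sum);
        PAIR(sum);
        PAIR(sum);
        PAIR(sum);
    }
    return sum;
}
```

Each pass adds 5 to the sum, so the counter still counts individual load-store pairs, as in the assembly version.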

On Bulldozer, each "addq mem, reg ; movq reg, mem" instruction sequence generates 2 EX micro-ops and 2 AGU micro-ops with 1 load and 1 store. That exactly fills the 4 pipeline ports (2 EX and 2 AGU) and the 2 L1D cache ports of one Bulldozer core (a core, not a module, which contains 2 cores). So I expected a throughput of one load-store pair per clock. My measurements on an FX-8350 @ 3.4 GHz (sorry, that is Piledriver, not Bulldozer) are below.
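The port-counting argument can be written out as a small bottleneck model: the cost of one load-store pair is the largest demand/capacity ratio over the resources it uses. The sketch below uses only the figures stated in the post; the struct and function names are illustrative assumptions, and the model ignores everything else (fetch, decode, retirement).

```c
#include <stdint.h>

/* Demand of one load-store pair and per-core capacity per clock,
   taken from the figures quoted in the post. */
struct budget {
    unsigned ex_uops, ex_ports;     /* integer execution */
    unsigned agu_uops, agu_ports;   /* address generation */
    unsigned mem_ops, l1d_ports;    /* loads + stores vs. L1D ports */
};

static double ratio(unsigned demand, unsigned capacity)
{
    return (double)demand / (double)capacity;
}

/* Clocks per pair = the most heavily loaded resource. */
double clocks_per_pair(const struct budget *r)
{
    double c = ratio(r->ex_uops, r->ex_ports);
    double agu = ratio(r->agu_uops, r->agu_ports);
    double mem = ratio(r->mem_ops, r->l1d_ports);
    if (agu > c) c = agu;
    if (mem > c) c = mem;
    return c;
}

/* Ideal number of pairs completed in `seconds` at clock rate `hz`. */
uint64_t ideal_pairs(double hz, double seconds, const struct budget *r)
{
    return (uint64_t)(hz * seconds / clocks_per_pair(r) + 0.5);
}
```

With 2 EX micro-ops on 2 EX ports, 2 AGU micro-ops on 2 AGU ports, and 2 memory operations on 2 L1D ports, every ratio is 1, so a 3.4 GHz clock gives 3400000000 pairs in one second.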

ideal value
sum=3400000000

1 thread
sum=3243684085 (95.4% of ideal)

4 threads (each thread occupies a bulldozer module)
$ for ((i=0;i<4;i++)) ; do ./a.out & true ; done
sum=3202300235
sum=3251539046
sum=3082057158
sum=3123608353
average=3164876198 (93.1% of ideal)
max - min = 3251539046 - 3082057158 = 169481888 (5.0% of ideal)
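The shell loop above does not itself control which core each process lands on, so "one thread per module" depends on the scheduler. One way to make the placement explicit is to compute a logical CPU number per process and pin to it. The sketch below is hypothetical and assumes that logical CPUs 2m and 2m+1 are the two cores of module m, which should be verified on the actual machine.

```c
/* Placement policies matching the two experiments in the post. */
enum placement { ONE_PER_MODULE, TWO_PER_MODULE };

/* ASSUMPTION (not from the post): logical CPUs 2m and 2m+1 are the two
   cores of module m. Returns the logical CPU for a given thread index. */
int cpu_for_thread(enum placement p, int thread)
{
    if (p == ONE_PER_MODULE)
        return 2 * thread;   /* threads 0..3 -> CPUs 0, 2, 4, 6 */
    return thread;           /* threads 0..7 -> CPUs 0..7 */
}
```

On Linux, the returned number could then be passed to `taskset -c`, for example `taskset -c 6 ./a.out` for thread 3 in the one-per-module case.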

8 threads (2 threads run on each bulldozer module)
$ for ((i=0;i<8;i++)) ; do ./a.out & true ; done
sum=1683970930
sum=1670163880
sum=1671548424
sum=1701689856
sum=1703203849
sum=1704993658
sum=1674488405
sum=1707194181
average=1689656647.875 (49.7% of ideal)
max - min = 1707194181 - 1670163880 = 37030301 (1.1% of ideal)
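The averages and spreads quoted above can be recomputed from the raw `sum=` values. This is a minimal sketch; `struct summary` and `summarize` are illustrative names, and the ideal value of 3.4e9 is the one used in the post.

```c
#include <stddef.h>

struct summary {
    double mean;        /* average of the per-thread sums */
    double mean_pct;    /* average as a percentage of the ideal */
    double spread_pct;  /* (max - min) as a percentage of the ideal */
};

/* Summarize n per-thread counts against the ideal count. Requires n >= 1. */
struct summary summarize(const double *sums, size_t n, double ideal)
{
    double total = 0.0, lo = sums[0], hi = sums[0];
    for (size_t i = 0; i < n; i++) {
        total += sums[i];
        if (sums[i] < lo) lo = sums[i];
        if (sums[i] > hi) hi = sums[i];
    }
    struct summary s;
    s.mean = total / (double)n;
    s.mean_pct = 100.0 * s.mean / ideal;
    s.spread_pct = 100.0 * (hi - lo) / ideal;
    return s;
}
```

Feeding in the eight values from the 8-thread run with an ideal of 3.4e9 reproduces the average of 1689656647.875 listed above.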


With one thread per module, each thread reached more than 90% of the ideal value. With 2 threads per module, every thread came within 2% of half the ideal. In other words, the two cores of a module together deliver roughly the throughput of a single core running alone. This halving suggests that the two cores of a module share some important resource that limits the load-store throughput of the entire module. I suspect that the load-store unit providing the 2 ports is shared between the 2 cores of a module, like the instruction decoder.
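The same conclusion can be checked at module level: add up all eight counts, divide by the four modules, and compare with the ideal for a single core. The helper below is a hypothetical sketch of that arithmetic, not a measurement tool.

```c
/* Throughput of one module as a fraction of the single-core ideal.
   total_all_threads: sum of the counts printed by every thread
   modules:           number of modules the threads were spread over
   ideal_per_core:    ideal count for one core (3.4e9 in the post) */
double module_fraction(double total_all_threads, unsigned modules,
                       double ideal_per_core)
{
    return total_all_threads / (double)modules / ideal_per_core;
}
```

For the 8-thread run the total is 13517253183, which gives a fraction of roughly 0.99 per module. That is what one would expect if a whole module has about the load-store capacity the model assigned to a single core.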
 