Agner's CPU blog


 
Test results for AMD Bulldozer processor - Agner - 2012-03-02
Test results for AMD Bulldozer processor - Massimo - 2012-03-13
Test results for AMD Bulldozer processor - Agner - 2012-03-14
Test results for AMD Bulldozer processor - Alex - 2012-03-14
Test results for AMD Bulldozer processor - fellix - 2012-03-15
Test results for AMD Bulldozer processor - Agner - 2012-03-16
Test results for AMD Bulldozer processor - Massimo - 2012-03-16
Test results for AMD Bulldozer processor - Agner - 2012-03-17
Test results for AMD Bulldozer processor - avk - 2012-03-17
Test results for AMD Bulldozer processor - Massimo - 2012-03-17
Test results for AMD Bulldozer processor - Agner - 2012-03-17
Test results for AMD Bulldozer processor - Massimo - 2012-03-20
Test results for AMD Bulldozer processor - Agner - 2012-03-21
Cache WT performance of the AMD Bulldozer CPU - GordonBGood - 2012-06-05
Test results for AMD Bulldozer processor - zan - 2012-04-03
Multithreads load-store throughput for bulldozer - A-11 - 2014-06-27
Multithreads load-store throughput for bulldozer - Bigos - 2014-06-28
Multithreads load-store throughput for bulldozer - A-11 - 2014-07-04
Store forwarding stalls of piledriver - A-11 - 2014-09-07
 
Test results for AMD Bulldozer processor
Author: Agner Date: 2012-03-02 06:57

I have now got the time to test the AMD Bulldozer after being delayed by other projects.

The AMD Bulldozer is a major redesign of previous microarchitectures. The most notable points are

  • Aggressive power-saving features.
     
  • The chip has 2 - 8 "compute units" with two CPU cores each.
     
  • The code cache, instruction decoder, branch prediction unit and floating point execution unit are shared between two cores, while the level-1 data cache and the integer execution units are separate for each core.
     
  • A level-3 cache is shared between all compute units.
     
  • The pipeline can support 4 instructions per clock cycle.
     
  • Supports AVX instructions. Intel announced the AVX instruction set extension in 2008 and the AMD designers have had very little time to change their plans for the Bulldozer to support the new 256-bit vectors defined by AVX. The Bulldozer splits each 256-bit vector into two 128-bit vectors, as expected, but the throughput is still good because most floating point execution units are doubled so that the two parts can be processed simultaneously.
     
  • The maximum throughput is four 128-bit vectors or two 256-bit vectors per clock cycle if there is an equal mixture of integer vector and floating point vector operations. This throughput will probably be sufficient to service two threads in most cases.
     
  • Supports fused multiply-and-add instructions. These new instructions can do one addition and one multiplication in the same time that it otherwise takes to do one addition or one multiplication. It uses the FMA4 instruction codes designed by Intel, but unfortunately Intel have later changed their plans to FMA3, as discussed on this blog. (A small usage sketch follows after this list.)
     
  • Introduces AMD's new XOP instruction set extension with many useful instructions. Unfortunately, these instructions will rarely be used because they are unlikely to be supported by Intel.
     
  • The 3DNow instruction set is no longer supported. I don't think anybody will miss it.
     
  • Improved branch prediction with two-level branch target buffer.
     
  • Register-to-register moves are translated into register renaming with zero latency. For years, I have wondered why no CPU did this (except for the FXCH instruction). Now the Bulldozer is the first x86 processor to implement this feature. It works very well with four register renamings per clock cycle, but only for 128-bit registers, not for general purpose registers, x87 registers or 256-bit registers.
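Returning to the fused multiply-add point above: for readers who want to try the new instructions, here is a minimal sketch of an FMA4 multiply-add in C. It assumes GCC or Clang with -mfma4 on a Bulldozer system; the intrinsic _mm_macc_ps is the FMA4 form from <x86intrin.h>.

// Minimal FMA4 sketch: d = a*b + c on four packed floats.
// Assumes compilation with: gcc -O2 -mfma4 fma4_test.c
#include <x86intrin.h>
#include <stdio.h>

int main(void)
{
    __m128 a = _mm_set1_ps(2.0f);
    __m128 b = _mm_set1_ps(3.0f);
    __m128 c = _mm_set1_ps(1.0f);
    __m128 d = _mm_macc_ps(a, b, c);   // one fused multiply-add instruction (vfmaddps)
    float out[4];
    _mm_storeu_ps(out, d);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]); // expect 7.0 in each lane
    return 0;
}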

The test results are mostly good and many weaknesses of previous designs have been eliminated. However, there are still some weak points and bottlenecks that need to be mentioned:

  • The power saving features reduce the clock frequency most of the time. This often gives low and inconsistent results in benchmark tests because the clock frequency varies.
     
  • Some operating systems are not aware that the chip shares certain resources between the two cores that make up a compute unit. The consequence is that the operating system may put two threads into one compute unit while another unit is idle, or it may put two threads with different priority into the same compute unit so that a low priority thread can steal resources from a high priority thread. I don't understand why there is no CPUID function for telling which resources are shared between CPU cores. The current solution, where the operating system must know the details of every CPU on the market, is not practical, and it does not work with virtual CPUs etc. (A thread-affinity workaround sketch follows after this list.)
     
  • The shared instruction fetch unit can fetch up to 32 bytes per clock cycle or 16 bytes per core. This may be a bottleneck when both cores are active and when frequent jumps produce bubbles in the pipeline.
     
  • The decode unit can handle four instructions per clock cycle. It alternates between the two threads so that each thread gets two instructions per clock cycle on average. This is a serious bottleneck because the rest of the pipeline can handle up to four instructions per clock.
     
  • Cache bank conflicts in the data cache are so frequent that they seriously degrade the performance in some tests.
     
  • The code cache has only two ways, which may be insufficient for servicing two simultaneous threads.
     
  • The long pipeline causes long branch misprediction penalties.
     
  • The pipelines can handle four instructions per clock cycle, but there are only two integer ALUs where previous processors had three. This means that two of the four pipeline lanes will be idle most of the time in integer code.
     
  • Some floating point operations, such as shuffle, blend and booleans, are executed in the integer vector units. This causes an extra transport delay between the floating point vector unit and the integer vector unit.
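As a workaround for the operating system scheduling issue mentioned above, an application can pin its threads itself. Below is a hypothetical sketch for Linux with pthreads, assuming (as is commonly the case) that the operating system enumerates the two cores of a compute unit as adjacent logical CPUs, so that placing threads on even-numbered CPUs keeps each thread in its own compute unit. The core numbering is an assumption and should be verified on the actual system.

// Sketch: pin one worker thread to each compute unit of a 4-module Bulldozer chip.
// Assumes Linux/glibc (compile with -pthread) and that logical CPUs 2n and 2n+1
// belong to compute unit n.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void *worker(void *arg)
{
    int unit = *(int *)arg;              // compute unit index 0..3
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2 * unit, &set);             // use only the first core of that unit
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    /* ... do the actual work here ... */
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    int id[4] = {0, 1, 2, 3};
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, &id[i]);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}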
   
Test results for AMD Bulldozer processor
Author: Massimo Date: 2012-03-13 12:43
Hi Agner,
I'd like to know your opinion about a few things I was thinking about BD's architecture:

* What do you think of BD's AGU not being able to issue LS-related instructions like mov r/m? I.e., K10 could issue memory instructions in the AGU, whereas BD cannot - and PD is the same (just mov r/r and such takes the AGU path for renaming, I think). From the BD manual, almost no instruction ends up alone in the AGU (contrary to K10). It seems to me they moved toward having a fixed max 2 instruction throughput/core, a huge step down from the previous (ideal) 6. Considering that MOVs are everywhere and decode fast, it seems to me a huge limit on overall IPC.
* The split L2 cache access - do you think they'd do better using a contention mechanism for the whole cache, instead of splitting its access in half?
* Do you think AMD will add a trace cache to fix the bad dual-core decoder throughput like Intel did? I can't figure out a fix for that (decoding 6 instructions would not work, and making two x-1/x-1 decoders would duplicate the first instruction decoder).
* What do you think about the L1D WT choice with higher latency (coupled with a WCC halfway to the L2)? Does it impact the speed much for you?

On a last note: I was thinking of BD's IPC - 2 ALU + 2 ALU (+2 FPU, but they share the LS with the ALUs..). SB could sustain 4 instr/cycle in loops thanks to the TC, but the BD decoder would likely limit the IPC to 1.x/core, no? Is the shared decoder the bigger stopper for BD, or the reworked AGU?
Do you think that if AMD reworks the front-end to get near 2 instr/cycle/core, it will still fall short without the ability to parallelize MOVs?

Thanks,
Massimo


Hi Agner, I've seen you updated the instruction table - and it seems different from AMD's! So MOV r/m is issued in the AGU... but MOV m/r is not???
   
Test results for AMD Bulldozer processor
Author: Agner Date: 2012-03-14 01:40
Massimo wrote:
Do you think AMD will add a trace cache to fix the bad dual-core decoder throughput like Intel did?
Decoding is often a bottleneck in CISC designs. The trace cache on Intel's NetBurst (P4) was not very successful. I think it would be better to have one set of decoders per thread in the Bulldozer. AMD has instruction boundaries marked in the code cache which, strangely, Intel don't. So an extra set of decoders would be just a matter of die space and power consumption, and it would greatly increase the throughput.

It is strange that the floating point throughput is higher than the integer throughput on the Bulldozer. Later versions of Bulldozer can also do register-to-register moves in the two AGU pipelines, according to AMD manuals. I guess they will add more instructions to these pipelines to get a 4 instruction integer throughput in the future.

Others have criticized the cache design on the Bulldozer. I am not an expert in cache performance so I will not comment on that.

   
Test results for AMD Bulldozer processor
Author: Alex Date: 2012-03-14 09:07
Massimo wrote:
* What do you think about the L1D WT choice with higher latency (coupled with a WCC halfway to the L2)? Does it impact the speed much for you?
In his analysis Agner also wrote about an instruction-throughput penalty with both cores active. Instead of 4 instructions per clock, he could only measure around ~3 instructions per clock on average. I speculate that this is the effect of the L1's WT strategy. Because of WT, stores have to be sent to the L2, but the L2 can probably only handle *one* store instruction per clock, not 2. Thus, only 3 instructions instead of 4 per module. Agner also reported a maximum of ~3.6-3.7 instructions. Maybe he got more loads than the usual 2:1 load-to-store ratio in that case. But I don't know his code, so I can't say for sure, only speculate.
   
Test results for AMD Bulldozer processor
Author: fellix Date: 2012-03-15 15:01
The L1D cache in BD is probably also under-performing on its own merits, compared to the K10 implementation. True, its associativity is doubled, but the size is 1/4 of the previous architecture's, which yields a lower overall hit rate than the old 64 KByte 2-way solution. This, combined with the WT policy that relies too much on the anemic, high-latency L2 cache, makes the whole memory pipeline quite inefficient and hogs the data flow in many corner cases. The sheer size of the caches in BD is simply inadequate to compensate for the poor overall design. I think the L2 caches are the main stumbling block for the architecture in BD, additionally burdened with handling all the snoop traffic, since the L2-to-L3 relation is [mostly] exclusive. The good thing is that the HW prefetching in BD is more flexible now and can fetch data directly into the L2 (probably one of the reasons for AMD to make them so large). Still, all this is a far cry from what Intel has achieved over the years in both efficiency and wide scaling across the product range. Bulldozer is simply a chaotic patch-work of counterintuitive ideas with no leading prospects.
   
Test results for AMD Bulldozer processor
Author: Agner Date: 2012-03-16 01:18
fellix wrote:
Bulldozer is simply a chaotic patch-work of counterintuitive ideas with no leading prospects.
This is an opinion of an anonymous poster. I prefer to see experimental evidence.
   
Test results for AMD Bulldozer processor
Author: Massimo Date: 2012-03-16 17:01
@Alex: No, because L1D stores are sent to a 4 KB WB buffer for coalescing before the L2 - that's why the L1D is WT, of course. It might be interesting to 'overflow' such a buffer and see what happens. Could the full-load LS fractional number depend on a hidden latency that occurs from time to time when the WCC is forced to flush a line down to L2?
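One hypothetical way to 'overflow' the buffer and see what happens is to scatter stores over more distinct 64-byte lines than a 4 KB write-coalescing buffer can hold and compare the timings; here is a minimal sketch (the sizes and the assumption that 128 lines are enough to overflow it are mine, not measured):

// Sketch: repeatedly store to N distinct cache lines; with N small the WCC should
// absorb the stores, with N large every pass should force evictions toward the L2.
// Assumes a POSIX system (clock_gettime) and gcc -O2.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE 64
#define PASSES 1000000

static double fill_lines(volatile char *buf, int nlines)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int p = 0; p < PASSES; p++)
        for (int i = 0; i < nlines; i++)
            buf[i * LINE] = (char)p;          // one store per cache line
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void)
{
    volatile char *buf = malloc(256 * LINE);
    // 32 lines = 2 KB (should fit in the WCC), 128 lines = 8 KB (should overflow it)
    printf("32 lines:  %.3f s\n", fill_lines(buf, 32));
    printf("128 lines: %.3f s\n", fill_lines(buf, 128));
    return 0;
}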
@Fellix: The BD architecture looks interesting and innovative. Would you mind sharing your detailed microarchitecture comparison manuals with us?

Agner, re-reading the manuals I noticed a point I had overlooked: the BD decoder has NOT evolved into a 2-1-1-1 (4 instructions) like the IA one, but is still a 2-1-1, so the "4 instr/cycle" is actually one double-path and two single-path!
Since many more BD instructions are single-mop (the most used ones, in my experience and analysis) compared to Intel, wouldn't that make for much better decoder throughput than Intel's if they had a 2-1-1-1 decoder?

...in essence, such a decoder can at BEST 'pump' 1.5 instructions/cycle per core to the ALUs under full load, and max 3 on a single-core load if no double-path instructions are encountered. Odd.

   
Test results for AMD Bulldozer processor
Author: Agner Date: 2012-03-17 01:27
Massimo wrote:
the BD decoder has NOT evolved into a 2-1-1-1 (4 instructions) like the IA one, but is still a 2-1-1, so the "4 instr/cycle" is actually one double-path and two single-path!
The decoder can do 2-1-1 or 1-1-1-1, but not 2-2. So the total decoder throughput is 4 unless there are many double instructions.
   
Test results for AMD Bulldozer processor
Author: avk Date: 2012-03-17 03:35
What about the other decoding schemes: 1-1-2, 1-2-1? Can BD work with them too?
BTW, what do you think: is there any chance that AMD will soon implement register renaming for YMM? If yes, will it help to implement the 2-2 scheme? Or should we wait for the 128->256 bit broadening sometime in a 22/20 nm BD derivative?
   
Test results for AMD Bulldozer processor
Author: Massimo Date: 2012-03-17 04:30
Agner wrote:
The decoder can do 2-1-1 or 1-1-1-1
I'm losing you here - do you mean the first decoder can actually take either one 2-MOP instruction or two 1-MOP instructions and decode them? How did they do that - do they tag the path length too? Even in that case, how could it...???

Let me go straight on example:
XCHG a,b
TEST c,c
MOV r,[m]
is 2-1-1 and decodes in one cycle (I took the last MOV as going to the AGU, as you reported)
ADD a,b
ADD c,d
MOV s,[m]
MOV d,[m]
is a 1-1-1-1 and still decodes in ONE cycle?
How can the first decoder choose whether it should fetch one or two instructions (maybe an L1 tag too?), and even in that case, how is it possible that it can decode two instructions...

   
Test results for AMD Bulldozer processor
Author: Agner Date: 2012-03-17 12:41
avk wrote:
What about the other decoding schemes: 1-1-2, 1-2-1? Can BD work with them too?
I don't think so. Haven't tried.
BTW, what do you think: is there any chance that AMD will soon implement register renaming for YMM? If yes, will it help to implement the 2-2 scheme? Or should we wait for the 128->256 bit broadening sometime in a 22/20 nm BD derivative?
All YMM instructions generate two mops, regardless of renaming. A sequence of YMM instructions cannot decode as 2-2, so it is less efficient than the corresponding XMM instructions. They will probably fix this somehow in a later version when YMM instructions become more common. It would be much cheaper to improve the decoders than to make a full 256-bit data bus and execution unit.
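Until then, code targeting Bulldozer may prefer plain 128-bit XMM operations. Below is a minimal sketch of the same loop written both ways with AVX intrinsics (my example, assuming gcc -O2 -mavx and n divisible by 8); whether the 128-bit version is actually faster on a given chip must of course be measured.

// The same array addition with 256-bit YMM operations and with 128-bit XMM operations.
// On Bulldozer each 256-bit instruction generates two mops and cannot decode as 2-2,
// so the 128-bit version may go through the decoders more smoothly.
#include <immintrin.h>

void add_ymm(float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 x = _mm256_loadu_ps(a + i);
        __m256 y = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(a + i, _mm256_add_ps(x, y));
    }
}

void add_xmm(float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 8) {          // two 128-bit halves per iteration
        __m128 x0 = _mm_loadu_ps(a + i);
        __m128 x1 = _mm_loadu_ps(a + i + 4);
        __m128 y0 = _mm_loadu_ps(b + i);
        __m128 y1 = _mm_loadu_ps(b + i + 4);
        _mm_storeu_ps(a + i,     _mm_add_ps(x0, y0));
        _mm_storeu_ps(a + i + 4, _mm_add_ps(x1, y1));
    }
}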

Massimo wrote:

do you mean the first decoder can actually take either one 2-MOP instruction or two 1-MOP instructions and decode them?
I guess it works like this:
There are 4 parallel decode lines. When it finds a double instruction in the first of the 4 lines, and single instructions in the next two lines, it will generate 2-1-1. If it finds a double instruction in any of the other lines, it will delay it to the next clock cycle and put it in line 1.
   
Test results for AMD Bulldozer processor
Author: Massimo Date: 2012-03-20 09:16
Hi, thanks - I see now what you mean. 4 tagged instructions go to decoders 0-3; if decoder 0 gets a 2-MOP instruction then decoder 3's resources are used/stalled. So it's likely that decoder 3 shares the bus with decoder 0 for outputting MOPs to the OOOE scheduler.
Still, I cannot understand the huge IPC penalty of BD over SB. The LS unit is almost the same since Nehalem (2R/1R1W), and BD has a VERY slow REP MOVS (1/3 of SB, which sounds very worrying if you consider that mem/str/array copies are still implemented with REP MOVSx), but that cannot account for the performance loss. The lack of the 3rd ALU is important, yet the OOOE could easily mix MOVs and other instructions in between - for full(?) throughput I need to schedule asm instructions manually on IA.
So, while CMT is surely slowed down a lot by the decoder - what do you think of the single-core performance shortfall, even with regard to the K10 architecture? The BD decoder/retirement seems better than K10's (max 4 MOPs), yet it lags behind.
   
Test results for AMD Bulldozer processor
Author: Agner Date: 2012-03-21 01:59
Massimo wrote:
BD has a VERY slow REP MOVS
I think Sandy Bridge has a special implementation of REP MOVS, moving a whole cache line at a time under certain conditions. Many function libraries implement memcpy as a loop of aligned xmm moves, which is efficient on all processors.
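A minimal sketch of such a copy loop with SSE2 intrinsics (not any particular library's actual implementation; it assumes that dst and src are 16-byte aligned and that size is a multiple of 64 bytes):

// memcpy-style loop using aligned 128-bit moves, unrolled by four.
#include <emmintrin.h>
#include <stddef.h>

void copy_aligned_xmm(void *dst, const void *src, size_t size)
{
    __m128i       *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < size / 16; i += 4) {
        __m128i x0 = _mm_load_si128(s + i);
        __m128i x1 = _mm_load_si128(s + i + 1);
        __m128i x2 = _mm_load_si128(s + i + 2);
        __m128i x3 = _mm_load_si128(s + i + 3);
        _mm_store_si128(d + i,     x0);
        _mm_store_si128(d + i + 1, x1);
        _mm_store_si128(d + i + 2, x2);
        _mm_store_si128(d + i + 3, x3);
    }
}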

You can still get a throughput of 4 instructions per clock on Bulldozer on a single thread if you mix integer and floating point instructions, or if you mix different types of vector instructions.

   
Cache WT performance of the AMD Bulldozer CPU
Author: GordonBGood Date: 2012-06-05 05:42
Agner wrote
Cache bank conflicts in the data cache are so frequent that they seriously degrade the performance in some tests.
Massimo wrote
What do you think about the L1D WT choice with higher latency (coupled with a WCC halfway to the L2)? Does it impact the speed much for you?
Alex wrote
Because of WT, stores have to be sent to the L2, but the L2 can probably only handle *one* store instruction per clock, not 2.
fellix wrote
This, combined with the WT policy...
Massimo wrote
@Alex: No, because L1D stores are sent to a 4 KB WB buffer for coalescing before the L2 - that's why the L1D is WT, of course. It might be interesting to 'overflow' such a buffer and see what happens. Could the full-load LS fractional number depend on a hidden latency that occurs from time to time when the WCC is forced to flush a line down to L2?
Massimo wrote
Still, I cannot understand the huge IPC penalty of BD over SB. The LS unit is almost the same since Nehalem (2R/1R1W), and BD has a VERY slow REP MOVS (1/3 of SB, which sounds very worrying if you consider that mem/str/array copies are still implemented with REP MOVSx), but that cannot account for the performance loss.

Definition of terms: BD = AMD Bulldozer, SB = Intel Sandy Bridge, L1D = L1 Data Cache, WT = Write-Through policy as compared to WB = Write-Back policy, L2 = L2 cache, KB = KiloByte, WCC = Write Coalescing Cache (an AMD term), IPC = Instructions Per Clock (cycle), REP MOVSX = repeating x86 instructions that along with REP STOSX instructions are often used to quickly copy or set the contents of strings or arrays.

To all: I think you are onto something in wondering if the new cache organization is responsible for the loss of BD performance. Specifically, it seems that many (Alex, et al.) don't understand, as I also did not at first, that the WT cache policy applies to all caches, which means that when the WCC buffer overruns, the output is written both to the common L2 cache and also to main memory with its 161 clock cycle latency.

It seems that the AMD Bulldozer optimization manual (search for Software Optimization Guide for AMD Family 15h Processors download) recommends discrete unrolled writes for small copies and using the library functions (usually implemented with the REP MOVSX and REP STOSX instructions) for larger moves. I wondered why, so I used the following little C++ test program to investigate the speed of write operations in BD:

#include <cstring> // for memset

#define LOOPS 100
#define ITERS 40000000
#define L1SIZE 16384
#define BUFSIZE 4096 // L1SIZE / sizeof(unsigned int)

unsigned int buf[BUFSIZE];

for (unsigned int l = 0; l < LOOPS; ++l)
    for (unsigned int i = 0; i < ITERS; i += BUFSIZE)
        memset(buf, 1, L1SIZE); // glibc will convert this to a REP STOSX instruction for the most part...

along with a loop that counts the contents of the buf array outside of the timing so that the C++ compiler doesn't optimize away the work in the loop. We find that the fill operation runs at a rate of four bytes per clock cycle for either 32-bit or 64-bit code, which isn't too bad. Yes, SB may be faster by using more specialized optimizations, but this rate will get the job done for most memory, array, or string initializations. Let's look at this again after we visit the loop way of performing this, as not all work can be done using REP MOV or STO instructions. Changing the loop code to the following:

for (unsigned int l = 0; l < LOOPS; ++l)
    for (unsigned int i = 0; i < ITERS; i += BUFSIZE)
        for (unsigned int j = 0; j < BUFSIZE; ++j)
            buf[j] = j; // in order to prevent the compiler from optimizing this into a memset() function

reveals that it takes about 16 clock cycles to initialize each four-byte member of the array, and the time isn't all that much different whether 32-bit or 64-bit code is used, which is surprising considering that the inner loop should only be a little more than two clock cycles in duration. This proves that BD has some special optimization for the REP MOVSX and STOSX instructions that bypasses the cache limitations, even though these optimizations don't quite take it to the level of SB, which reputedly initializes a cache line per cycle. Apparently, the reason that AMD recommends the use of the REP instructions for larger copy and fill operations is that there is overhead to these optimizations in forwarding the results back to the caches, and thus they only make sense for larger blocks.

Let's show this by changing the array to an array of chars as follows:

#define BUFSIZE 16384 // L1SIZE / sizeof(unsigned char)

unsigned char buf[BUFSIZE];

for (unsigned int l = 0; l < LOOPS; ++l)
    for (unsigned int i = 0; i < ITERS; i += BUFSIZE)
        for (unsigned int j = 0; j < BUFSIZE; ++j)
            buf[j] = j; // in order to prevent the compiler from optimizing this into a memset() function

and we find that the run time hardly changes at all even though four times as many loop iterations are being used to fill the same array size in bytes. In other words, the limit on these memory fill operations is not the CPU execution rate but something else.

Let's analyse this in light of what we know: this array fills the 16 KB L1D, which has a WT policy but is backed by the (only) 4 KB WCC that combines successive writes into the 64-byte cache line size. Thus sixteen of these four-byte double-word writes are combined into a single write, meaning that once the WCC overflows it will be outputting a cache line every sixteen inner loop iterations. Since sixteen iterations of the inner loop without the memory write stall would only take something like 40 clock cycles, this single thread is stalled for a high percentage of its time waiting for memory - not the L2 with its latency of about 20 clock cycles, but main memory. This latency is the same for array element sizes from bytes up to quad words.

Now let's look at how this WT memory latency impacts on multi-threading performance by turning on eight cores using OpenMP:

#define NUMTHREADS 8

size_t count = 0; // summed from buf below so the compiler cannot optimize the work away

for (unsigned int l = 0; l < LOOPS; ++l)
{

#pragma omp parallel if(NUMTHREADS > 1) num_threads(NUMTHREADS) reduction(+ : count) private(buf)
    {
        unsigned char buf[BUFSIZE];

#pragma omp for schedule(static, 1) nowait
        for (int chnk = 1; chnk <= NUMTHREADS; ++chnk)
        {
            for (unsigned int p = 0; p < (ITERS / NUMTHREADS); p += BUFSIZE) // ITERS as in the earlier fragments
                for (unsigned int j = 0; j < BUFSIZE; ++j)
                    buf[j] = j; // in order to prevent the compiler from optimizing this into a memset() function
        }

        for (unsigned int i = 0; i < BUFSIZE; ++i) // necessary to prevent compiler from optimizing work away
            count += buf[i];

    }
}

which shows a net gain of something less than a factor of two in spite of eight cores sharing the work, and this is true no matter whether the buf array is private to the threads as shown here or refers back to the global buf array. This reveals that the whole process is memory latency bound and cannot take much advantage of multi-threading. Going back to the REP MOVSX/STOSX way of doing things: even though it is many times faster due to bypassing the caches for large blocks of memory, it is also memory latency bound, such that multiple threads sharing main memory also do not scale with the number of threads used when initializing memory.

Of course, this is a trivial example, as memory fill operations aren't normally a large percentage of what an algorithm does. However, I feel that these little examples show the BD limitations in a larger sense. Specifically, only operations that have fairly complex loops but generate a quite limited number of final write operations (intermediate write operations will be combined away by the WCC if the loop is written correctly) can really take advantage of BD's larger number of cores - applications such as 7-Zip or WinRAR, if written to be able to take advantage of this. Applications that have fairly simple loops and generate a higher number of final write operations will hit this memory latency bottleneck and not be able to take full advantage of the larger number of cores, especially if those operations require skipping up through memory with a stride larger than 64 bytes, as in prime number sieve culling (such as primesieve).

Algorithms that can be written to use no more than two KB of the L1D won't have this limitation, as the WCC will be able to fully buffer the L1D even if both cores of the pair are used, but it is not always possible to shrink an algorithm's working set to this level.
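As an illustration, here is a hypothetical sketch of such blocking: instead of streaming results through a large output array, the work is done in a small scratch buffer that stays within roughly 2 KB, and only the final results leave it. The function names and the block size are illustrative assumptions, not part of any real program.

// Process data in 2 KB blocks so the write working set stays small enough for the
// write-coalescing cache even with both cores of a compute unit active.
// process_block() and consume_block() stand in for the real algorithm.
#include <stddef.h>

#define BLOCK 2048  /* bytes; roughly one core's share of the 4 KB WCC */

void process_blocked(const unsigned char *in, size_t n,
                     void (*process_block)(unsigned char *dst, const unsigned char *src, size_t len),
                     void (*consume_block)(const unsigned char *dst, size_t len))
{
    unsigned char scratch[BLOCK];                  // small, reused write target
    for (size_t off = 0; off < n; off += BLOCK) {
        size_t len = (n - off < BLOCK) ? n - off : BLOCK;
        process_block(scratch, in + off, len);     // intermediate writes all hit the same 2 KB
        consume_block(scratch, len);               // final results leave the scratch buffer
    }
}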

The very best fix for this, given that the shared caches and the WT write policy are buried deep in the BD design right from the beginning and changing them would require scrapping BD, would be to increase the WCC to 32 KB, and perhaps even split it into two 16 KB WCCs, each servicing one of the cores, if that gives an advantage. The cost would be something like one million (high speed = fairly high power consumption) transistors per core, which seems reasonably trivial considering that there are approximately 60 to 70 million transistors per core after excluding those used by the L2 and L3 caches and support circuitry.

   
Test results for AMD Bulldozer processor
Author: zan Date: 2012-04-03 05:24
Agner wrote:
  • Some operating systems are not aware that the chip shares certain resources between the two cores that make up a compute unit. The consequence is that the operating system may put two threads into one compute unit while another unit is idle, or it may put two threads with different priority into the same compute unit so that a low priority thread can steal resources from a high priority thread. I don't understand why there is no CPUID function for telling which resources are shared between CPU cores. The current solution where the operating system must know the details of every CPU on the market is not practical, and it does not work with virtual CPUs etc.
     
The hardware locality project [1] attempts to deal with this to an extent, but so far it is only really used by a few, although it has been rolled into OpenMPI. I would agree that it isn't really ideal though. I don't expect the application programmers to start thinking about this stuff any time soon.

[1] www.open-mpi.org/projects/hwloc/

       
    Multithreads load-store throughput for bulldozer
    Author: A-11 Date: 2014-06-27 23:59
    I wondered why load/store throughput is down when multiple threads are active, as you said at "14.17 Cache and memory access" in microarchitecture.pdf. So I wrote a simple benchmark test below. This counts load-store loops done in 1 second.

    /* bench.c */
    #include <stdio.h>
    #include <stdlib.h>   /* for exit() */
    #include <unistd.h>
    #include <signal.h>

    volatile size_t a, b;

    static void ringonger(int _)
    {
        printf("sum=%zu\n", a);
        exit(0);
    }

    int main()
    {
        size_t sum = 0;
        b = 1;
        if (SIG_ERR == signal(SIGALRM, ringonger))
            perror("set signal");
        alarm(1);
        /* the load-store loop */
        while (1) {
            sum += b;
            a = sum;
        }
        return 0;
    }

    I used the gcc-4.7.3 compiler. The load-store loop was compiled so tiny that Bulldozer's instruction prefetcher becomes the bottleneck.

    .L5:
    movq b(%rip), %rdx
    addq %rdx, %rax
    movq %rax, a(%rip)
    jmp .L5

    So, I unrolled (and rewrote) the loop.

    .L5:
    addq b(%rip), %rax
    movq %rax, a(%rip)
    addq b(%rip), %rax
    movq %rax, a(%rip)
    addq b(%rip), %rax
    movq %rax, a(%rip)
    addq b(%rip), %rax
    movq %rax, a(%rip)
    addq b(%rip), %rax
    movq %rax, a(%rip)
    jmp .L5

    On Bulldozer, each "addq mem, reg ; movq reg, mem" instruction sequence generates 2 EX micro-ops and 2 AGU micro-ops with 1 load and 1 store, which uses exactly the 4 pipeline ports (2 EX and 2 AGU) and the 2 L1D cache ports of one Bulldozer core (not module = 2 cores). So I expected a throughput of 1 clock per load-store pair. My measurements on an FX-8350 @ 3.4 GHz (sorry for Piledriver, not Bulldozer) are below.

    ideal value
    sum=3400000000

    1 thread
    sum=3243684085 (95.4% of ideal)

    4 threads (each thread occupies a bulldozer module)
    $ for ((i=0;i<4;i++)) ; do ./a.out & true ; done
    sum=3202300235
    sum=3251539046
    sum=3082057158
    sum=3123608353
    average=3164876198 (93.0% of ideal)
    max - min = 3251539046 - 3082057158 = 169481888 (5.0% of ideal)

    8 threads (2 threads run at each bulldozer module)
    $ for ((i=0;i<8;i++)) ; do ./a.out & true ; done
    sum=1683970930
    sum=1670163880
    sum=1671548424
    sum=1701689856
    sum=1703203849
    sum=1704993658
    sum=1674488405
    sum=1707194181
    average=1689656647.875 (49.7% of ideal)
    max - min = 1707194181 - 1670163880 = 37030301 (1.1% of ideal)


    While the one-thread-per-module case reached more than 90% of the ideal value, the two-threads-per-module case came within 2% of exactly half the ideal. The half value means the two load-store units in a module share some important unit, which limits the load-store throughput of an entire module. I suspect that the load-store unit providing the 2 ports is shared between the 2 cores of a module, like the instruction decoder.
       
    Multithreads load-store throughput for bulldozer
    Author: Bigos Date: 2014-06-28 02:46
    What about L2 cache? Since L1D is a write-through design, any write to this cache will be automatically written to L2, which is shared between cores in a module.
    Of course there is a coalescing buffer, but IIUC it is meant to coalesce scattered writes to different addresses into a single transaction. In your program, you only write to a single address, so it's not effective.
       
    Multithreads load-store throughput for bulldozer
    Author: A-11 Date: 2014-07-04 09:13
    Agner measured in the same chapter that the L2 write throughput is 1 line (64 bytes) per 12 clocks. But as we saw, my program wrote almost 64 bytes per 8 clocks (8 bytes per clock), which exceeds Agner's measurement. So the buffer did more than coalescing, for example eliminating duplicate writes to the same address.

    By the way, I also wrote another benchmark which has no memory writes. So this benchmark must avoid L2 write-through access.


    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <signal.h>

    volatile size_t a, b;
    register size_t sum0 asm("%r12"), sum1 asm("%r13");

    static void ringonger(int _)
    {
        printf("%zu+%zu=%zu\n", sum0, sum1, sum1 + sum0);
        exit(0);
    }

    int main()
    {
        a = 1;
        b = 1;
        sum0 = 0;
        sum1 = 0;
        if (SIG_ERR == signal(SIGALRM, ringonger))
            perror("set signal");
        alarm(1);
        while (1) {
            sum0 = sum0 + a;
            sum1 = sum1 + b;
        }
        return 0;
    }

    It needs a gcc extension for global register variables, but the gcc-4.7.3 I used emits an invalid loop.

    .L5:
    movq a(%rip), %rax
    movq b(%rip), %rax
    jmp .L5

    So I fixed it by hand:

    .L5:
    addq a(%rip), %r12
    addq b(%rip), %r13
    addq a(%rip), %r12
    addq b(%rip), %r13
    addq a(%rip), %r12
    addq b(%rip), %r13
    addq a(%rip), %r12
    addq b(%rip), %r13
    addq a(%rip), %r12
    addq b(%rip), %r13
    jmp .L5

    As each Bulldozer core has 2 read ports, I expected each "addq mem, reg ; addq mem, reg" sequence per core to take 1 clock. The results of my measurements are below.

    ideal value
    6800000000

    1 thread
    6375394786 (93.9% of ideal)

    4 threads (1 thread per module)
    6388711814.75 (94.0% of ideal)

    8 threads (1 thread per core, or, 2 threads per module)
    3378318936.375 (49.7% of ideal)

    Again I saw just half the ideal value, even though the main loop of this benchmark contains no memory (hence no L2) writes. This result strengthens the suspicion that the load units are shared.
       
    Store forwarding stalls of piledriver
    Author: A-11 Date: 2014-09-07 10:11
    I found a blog that says store forwarding on Piledriver is improved from Bulldozer.
    blog.stuffedcow.net/2014/01/x86-memory-disambiguation/
    Unlike page 186 of "microarchitecture.pdf", where you store 32 bits and load the upper 16 bits of the 32 bits, the author of this blog stores 64 bits and loads the upper 32 bits of the 64 bits.
    He says that loading the upper 32 bits causes no stall.
    I also reproduced it on the FX-8350 with the code below (GNU as).

    .L5:
    movq %rbx, b(%rip)
    movl b+4(%rip), %eax
    addq $1, a(%rip)    # increase counter
    jmp .L5