Agner's CPU blog


Cache WT performance of the AMD Bulldozer CPU
Author: GordonBGood Date: 2012-06-05 05:42
Agner wrote
Cache bank conflicts in the data cache are so frequent that they seriously degrade the performance in some tests.
Massimo wrote
What do you think about the L1D WT choice with higher latency (coupled with a WCC halfway to the L2)? Does it impact the speed much for you?
Alex wrote
Because of WT, stores have to be sent to the L2, but the L2 can probably only handle *one* store instruction per clock, not 2.
fellix wrote
This, combined with the WT policy...
Massimo wrote
@Alex: No, because L1D stores are sent to a 4 KB WB buffer for coalescing before the L2 - that's why the L1D is WT, of course. It might be interesting to 'overflow' such a buffer and see what happens. Could the fractional full-load LS number depend on a hidden latency that occurs from time to time when the WCC is forced to free a line down to the L2?
Massimo wrote
Still, I cannot understand the huge IPC penalty of BD relative to SB. The LS has been almost the same since Nehalem (2R/1R1W), and BD has a VERY slow REP MOVS (1/3 the speed of SB, which sounds very worrying when you consider that memory/string/array copies are still implemented with REP MOVSx), but that cannot account for the performance loss.

Definition of terms: BD = AMD Bulldozer, SB = Intel Sandy Bridge, L1D = L1 data cache, WT = write-through policy as compared to WB = write-back policy, L2 = L2 cache, KB = kilobyte, WCC = Write Coalescing Cache (an AMD term), IPC = instructions per clock (cycle), REP MOVSx = repeated x86 string instructions that, along with the REP STOSx instructions, are often used to quickly copy or set the contents of strings or arrays.

To all: I think you are onto something in wondering whether the new cache organization is responsible for the loss of BD performance. Specifically, it seems that many (Alex, et al.) don't understand, as I also did not at first, that the WT cache policy applies to all cache levels, which means that when the WCC buffer overruns, the output is written both to the common L2 cache and to main memory with its 161 clock cycle latency.

It seems that the AMD Bulldozer optimization manual (search for "Software Optimization Guide for AMD Family 15h Processors" to download it) recommends discrete unrolled writes for small copies and the library functions (usually implemented with the REP MOVSx and REP STOSx instructions) for larger moves. I wondered why, so I used the following little C++ test program to investigate the speed of write operations on BD:

#include <cstring> // for memset()

#define LOOPS 100
#define ITERS 40000000
#define L1SIZE 16384
#define BUFSIZE 4096 // L1SIZE / sizeof(unsigned int)

unsigned int buf[BUFSIZE];

for (unsigned int l = 0; l < LOOPS; ++l)
    for (unsigned int i = 0; i < ITERS; i += BUFSIZE)
        memset(buf, 1, L1SIZE); // glibc will convert this to a REP STOSx instruction for the most part...

along with a loop that counts the contents of the buf array outside of the timing so that the C++ compiler doesn't optimize away the work in the loop; a sketch of such a harness follows below. We find that the array fill operation runs at a rate of four bytes per clock cycle for either 32-bit or 64-bit code, which isn't too bad. Yes, SB may be faster using more specialized optimizations, but this rate will get the job done for most memory, array, or string initializations. Let's look at this again after we visit the loop way of performing this, as not all work can be done using REP MOVSx or STOSx instructions.
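Since the exact harness wasn't posted, here is a minimal sketch of the kind of timing-and-checksum wrapper meant above; the clock() based timing is an illustrative assumption:

#include <cstdio>
#include <cstring>
#include <ctime>

#define LOOPS 100
#define ITERS 40000000
#define L1SIZE 16384
#define BUFSIZE 4096 // L1SIZE / sizeof(unsigned int)

unsigned int buf[BUFSIZE];

int main()
{
    clock_t start = clock(); // time only the fill loops

    for (unsigned int l = 0; l < LOOPS; ++l)
        for (unsigned int i = 0; i < ITERS; i += BUFSIZE)
            memset(buf, 1, L1SIZE);

    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

    unsigned long long count = 0; // summed outside the timed region...
    for (unsigned int i = 0; i < BUFSIZE; ++i)
        count += buf[i]; // ...so the compiler can't discard the fills

    printf("%.2f seconds, checksum %llu\n", secs, count);
    return 0;
}

Changing the loop code to the following: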

for (unsigned int l = 0; l < LOOPS; ++l)
    for (unsigned int i = 0; i < ITERS; i += BUFSIZE)
        for (unsigned int j = 0; j < BUFSIZE; ++j)
            buf[j] = j; // in order to prevent the compiler from optimizing this into a memset() call

reveals that it takes about 16 clock cycles to initialize each four-byte member of the array, and the time isn't all that much different whether 32-bit or 64-bit code is used, which is surprising considering that each inner loop iteration should take only a little more than two clock cycles. This shows that BD has some special optimization for the REP MOVSx and STOSx instructions that bypasses the cache limitations, even though it doesn't quite take BD to the level of SB, which reputedly initializes a full cache line per clock cycle. Apparently, the reason that AMD recommends the use of the REP instructions for larger copy and fill operations is that there is overhead to these optimizations in forwarding the results back to the caches, and thus they only make sense for larger blocks.
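To take glibc's implementation choice out of the picture, one could issue the string instruction directly; this is a minimal sketch assuming GCC or Clang inline assembly on x86-64 (not part of the original test):

#include <cstddef>
#include <cstdint>

// Fill 'count' double words at 'dst' with 'value' using REP STOSD
// directly, so the fast-string path is exercised regardless of how the
// library happens to implement memset().
static void repStosdFill(uint32_t* dst, uint32_t value, size_t count)
{
    asm volatile("rep stosl"            // store EAX to [RDI], RCX times
                 : "+D"(dst), "+c"(count)
                 : "a"(value)
                 : "memory");
}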

Let's show this by changing the array to an array of chars as follows:

#define BUFSIZE 16384 // L1SIZE / sizeof(unsigned char)

unsigned char buf[BUFSIZE];

for (unsigned int l = 0; l < LOOPS; ++l)
    for (unsigned int i = 0; i < ITERS; i += BUFSIZE)
        for (unsigned int j = 0; j < BUFSIZE; ++j)
            buf[j] = (unsigned char)j; // in order to prevent the compiler from optimizing this into a memset() call

and we find that the run time hardly changes at all even though four times as many inner loop iterations are being used to fill the same array size in bytes. In other words, the limit on these memory fill operations is not the CPU execution rate but rather something else.

Let's analyse this in light of what we know: this array fills the 16 KB L1D, which has a WT policy but is backed by the (only) 4 KB WCC, which combines successive writes into the 64 byte cache line size. Thus, sixteen of these four-byte double word writes are combined into a single line write, meaning that once the WCC overflows it will be outputting a cache line every sixteen inner loop iterations. Since sixteen inner loop iterations without the memory write stall would take something like 40 clock cycles, this single thread is stalled for a high percentage of its time waiting for memory - not the L2 with its latency of about 20 clock cycles, but main memory. This latency is the same for all sizes of array element, from quad words right down to the byte level.
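As a rough consistency check under these assumptions, here is a little calculation (the ~161 cycle memory latency and the ~2.5 cycles per store iteration are the estimates quoted above, not measured constants, and the assumption that the loop work overlaps with the write-back stall is mine):

#include <algorithm>
#include <cstdio>
#include <initializer_list>

int main()
{
    const double memLatency = 161.0; // assumed main memory latency in cycles
    const double iterCycles = 2.5;   // assumed cost of one store iteration
    const double lineBytes  = 64.0;  // cache line size in bytes

    // One line write-back is forced per cache line's worth of stores; if
    // the loop work overlaps with the stall, the cost per line is roughly
    // the larger of the two.
    for (double elemBytes : {4.0, 1.0}) // double word fills and byte fills
    {
        double storesPerLine = lineBytes / elemBytes;
        double cyclesPerLine = std::max(memLatency, storesPerLine * iterCycles);
        printf("%.0f-byte stores: ~%.0f cycles per line, ~%.1f per store\n",
               elemBytes, cyclesPerLine, cyclesPerLine / storesPerLine);
    }
    return 0;
}

Both element sizes come out at roughly the same cost per cache line, which agrees with the byte version running in about the same time; the measured ~16 cycles per double word (~256 per line) is somewhat above this bare-latency estimate, presumably reflecting extra write-back overhead beyond the raw latency.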

Now let's look at how this WT memory latency impacts multi-threading performance by turning on eight cores using OpenMP:

#define NUMTHREADS 8

unsigned long long count = 0; // reduction target, declared before the parallel region

for (unsigned int l = 0; l < LOOPS; ++l)
{
    #pragma omp parallel if(NUMTHREADS > 1) num_threads(NUMTHREADS) reduction(+ : count)
    {
        unsigned char buf[BUFSIZE]; // declared inside the region, so private to each thread

        #pragma omp for schedule(static, 1) nowait
        for (int chnk = 1; chnk <= NUMTHREADS; ++chnk)
        {
            for (unsigned int i = 0; i < (ITERS / NUMTHREADS); i += BUFSIZE) // the same total work split across the threads
                for (unsigned int j = 0; j < BUFSIZE; ++j)
                    buf[j] = (unsigned char)j; // in order to prevent the compiler from optimizing this into a memset() call
        }

        for (unsigned int i = 0; i < BUFSIZE; ++i) // necessary to prevent the compiler from optimizing the work away
            count += buf[i];
    }
}

which shows a net gain of something less than a factor of two in spite of eight cores sharing the work, and this is true whether the buf array is private to the threads as shown here or refers back to the global buf array. This reveals that the whole process is memory latency bound and can't take very great advantage of multi-threading. Going back to the REP MOVSx/STOSx way of doing things: even though it is many times faster for large blocks of memory due to bypassing the caches, it is also memory latency bound, so multiple threads sharing main memory likewise do not scale with the number of threads when initializing memory.

Of course, this is a trivial example, as memory fill operations aren't normally a large percentage of what an algorithm does. However, I feel that these little examples show the BD limitations in a larger sense; specifically, they show why the main applications that can take advantage of BD's larger number of cores are those with fairly complex loops that generate a quite limited number of final write operations (intermediate write operations will be combined away by the WCC if the loop is written correctly) - applications such as 7Zip or WinRar, if written to take advantage of this. Applications that have fairly simple loops and generate a higher number of final write operations will hit this memory latency bottleneck and not be able to take full advantage of the larger number of cores, especially if those operations require skipping up through memory with a stride larger than 64 bytes, as in prime number sieve culling (such as in primesieve).
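As an illustration of that last point, here is a hypothetical sieve-style culling loop (invented for illustration, not taken from primesieve): once the stride exceeds the 64 byte line size, every store touches a different cache line, so the WCC can coalesce nothing and each store eventually costs a whole line write-back:

#include <cstddef>

// Cull the multiples of 'prime' from a byte-per-number sieve buffer.
// For primes larger than 64, each store lands on a different 64 byte
// cache line, defeating the WCC's write combining entirely.
static void cullMultiples(unsigned char* sieve, size_t size, size_t prime)
{
    for (size_t i = prime * prime; i < size; i += prime)
        sieve[i] = 1; // one isolated store per cache line when prime > 64
}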

Algorithms that can be written to use no more than 2 KB of L1D for writes won't have this limitation, as the WCC will be able to fully buffer the write traffic even if both cores of the module pair are used, but it is not always possible to reduce an algorithm's working set to this level.
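For contrast, a minimal sketch of a WCC-friendly write pattern (the 256-entry table and the workload are invented for illustration): all of the stores land in a window that fits comfortably within a 2 KB share of the WCC, so repeated writes keep being coalesced instead of forcing line write-backs:

#include <cstddef>

static unsigned int hist[256]; // 1 KB of write working set, well under 2 KB

// Tally byte values: many stores, but all of them into the same 1 KB
// window, so under the WCC behaviour described above they should stay
// coalesced rather than spilling to the L2 or main memory on every pass.
static void byteHistogram(const unsigned char* data, size_t size)
{
    for (size_t i = 0; i < size; ++i)
        ++hist[data[i]];
}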

The very best fix for this, given that the shared caches and the WT write policy are buried deep in the BD design right from the beginning and changing them would require scrapping BD, would be to increase the WCC to 32 KB, and perhaps even to split it into two 16 KB WCCs, one servicing each of the cores, if that gives an advantage. The cost would be something like one million (high speed, thus fairly high power consumption) transistors per core, but that seems reasonably trivial considering that there are approximately 60 to 70 million transistors per core after excluding those used by the L2 and L3 caches and support circuitry.

 