Test results for AMD Bulldozer processor
Author: Agner | Date: 2012-03-02 06:57
I have now had the time to test the AMD Bulldozer, after being delayed by other projects. The AMD Bulldozer is a major redesign of previous microarchitectures. The most notable points are:
The test results are mostly good and many weaknesses of previous designs have been eliminated. However, there are still some weak points and bottlenecks that need to be mentioned:
Test results for AMD Bulldozer processor
Author: Massimo | Date: 2012-03-13 12:43
Hi Agner, I'd like to know your opinion about a few things I have been thinking about regarding BD's architecture:

* What do you think of BD's AGUs not being able to issue load/store-related instructions like mov r/m? That is, K10 could issue memory instructions in the AGUs, whereas BD cannot - and PD is the same (just mov r/r and such takes the AGU path, for renaming, I think). From the BD manual, almost no instruction ends up alone in an AGU (contrary to K10). It seems to me they moved toward a fixed maximum of 2 instructions per core of throughput, a huge step down from the previous (ideal) 6. Considering that MOVs are everywhere and decode fast, that looks like a huge limit on overall IPC.

* On a last note, I was thinking about BD's IPC: 2 ALU + 2 ALU (+ 2 FPU, but they share the load/store resources with the ALUs). SB can sustain 4 instructions per cycle in loops thanks to its trace cache, but the shared BD decoder would likely cut the IPC down to 1.x per core, no? Is the shared decoder the bigger stopper for BD, or the reworked AGUs?

Thanks.

Hi Agner, I've seen you updated the instruction tables - and they seem different from AMD's! So mov r/m is issued in the AGU... but mov m/r is not???
Test results for AMD Bulldozer processor
Author: Agner | Date: 2012-03-14 01:40
Massimo wrote:
Do you think AMD will add a trace cache to fix the bad dual-core decoder throughput like Intel did?

Decoding is often a bottleneck in CISC designs. The trace cache on Intel's NetBurst (P4) was not very successful. I think it would be better to have one set of decoders per thread in the Bulldozer. AMD has instruction boundaries marked in the code cache, which, strangely, Intel doesn't. So an extra set of decoders would be just a matter of die space and power consumption, and it would greatly increase the throughput. It is strange that the floating point throughput is higher than the integer throughput on the Bulldozer. Later versions of Bulldozer can also do register-to-register moves in the two AGU pipelines, according to AMD's manuals. I guess they will add more instructions to these pipelines in the future to get an integer throughput of 4 instructions. Others have criticized the cache design of the Bulldozer. I am not an expert in cache performance, so I will not comment on that.
Test results for AMD Bulldozer processor
Author: Alex | Date: 2012-03-14 09:07
Massimo wrote:
What do you think about the L1D WT choice with higher latency (coupled with a WCC halfway to the L2)? Does it impact much the speed for you?

In his analysis Agner also wrote about an instruction-throughput penalty with both cores active: instead of 4 instructions per clock, he could only measure around ~3 instructions per clock on average. I speculate that this is the effect of the L1's WT strategy. Because of WT, stores have to be sent to the L2, but the L2 can probably only handle *one* store instruction per clock, not 2. Thus only 3 instructions instead of 4 per module. Agner also reported a maximum of ~3.6-3.7 instructions. Maybe he got more loads than the usual 2:1 load-to-store ratio in that case. But I don't know his code, so I can't say for sure, only speculate.
Test results for AMD Bulldozer processor
Author: fellix | Date: 2012-03-15 15:01
The L1D cache in BD is probably also under-performing on its own merits compared to the K10 implementation. True, its associativity is doubled, but the size is a quarter of that in the previous architecture, which yields a lower overall hit rate than the old 64 KB 2-way solution. This, combined with the WT policy that relies so heavily on the anemic, high-latency L2 cache, makes the whole memory pipeline quite inefficient and chokes the data flow in many corner cases. The sheer size of the caches in BD is simply inadequate to compensate for the poor overall design. I think the L2 caches are the main stumbling block of the architecture in BD, additionally burdened with handling all the snoop traffic, since the L2-to-L3 relation is [mostly] exclusive. The good thing is that the hardware prefetching in BD is more flexible now and can fetch data directly into the L2 (probably one of the reasons for AMD to make the L2 caches so large). Still, all this is a far cry from what Intel has achieved over the years in both efficiency and wide scaling across the product range. Bulldozer is simply a chaotic patchwork of counterintuitive ideas with no leading prospects.
Test results for AMD Bulldozer processor
Author: Agner | Date: 2012-03-16 01:18
fellix wrote:
Bulldozer is simply a chaotic patchwork of counterintuitive ideas with no leading prospects.

This is the opinion of an anonymous poster. I prefer to see experimental evidence.
Test results for AMD Bulldozer processor
Author: Massimo | Date: 2012-03-16 17:01
@Alex: No, because L1D stores are sent to a 4 KB WB buffer for coalescing before L2 - that's why the L1D is WT, of course. It might be interesting to 'overflow' such a buffer and see what happens. Could the fractional load-store number under full load depend on a hidden latency that occurs from time to time when the WCC is forced to evict a line down to L2?

@fellix: The BD architecture looks interesting and innovative to me. Would you mind sharing your detailed microarchitecture comparison with us?

Agner, re-reading the manuals I noticed a point I had overlooked: the BD decoder has NOT evolved into a 2-1-1-1 (4 instruction) design like the Intel ones, but is still a 2-1-1, so the "4 instructions/cycle" is actually one double path and two single paths! In essence, such a decoder can at BEST pump 1.5 instructions/cycle into each core's ALUs under full load, and a maximum of 3 for a single-core load, if no double-path instructions are encountered. Odd.
Test results for AMD Bulldozer processor
Author: Agner | Date: 2012-03-17 01:27
Massimo wrote:
The BD decoder has NOT evolved into a 2-1-1-1 (4 instruction) design like the Intel ones, but is still a 2-1-1, so the "4 instructions/cycle" is actually one double path and two single paths!

The decoder can do 2-1-1 or 1-1-1-1, but not 2-2. So the total decoder throughput is 4 unless there are many double instructions.
Test results for AMD Bulldozer processor
Author: avk | Date: 2012-03-17 03:35
What about the other decoding schemes, 1-1-2 and 1-2-1? Can BD work with those too? By the way, what do you think: is there any chance that AMD will soon implement register renaming for YMM? If so, will it help to implement the 2-2 scheme? Or should we wait for the 128-bit to 256-bit widening sometime in the 22/20 nm BD derivatives?
Test results for AMD Bulldozer processor
Author: Massimo | Date: 2012-03-17 04:30
Agner wrote:
The decoder can do 2-1-1 or 1-1-1-1.

I'm losing you here - do you mean the first decoder can actually fetch one 2-MOP instruction or two 1-MOP instructions and decode them? How did they do that - do they tag the path length too? Even in that case, how could it...??? Let me go straight to an example:
Test results for AMD Bulldozer processor
Author: Agner | Date: 2012-03-17 12:41
avk wrote:
What about the other decoding schemes: 1-1-2, 1-2-1? Can BD work with them too?

I don't think so, but I haven't tried.

avk wrote:
BTW, how do you think: is there any chance that AMD soon will implement the register renaming for YMM? If yes, will it help to implement the 2-2 scheme? Or should we wait for the 128->256 bit broadening somewhen in the 22/20 nm BD's derivative?

All YMM instructions generate two mops, regardless of renaming. A sequence of YMM instructions cannot decode as 2-2, so it is less efficient than the corresponding XMM instructions. They will probably fix this somehow in a later version when YMM instructions become more common. It would be much cheaper to improve the decoders than to make a full 256-bit data bus and execution unit.

Massimo wrote:
Do you mean the first decoder can actually fetch one 2-MOP or two 1-MOP instructions and decode them?

I guess it works like this: there are 4 parallel decode lines. When it finds a double instruction in the first of the 4 lines and single instructions in the next two lines, it will generate 2-1-1. If it finds a double instruction in any of the other lines, it will delay it to the next clock cycle and put it in line 1.
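To make the grouping rule concrete, here is a toy model of how I understand it - only an illustration, not something from AMD's documentation. It assumes at most four instructions and at most four mops per clock, and that a double (2-mop) instruction can only occupy the first decode line:

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // mops[i] is the number of mops (1 or 2) of instruction i.
    // Returns how many clock cycles the decoders need under the rule above.
    int cyclesToDecode(const std::vector<int>& mops) {
        std::size_t i = 0;
        int cycles = 0;
        while (i < mops.size()) {
            int slots = 0, mopsThisCycle = 0;
            while (i < mops.size() && slots < 4 && mopsThisCycle + mops[i] <= 4) {
                if (mops[i] == 2 && slots > 0) break;  // a double may only start a group
                mopsThisCycle += mops[i];
                ++slots;
                ++i;
            }
            ++cycles;
        }
        return cycles;
    }

    int main() {
        std::printf("%d\n", cyclesToDecode({1, 1, 1, 1}));  // 1 cycle: 1-1-1-1
        std::printf("%d\n", cyclesToDecode({2, 1, 1, 1}));  // 2 cycles: 2-1-1, then 1
        std::printf("%d\n", cyclesToDecode({2, 2, 2, 2}));  // 4 cycles: no 2-2 groups
        return 0;
    }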
Test results for AMD Bulldozer processor
Author: Massimo | Date: 2012-03-20 09:16
Hi, thanks - I see now what you mean. Four tagged instructions go to decoders 0-3; if decoder 0 gets a 2-MOP instruction, then decoder 3's resources are used/stalled. So it is likely that one decoder (3) shares the bus with decoder 0 for outputting MOPs to the out-of-order scheduler. Still, I cannot understand the huge IPC penalty of BD relative to SB. The load-store unit is almost the same as since Nehalem (2 reads, or 1 read + 1 write), and BD has a VERY slow REP MOVS (1/3 of SB - which sounds worrying if you consider that memory/string/array copies are still often implemented with REP MOVSx), but that cannot account for the performance loss. The lack of the third ALU is important, yet the out-of-order engine could easily mix MOVs and other instructions in between - on IA I have to schedule asm instructions manually to get full(?) throughput. So, while CMT is surely slowed down a lot by the decoder, what do you think of the single-core performance pitfall, even with regard to the K10 architecture? BD's decode/retirement seems better than K10's (max 4 MOPs), yet it lags behind.
Test results for AMD Bulldozer processor
Author: Agner | Date: 2012-03-21 01:59
Massimo wrote:
BD has a VERY slow REP MOVS.

I think Sandy Bridge has a special implementation of REP MOVS, moving a whole cache line at a time under certain conditions. Many function libraries implement memcpy as a loop of aligned XMM moves, which is efficient on all processors. You can still get a throughput of 4 instructions per clock on Bulldozer on a single thread if you mix integer and floating point instructions, or if you mix different types of vector instructions.
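For reference, the core of such a copy loop looks something like this - just a sketch which assumes that both pointers are 16-byte aligned and that the size is a multiple of 16 bytes; real library implementations of course also handle heads, tails and misalignment:

    #include <emmintrin.h>
    #include <cstddef>

    // Copy n bytes with aligned 16-byte SSE2 loads and stores.
    void copy_aligned16(void* dst, const void* src, std::size_t n) {
        const __m128i* s = static_cast<const __m128i*>(src);
        __m128i* d = static_cast<__m128i*>(dst);
        for (std::size_t i = 0; i < n / 16; ++i)
            _mm_store_si128(d + i, _mm_load_si128(s + i));
    }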
Cache WT performance of the AMD Bulldozer CPU
Author: | Date: 2012-06-05 05:42
Agner wrote:
Cache bank conflicts in the data cache are so frequent that it seriously degrades the performance in some tests.

Massimo wrote:
What do you think about the L1D WT choice with higher latency (coupled with a WCC halfway to the L2)? Does it impact much the speed for you?

Alex wrote:
Because of WT, stores have to be sent to the L2, but the L2 can probably only handle *one* store instruction per clock, not 2.

fellix wrote:
This, combined with the WT policy...

Massimo wrote:
@Alex: No, because L1D stores are sent to a 4 KB WB buffer for coalescing before L2 - that's why the L1D is WT, of course. It might be interesting to 'overflow' such a buffer and see what happens. Could the fractional load-store number under full load depend on a hidden latency that occurs from time to time when the WCC is forced to evict a line down to L2?

Massimo wrote:
Still, I cannot understand the huge IPC penalty of BD relative to SB. The load-store unit is almost the same as since Nehalem (2 reads, or 1 read + 1 write), and BD has a VERY slow REP MOVS (1/3 of SB - which sounds worrying if you consider that memory/string/array copies are still often implemented with REP MOVSx), but that cannot account for the performance loss.

Definition of terms: BD = AMD Bulldozer, SB = Intel Sandy Bridge, L1D = L1 data cache, WT = write-through policy (as opposed to WB = write-back), L2 = L2 cache, KB = kilobyte, WCC = Write Coalescing Cache (an AMD term), IPC = instructions per clock (cycle), REP MOVSx/REP STOSx = the repeated x86 string instructions that are often used to quickly copy or set the contents of strings or arrays.

To all: I think you are onto something in wondering whether the new cache organization is responsible for the loss of BD performance. Specifically, it seems that many (Alex, et al.) don't understand, as I also did not at first, that the WT cache policy applies to all cache levels, which means that when the WCC buffer overruns, the output is written both to the common L2 cache and also to main memory with its roughly 161 clock cycle latency. The AMD Bulldozer optimization manual (search for "Software Optimization Guide for AMD Family 15h Processors" to download it) recommends discrete unrolled writes for small copies and the library functions (usually implemented with the REP MOVSx and REP STOSx instructions) for larger moves. I wondered why, so I used a little C++ test program to investigate the speed of write operations in BD:
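In outline it looked like this; BUFSIZE, LOOPS, the checksum at the end and the use of std::memset (which the compiler may turn into REP STOS or a library call) stand in for the details, and the actual timing calls are omitted:

    #include <cstring>
    #include <cstdio>

    const unsigned int BUFSIZE = 4096;      // 16 KB of unsigned ints
    const unsigned int LOOPS   = 1000000;   // repeat count, assumed

    unsigned int buf[BUFSIZE];

    int main() {
        // The timed part: fill the whole buffer LOOPS times.
        for (unsigned int l = 0; l < LOOPS; ++l)
            std::memset(buf, (int)(l & 0xFF), sizeof(buf));
        // Outside the timing: sum the buffer so the compiler cannot
        // optimize the fill away.
        unsigned long long count = 0;
        for (unsigned int i = 0; i < BUFSIZE; ++i)
            count += buf[i];
        std::printf("checksum %llu\n", count);
        return 0;
    }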
A loop that sums the contents of the buf array runs outside of the timing, so that the C++ compiler doesn't optimize the work in the timed loop away. We find that the array-fill operation runs at a rate of four bytes per clock cycle for either 32-bit or 64-bit code, which isn't too bad. Yes, SB may be faster using more specialized optimizations, but this rate will get the job done for most memory, array, or string initializations. Let's look at this again after we visit the plain-loop way of performing this, as not all work can be done with REP MOVS or REP STOS instructions. Changing the fill code to an explicit loop:
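Something like this, reusing buf, BUFSIZE and LOOPS from the sketch above (the exact loop body is an assumption):

    for (unsigned int l = 0; l < LOOPS; ++l)
        for (unsigned int i = 0; i < BUFSIZE; ++i)
            buf[i] = l;                     // one four-byte store per element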
reveals that it is taking about 16 clock cycles to initialize each four-byte element of the array, and the time isn't much different whether 32-bit or 64-bit code is used - surprising, considering that the inner loop should only be a little more than two clock cycles long. This proves that BD has some special optimization for the REP MOVSx and STOSx instructions that bypasses the cache limitations, even though these optimizations don't quite take it to the level of SB, which reputedly initializes a cache line per cycle. Apparently, the reason that AMD recommends the REP instructions only for larger copy and fill operations is that forwarding their results back to the caches has some overhead, so they only make sense for larger blocks. Let's show this by changing the array to an array of chars as follows:
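Something like this, with BUFSIZE now counted in bytes (16384 for the same 16 KB) and the loop body again assumed:

    unsigned char buf[BUFSIZE];
    for (unsigned int l = 0; l < LOOPS; ++l)
        for (unsigned int i = 0; i < BUFSIZE; ++i)
            buf[i] = (unsigned char)l;      // one single-byte store per element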
We find that the run time hardly changes at all, even though four times as many loop iterations are used to fill the same array size in bytes. In other words, the limit on these memory fill operations is not the CPU execution rate but something else. Let's analyse this in light of what we know: this array fills the 16 KB L1D, which has a WT policy but is backed by the (only) 4 KB WCC that combines successive writes into the 64-byte cache line size. Thus, sixteen of these four-byte double-word writes are combined into a single write, meaning that once the WCC overflows it will be outputting a cache line every sixteen inner loop iterations. Since the inner loop time without the memory write stall would be something like 40 clock cycles, this single thread is stalled for a high percentage of its time waiting for memory - not the L2 with its latency of about 20 clock cycles, but main memory. This latency is the same for array element sizes from bytes right up to quad words. Now let's look at how this WT memory latency affects multi-threading performance by turning on eight cores using OpenMP:
    for (unsigned int l = 0; l < LOOPS; ++l)
    #pragma omp parallel if(NUMTHREADS > 1) num_threads(NUMTHREADS) reduction(+ : count) private(buf)
    #pragma omp for schedule(static, 1) nowait
    for (unsigned int i = 0; i < BUFSIZE; ++i)
        // necessary to prevent compiler from optimizing work away
    }

which shows a net gain of something less than a factor of two, in spite of eight cores sharing the work, and this holds whether the buf array is private to the threads as shown here or refers back to the global buf array. This reveals that the whole process is memory-latency bound and cannot take much advantage of multi-threading. Going back to the REP MOVSx/STOSx way of doing things: even though it is many times faster for large blocks of memory because it bypasses the caches, it is also memory-latency bound, so multiple threads sharing main memory do not scale with the number of threads either when initializing memory. Of course, this is a trivial example, as memory fill operations aren't normally a large percentage of what an algorithm does. However, I feel that these little examples show the BD limitations in a larger sense. Specifically, the applications that can take advantage of BD's larger number of cores are mainly those with fairly complex loops that generate a quite limited number of final write operations (intermediate writes will be combined away by the WCC if the loop is written correctly) - applications such as 7-Zip or WinRAR, if written to take advantage of this. Applications that have fairly simple loops and generate a larger number of final write operations will hit this memory latency bottleneck and will not be able to take full advantage of the larger number of cores, especially if those operations skip through memory with a stride larger than 64 bytes, as in prime number sieve culling (for example, primesieve). Algorithms that can be written to use no more than two KB of L1D won't have this limitation, as the WCC will be able to fully buffer the L1D even if both cores of the pair are used, but it is not always possible to shrink an algorithm's working set to this level. The best fix, given that the shared caches and the WT write policy are buried deeply in the BD design and changing them would mean scrapping BD, would be to increase the WCC to a 32 KB size, and perhaps even split it into two 16 KB WCCs, each servicing one of the cores, if that gives an advantage. The cost would be about one million (high speed, and thus fairly high power consumption) transistors per core, which seems reasonably trivial considering that there are approximately 60 to 70 million transistors per core after excluding those used by the L2 and L3 caches and support circuitry.
Test results for AMD Bulldozer processor
Author: | Date: 2012-04-03 05:24
Agner wrote:

The hardware locality project (hwloc) [1] attempts to deal with this to an extent, but so far it is only really used by a few, although it has been rolled into Open MPI. I would agree that it isn't really ideal, though. I don't expect application programmers to start thinking about this stuff any time soon.
Multithreaded load-store throughput for Bulldozer
Author: | Date: 2014-06-27 23:59
I wondered why the load/store throughput drops when multiple threads are active, as you describe in chapter 14.17, "Cache and memory access", of microarchitecture.pdf. So I wrote a simple benchmark, shown below, which counts how many load-store loop iterations complete in one second.
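In outline it looked like this; the one-second timing via SIGALRM and the variable names are just one plausible way to set it up:

    #include <csignal>
    #include <cstdio>
    #include <unistd.h>

    volatile std::sig_atomic_t done = 0;
    volatile unsigned long long mem = 0;     // the location that is loaded and stored

    void onAlarm(int) { done = 1; }

    int main() {
        std::signal(SIGALRM, onAlarm);
        alarm(1);                            // stop after one second
        unsigned long long count = 0;
        while (!done) {
            mem = mem + 1;                   // one load and one store of mem per iteration
            ++count;
        }
        std::printf("%llu load-store loops per second\n", count);
        return 0;
    }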
I used the gcc-4.7.3 compiler. The load-store loop compiled to something so tiny that Bulldozer's instruction prefetcher becomes the bottleneck.
So I unrolled (and rewrote) the loop.
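Roughly like this, keeping the same one-second harness and using gcc extended inline assembly; each pair of instructions is the "addq mem, reg ; movq reg, mem" sequence discussed below:

        static unsigned long long mem = 0;   // loaded and stored on every pair
        unsigned long long count = 0, r = 0;
        while (!done) {                      // 'done' is set by the SIGALRM handler
            asm volatile(                    // eight load-add/store pairs per loop, so the
                "addq %1, %0 \n\t movq %0, %1 \n\t"   // loop-control instructions are a
                "addq %1, %0 \n\t movq %0, %1 \n\t"   // small fraction of the executed code
                "addq %1, %0 \n\t movq %0, %1 \n\t"
                "addq %1, %0 \n\t movq %0, %1 \n\t"
                "addq %1, %0 \n\t movq %0, %1 \n\t"
                "addq %1, %0 \n\t movq %0, %1 \n\t"
                "addq %1, %0 \n\t movq %0, %1 \n\t"
                "addq %1, %0 \n\t movq %0, %1 \n\t"
                : "+r"(r), "+m"(mem));
            count += 8;
        }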
On Bulldozer, each "addq mem, reg ; movq reg, mem" instruction sequence generates 2 EX micro-ops and 2 AGU micro-ops with 1 load and 1 store, which uses exactly the 4 pipeline ports (2 EX and 2 AGU) and the 2 L1D cache ports of one Bulldozer core (not module = 2 cores). So I expected a throughput of 1 clock per load-store pair. My measurements on an FX-8350 @ 3.4 GHz (sorry, a Piledriver rather than a Bulldozer) are given below.
While the case with one thread per module achieved 90%+ of the ideal value, two threads per module landed within 2% of exactly half the ideal. The half value suggests that the two cores' load-store paths share some important unit, which limits the load-store throughput of the entire module. I suspect the load-store unit providing the 2 ports is shared between the 2 cores in a module, just like the instruction decoder.
Multithreaded load-store throughput for Bulldozer
Author: Bigos | Date: 2014-06-28 02:46
What about L2 cache? Since L1D is a write-through design, any write to this cache will be automatically written to L2, which is shared between cores in a module. Of course there is a coalescing buffer, but IIUC it is meant to coalesce scattered writes to different addresses into a single transaction. In your program, you only write to a single address, so it's not effective. |
Multithreaded load-store throughput for Bulldozer
Author: | Date: 2014-07-04 09:13
Agner measured in the same chapter that the L2 write throughput is 1 line (64 bytes) per 12 clocks. But as we saw, my program wrote almost 64 bytes per 8 clocks (8 bytes per clock), which exceeds Agner's measurement. So the buffer does more than coalescing - for example, it eliminates duplicated writes to the same address. By the way, I also wrote another benchmark which has no memory writes at all, so it must avoid the L2 write-through traffic entirely.
It needs a gcc extension for global register variables, but the gcc-4.7.3 I used emits invalid loops for it, so the loop has to be fixed up by hand.
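Roughly like this, with gcc extended inline assembly standing in for the global register variables of the original (same one-second harness as before):

        static unsigned long long mem = 0;   // only ever read, never written
        unsigned long long count = 0, r0 = 0, r1 = 0;
        while (!done) {
            asm volatile(                    // eight "addq mem, reg" loads and no stores,
                "addq %2, %0 \n\t addq %2, %1 \n\t"   // so nothing has to go down the
                "addq %2, %0 \n\t addq %2, %1 \n\t"   // write-through path to L2
                "addq %2, %0 \n\t addq %2, %1 \n\t"
                "addq %2, %0 \n\t addq %2, %1 \n\t"
                : "+r"(r0), "+r"(r1)
                : "m"(mem));
            count += 8;
        }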
As each Bulldozer core has 2 read ports, I expected an "addq mem, reg ; addq mem, reg" sequence per core to take 1 clock. The results of my measurements are below. Again I saw just half of the ideal, even though the main loop of this benchmark performs no memory (= L2 write-through) writes. This result strengthens the suspicion that the load units are shared.
Store forwarding stalls of Piledriver
Author: | Date: 2014-09-07 10:11
I found a blog post which says that store forwarding on Piledriver is improved over Bulldozer: blog.stuffedcow.net/2014/01/x86-memory-disambiguation/ Unlike page 186 of "microarchitecture.pdf", where you store 32 bits and load the upper 16 bits of those 32, the author of that blog stores 64 bits and loads the upper 32 bits of the 64. He says the load of the upper 32 bits causes no stall. I also reproduced it on an FX-8350 with the code below.
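The original was a standalone GNU as loop (label .L5); roughly equivalent is this gcc inline-assembly version, where the loop count and the variable names are mine. It stores 64 bits and immediately loads the upper 32 bits of the same location, in a dependency chain so that any store-forwarding stall shows up directly in the loop time:

    #include <cstdint>
    #include <cstdio>

    int main() {
        std::uint64_t slot = 0;              // the location that is stored to and reloaded
        std::uint64_t chain = 1;
        for (long i = 0; i < 100000000; ++i) {
            std::uint32_t hi;
            asm volatile(
                "movq %1, (%2)   \n\t"       // 64-bit store
                "movl 4(%2), %0  \n\t"       // 32-bit load of the upper half of that store
                : "=r"(hi)
                : "r"(chain), "r"(&slot)
                : "memory");
            chain += hi;                     // keep each iteration dependent on the load
        }
        std::printf("%llu\n", (unsigned long long)chain);
        return 0;
    }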