Agner's CPU blog

Test results for AMD Bulldozer processor
Author: Agner Date: 2012-03-02 06:57

I have finally had time to test the AMD Bulldozer, after being delayed by other projects.

The AMD Bulldozer is a major redesign of previous microarchitectures. The most notable points are

  • Aggressive power-saving features.
     
  • The chip has 2 to 8 "compute units" with two CPU cores each.
     
  • The code cache, instruction decoder, branch prediction unit and floating point execution unit are shared between two cores, while the level-1 data cache and the integer execution units are separate for each core.
     
  • A level-3 cache is shared between all compute units.
     
  • The pipeline can support 4 instructions per clock cycle.
     
  • Supports AVX instructions. Intel announced the AVX instruction set extension in 2008 and the AMD designers have had very little time to change their plans for the Bulldozer to support the new 256-bit vectors defined by AVX. The Bulldozer splits each 256-bit vector into two 128-bit vectors, as expected, but the throughput is still good because most floating point execution units are doubled so that the two parts can be processed simultaneously.
     
  • The maximum throughput is four 128-bit vectors or two 256-bit vectors per clock cycle if there is an equal mixture of integer vector and floating point vector operations. This throughput will probably be sufficient to service two threads in most cases.
     
  • Supports fused multiply-and-add instructions. These new instructions can do one addition and one multiplication in the same time that it otherwise takes to do one addition or one multiplication. It uses the FMA4 instruction codes designed by Intel, but unfortunately Intel have later changed their plans to FMA3, as discussed on this blog.
     
  • Introduces AMD's new XOP instruction set extension with many useful instructions. Unfortunately, these instructions will rarely be used because they are unlikely to be supported by Intel.
     
  • The 3DNow instruction set is no longer supported. I don't think anybody will miss it.
     
  • Improved branch prediction with two-level branch target buffer.
     
  • Register-to-register moves are translated into register renaming with zero latency. For years, I have wondered why no CPU did this (except for the FXCH instruction). Now the Bulldozer is the first x86 processor to implement this feature. It works very well with four register renamings per clock cycle, but only for 128-bit registers, not for general purpose registers, x87 registers or 256-bit registers.
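The vector throughput figures in the list above can be checked with a little arithmetic. The following is only a back-of-envelope sketch: the constants come from the text, while the function names and the even split between two threads are assumptions made for illustration, not measured behaviour.

```python
# Back-of-envelope arithmetic for the vector throughput figures quoted above.
# All names are hypothetical; only the constants come from the text.

VECTOR_BITS_128 = 128
VECTOR_BITS_256 = 256

MAX_128_OPS_PER_CLOCK = 4  # from the text: four 128-bit vectors per clock
MAX_256_OPS_PER_CLOCK = 2  # from the text: two 256-bit vectors per clock


def halves_per_256_bit_op() -> int:
    """Number of 128-bit parts that one 256-bit AVX vector is split into."""
    return VECTOR_BITS_256 // VECTOR_BITS_128


def peak_bits_per_clock(ops_per_clock: int, vector_bits: int) -> int:
    """Peak number of vector bits processed per clock cycle."""
    return ops_per_clock * vector_bits


def peak_bits_per_thread(total_bits: int, active_threads: int) -> float:
    """Assumption: threads sharing the unit split the peak evenly."""
    return total_bits / active_threads


if __name__ == "__main__":
    as_128 = peak_bits_per_clock(MAX_128_OPS_PER_CLOCK, VECTOR_BITS_128)
    as_256 = peak_bits_per_clock(MAX_256_OPS_PER_CLOCK, VECTOR_BITS_256)
    print(halves_per_256_bit_op())           # 2
    print(as_128, as_256)                    # 512 512
    print(peak_bits_per_thread(as_128, 2))   # 256.0
```

Both ways of counting give the same 512 bits per clock cycle, which is why splitting 256-bit vectors into two halves does not by itself reduce the peak throughput. That peak applies only to the equal mixture of integer vector and floating point vector operations mentioned above.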

The test results are mostly good and many weaknesses of previous designs have been eliminated. However, there are still some weak points and bottlenecks that need to be mentioned:

  • The power-saving features reduce the clock frequency most of the time. This often gives low and inconsistent results in benchmark tests because the clock frequency varies.
     
  • Some operating systems are not aware that the chip shares certain resources between the two cores that make up a compute unit. The consequence is that the operating system may put two threads into one compute unit while another unit is idle, or it may put two threads with different priority into the same compute unit so that a low priority thread can steal resources from a high priority thread. I don't understand why there is no CPUID function for telling which resources are shared between CPU cores. The current solution where the operating system must know the details of every CPU on the market is not practical, and it does not work with virtual CPUs etc.
     
  • The shared instruction fetch unit can fetch up to 32 bytes per clock cycle or 16 bytes per core. This may be a bottleneck when both cores are active and when frequent jumps produce bubbles in the pipeline.
     
  • The decode unit can handle four instructions per clock cycle. It alternates between the two threads so that each thread gets two instructions per clock cycle on average. This is a serious bottleneck because the rest of the pipeline can handle up to four instructions per clock.
     
  • Cache bank conflicts in the data cache are so frequent that they seriously degrade the performance in some tests.
     
  • The code cache has only two ways, which may be insufficient to service two simultaneous threads.
     
  • The long pipeline causes long branch misprediction penalties.
     
  • The pipelines can handle four instructions per clock cycle, but there are only two integer ALUs where previous processors had three. This means that two of the four pipeline lanes will be idle most of the time in integer code.
     
  • Some floating point operations, such as shuffle, blend and booleans, are executed in the integer vector units. This causes an extra transport delay between the floating point vector unit and the integer vector unit.
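To illustrate why thread placement matters for the shared front end, here is a minimal sketch of a toy model. It is not how any real operating system scheduler works. The core numbering (cores 2u and 2u + 1 belong to compute unit u), the function names, and the idea that a lone thread gets the full shared rate are all assumptions made for illustration. Only the widths (4 instructions decoded and 32 bytes fetched per clock cycle per compute unit) come from the text.

```python
from typing import Dict, List

CORES_PER_UNIT = 2  # from the text: two cores per compute unit
DECODE_WIDTH = 4    # instructions per clock, shared within a compute unit
FETCH_BYTES = 32    # bytes per clock, shared within a compute unit


def place_threads(n_threads: int, n_units: int, spread: bool) -> List[int]:
    """Return one core id per thread.

    Assumed (hypothetical) numbering: cores 2*u and 2*u + 1 form unit u.
    spread=True uses one core in every unit before any second core.
    spread=False fills both cores of a unit before moving to the next.
    """
    n_cores = n_units * CORES_PER_UNIT
    if n_threads > n_cores:
        raise ValueError("more threads than cores")
    if spread:
        order = [u * CORES_PER_UNIT + c
                 for c in range(CORES_PER_UNIT)
                 for u in range(n_units)]
    else:
        order = list(range(n_cores))
    return order[:n_threads]


def shared_rate_per_thread(core_ids: List[int], unit_rate: float) -> List[float]:
    """Average share of a per-unit resource available to each thread.

    Assumption: threads in the same unit split the resource evenly,
    and a thread alone in its unit gets all of it.
    """
    threads_in_unit: Dict[int, int] = {}
    for core in core_ids:
        unit = core // CORES_PER_UNIT
        threads_in_unit[unit] = threads_in_unit.get(unit, 0) + 1
    return [unit_rate / threads_in_unit[core // CORES_PER_UNIT]
            for core in core_ids]


if __name__ == "__main__":
    packed = place_threads(2, n_units=4, spread=False)
    spread = place_threads(2, n_units=4, spread=True)
    print(packed, shared_rate_per_thread(packed, DECODE_WIDTH))  # [0, 1] [2.0, 2.0]
    print(spread, shared_rate_per_thread(spread, DECODE_WIDTH))  # [0, 2] [4.0, 4.0]
    print(shared_rate_per_thread(packed, FETCH_BYTES))           # [16.0, 16.0]
```

In this toy model, two threads packed into one compute unit each get half of the decode and fetch bandwidth, while the same two threads spread over two units each get all of it. The model deliberately ignores everything else in the list above, such as the two integer ALUs per core, cache bank conflicts and the two-way code cache.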
 