I have now found the time to test the AMD Bulldozer after being delayed by
other projects.
The AMD Bulldozer is a major redesign of previous microarchitectures. The
most notable points are:
- Aggressive power-saving features.
- The chip has two to eight "compute units" with two CPU cores each.
- The code cache, instruction decoder, branch prediction unit and floating
point execution unit are shared between two cores, while the level-1 data
cache and the integer execution units are separate for each core.
- A level-3 cache is shared between all compute units.
- The pipeline can support 4 instructions per clock cycle.
- Supports AVX instructions. Intel announced the AVX instruction set
extension in 2008, which gave the AMD designers very little time to change
their plans for the Bulldozer to support the new 256-bit vectors defined by
AVX. The Bulldozer splits each 256-bit vector into two 128-bit vectors, as
expected, but the throughput is still good because most floating point
execution units are doubled so that the two halves can be processed
simultaneously.
- The maximum throughput is four 128-bit vectors or two 256-bit vectors per
clock cycle if there is an equal mixture of integer vector and floating
point vector operations. This throughput will probably be sufficient to
service two threads in most cases.
- Supports fused multiply-and-add instructions. These new instructions can
do one addition and one multiplication in the same time that it otherwise
takes to do one addition or one multiplication. It uses the FMA4
instruction codes designed by Intel, but unfortunately Intel has since
changed its plans to FMA3, as discussed on this
blog.
- Introduces AMD's new XOP instruction set extension with many useful
instructions. Unfortunately, these instructions will rarely be used because
they are unlikely to be supported by Intel.
- The 3DNow instruction set is no longer supported. I don't think anybody
will miss it.
- Improved branch prediction with two-level branch target buffer.
- Register-to-register moves are translated into register renaming with zero
latency. For years, I have wondered why no CPU did this (except for the FXCH
instruction). Now the Bulldozer is the first x86 processor to implement this
feature. It works very well with four register renamings per clock cycle,
but only for 128-bit registers, not for general purpose registers, x87
registers or 256-bit registers.
The test results are mostly good and many weaknesses of previous designs have
been eliminated. However, there are still some weak points and bottlenecks that
need to be mentioned:
- The power-saving features reduce the clock frequency most of the
time. This often gives low and inconsistent results in benchmark tests
because the clock frequency varies.
- Some operating systems are not aware that the chip shares certain
resources between the two cores that make up a compute unit. The consequence
is that the operating system may put two threads into one compute unit while
another unit is idle, or it may put two threads with different priority into
the same compute unit so that a low priority thread can steal resources from
a high priority thread. I don't understand why there is no CPUID function
for telling which resources are shared between CPU cores. The current
solution where the operating system must know the details of every CPU on
the market is not practical, and it does not work with virtual CPUs etc.
- The shared instruction fetch unit can fetch up to 32 bytes per clock cycle
or 16 bytes per core. This may be a bottleneck when both cores are active
and when frequent jumps produce bubbles in the pipeline.
- The decode unit can handle four instructions per clock cycle. It
alternates between the two threads, so each thread gets only two
instructions per clock cycle on average. This is a serious bottleneck
because the rest of the pipeline can handle up to four instructions per
clock.
- Cache bank conflicts in the data cache are so frequent that they seriously
degrade performance in some tests.
- The code cache has only two ways, which may be insufficient to service two
simultaneous threads.
- The long pipeline causes long branch misprediction penalties.
- The pipelines can handle four instructions per clock cycle, but there are
only two integer ALUs where previous processors had three. This means that
two of the four pipeline lanes will be idle most of the time in integer code.
- Some floating point operations, such as shuffle, blend and booleans, are
executed in the integer vector units. This causes an extra transport delay
between the floating point vector unit and the integer vector unit.