I have now had an opportunity to test the new Sandy Bridge processor from
Intel, and the results are very interesting. There are many improvements - and a
few drawbacks. I have updated my manuals with the details, but let me just
summarize the main findings here:
- New micro-op cache
- The decoders translate the CISC style instructions to RISC style
micro-operations. The Sandy Bridge has a new cache for storing decoded
micro-operations after the decoders, but the traditional code cache before
the decoders is still there. The micro-op cache turns out to be very efficient in my tests. It is easy to
obtain a throughput of 4, or even 5, instructions per clock cycle as long as
the code fits into the micro-op cache.
- Decoders
- While the throughput is improved quite a lot for code that fits into the
micro-op cache, it is not improved in situations where the critical code
is too big for the micro-op cache (but not too big for the level-1 code
cache). The decoders in the Sandy Bridge are almost identical to the
design in previous processors with the same limitation of 16 bytes per clock
cycle. The maximum throughput of 4 or 5
instructions per clock cycle is rarely obtained. The difference in performance
between code that fits into the micro-op cache and code that doesn't makes the
micro-op cache a precious resource. It is so important to economize on the use
of the micro-op cache that I would advise never to unroll loops.
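As a rough illustration of this advice, here is a C sketch (the function names are my own, and whether a given compiled loop actually fits the micro-op cache depends on the compiler and the surrounding code): the rolled loop has a small code footprint, while the manually unrolled version computes the same result with roughly four times the code size.

```c
#include <stddef.h>

/* Rolled loop: small code footprint, likely to fit in the micro-op cache. */
long sum_rolled(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Manually unrolled by 4: same result, but roughly four times the code
   size. On Sandy Bridge the larger footprint can push a loop out of the
   micro-op cache, so the unrolled form may actually run slower. */
long sum_unrolled(const int *a, size_t n) {
    long s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)           /* remainder iterations */
        s += a[i];
    return s;
}
```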
- Macro-fusion
- There is one improvement in the decoders, though. It is possible to fuse
two instructions into one micro-op in more cases than before. For example,
an ADD or SUB instruction can be fused with a conditional jump into one
micro-op. This makes it possible to make a loop where the overhead of the
loop counter and exit condition is just one micro-op.
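A C sketch of the loop pattern that benefits (the function name is my own illustration): compilers typically lower the loop control below to a SUB or DEC of the counter followed by a conditional jump, exactly the instruction pair that Sandy Bridge can fuse into a single micro-op.

```c
/* A countdown loop. Compilers typically compile the loop control to a
   SUB (or DEC) of the counter followed by a JNZ back to the top; on
   Sandy Bridge these two instructions can be macro-fused into a single
   micro-op, so the whole loop overhead costs one micro-op per iteration. */
long sum_countdown(const int *a, int n) {
    long s = 0;
    while (n > 0) {
        n--;            /* SUB/DEC of the loop counter ... */
        s += a[n];
    }                   /* ... JNZ: fused with the SUB on Sandy Bridge */
    return s;
}
```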
- Branch prediction
- The branch predictor has bigger history buffers than in previous
processors, but the special loop predictor is no longer there. The misprediction penalty is
somewhat shorter for code that resides in the micro-op cache.
- AVX instruction set
- The new AVX instruction set extends the vector registers from 128 bits to
256 bits. The floating point execution units have full 256-bit bandwidth.
This means that you can do calculations on vectors of eight single-precision
or four double-precision numbers with a throughput of one vector addition
and one vector multiplication per clock cycle. I found that this doubled
throughput is obtained only after a warm-up period of several hundred
floating point operations. In the "cold" state, the throughput is
only half this value, and the latencies are one or two clocks longer. My
guess is that the Sandy Bridge is saving power by turning off the most
expensive execution units when they are not needed, and it turns on the full
execution power only when the load is heavy. This is my guess only - I have
found no official mention of this warm-up effect.
Another advantage of the AVX instruction set is that all vector instructions now have a non-destructive version with three operands
where the destination is stored in a separate register. Instead of A = A +
B, we now have C = A + B, so that the value of A is not overwritten by the
result. This saves a lot of register moves.
A disadvantage of the AVX instruction set is that all vector instructions
now have two versions, a non-destructive AVX version and a two-operand
non-AVX version, and you are not supposed to mix these two versions. If the
programmer inadvertently mixes AVX and non-AVX vector instructions in the same
code, there is a penalty of 70 clock cycles for each transition between the two
forms. I bet that this will be a very common programming error in the future
- and an error that is quite difficult to detect because the code still
works, albeit slower.
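The three-operand idea can be sketched at the scalar C level (the function names are my own illustration, not actual instructions or intrinsics): with two-operand semantics, preserving the first source costs an extra register move, which the non-destructive form makes unnecessary.

```c
/* Two-operand, destructive style (as in SSE, where ADDPS xmm1, xmm2
   computes xmm1 = xmm1 + xmm2). Keeping the old value of a requires an
   extra register move before the addition. */
float add_destructive(float a, float b, float *a_saved) {
    *a_saved = a;     /* the extra move that AVX makes unnecessary */
    a = a + b;        /* a is overwritten by the result */
    return a;
}

/* Three-operand, non-destructive style (as in AVX, where
   VADDPS ymm0, ymm1, ymm2 computes ymm0 = ymm1 + ymm2).
   Both sources survive; no extra move is needed. */
float add_nondestructive(float a, float b) {
    return a + b;     /* result goes to a separate destination */
}
```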
- More memory ports
- The Sandy Bridge has two memory read ports where previous Intel processors
have only one. The maximum throughput is now 256 bits read and 128 bits
write per clock cycle. The flipside of this coin is that the risk of
contentions in the data cache increases when there are more memory
operations per clock cycle. In my tests, it was quite difficult to maintain
the maximum read and write throughput without being delayed by cache bank
conflicts.
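A sketch of the kind of access pattern involved (the function names are my own): each iteration issues two reads and one write, the same mix of memory operations that the two read ports and one write port can sustain per clock cycle, bank conflicts permitting.

```c
#include <stddef.h>

/* Two reads (a[i], b[i]) and one write (c[i]) per iteration - the mix
   that Sandy Bridge's two read ports plus one write port can sustain
   per clock, provided the two simultaneous reads avoid cache bank
   conflicts. */
void add_arrays(const int *a, const int *b, int *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Small self-check helper: returns the last element of c after the
   addition. Assumes n <= 8 for this illustration. */
int add_arrays_last(const int *a, const int *b, size_t n) {
    int c[8];
    add_arrays(a, b, c, n);
    return c[n - 1];
}
```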
- Misaligned memory operands handled efficiently
- On the Sandy Bridge, there is no performance penalty for reading or
writing misaligned memory operands, except for the fact that it uses more
cache banks so that the risk of cache conflicts is higher when the operand
is misaligned. Store-to-load forwarding also works with misaligned operands
in most cases.
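A minimal C sketch of an unaligned load (my own illustration; the memcpy is the portable way to express it, and the expected value below assumes a little-endian machine such as x86):

```c
#include <string.h>
#include <stdint.h>

/* Load a 32-bit value from an arbitrary, possibly misaligned, byte
   address. memcpy is the portable way to express this; compilers reduce
   it to a single unaligned MOV. On Sandy Bridge such a load pays no
   misalignment penalty as such, only a higher chance of touching an
   extra cache bank. */
uint32_t load_u32(const void *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```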
- Register read ports
- Previous Intel processors have a serious - and often neglected -
bottleneck in the register read ports. Ever since the Pentium Pro processor
back in 1995, the Intel family 6 processors have had a limitation of 2 or 3
reads from the permanent register file per clock cycle. This bottleneck has
finally been removed in the Sandy Bridge.
- Zeroing instructions
- An instruction that subtracts a register from itself will always give
zero, regardless of the previous value of the register. This is
traditionally a common way of setting a register to zero. Many modern
processors recognize that this instruction doesn't have to wait for the
previous value of the register. What is new in the Sandy Bridge is that it
doesn't even execute this instruction. The register allocator simply
allocates a new empty register for the result without even sending it to the
execution units. This means that you can do four zeroing instructions per
clock cycle without using any execution resources. NOPs are treated in
the same efficient way without using any execution unit.
This technique is not new, actually. It has been used for many years with
the FXCH instruction (exchange floating point registers). There are special
reasons for resolving the FXCH instruction in the register allocator/renamer,
but it is funny that this technique hasn't been extended to other uses until
now. It would be obvious to use this technique for register-to-register
moves too, but so far we have not seen such an application.
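A minimal sketch of the zeroing idiom at the C level (my own illustration; whether the compiler emits XOR or SUB is up to it):

```c
/* x ^ x (like XOR reg,reg or SUB reg,reg at the machine level) is zero
   regardless of the previous value of x. Compilers emit exactly this
   idiom for setting an integer register to zero, and Sandy Bridge
   resolves it in the register allocator without sending anything to
   the execution units. */
int zero_idiom(int x) {
    return x ^ x;     /* always 0, independent of x */
}
```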
- Data transport delay
- Most modern processors have different execution unit clusters or domains
for different types of data or different types of registers, e.g. integer
and floating point. Many processors have a delay of one or two clocks for moving
data from one such domain to another. These delays are diminished in the
Sandy Bridge and in some cases completely removed. I found that it is
possible to move data between integer registers and vector registers without
any delay.
- Writeback conflicts
- When two micro-operations with different latencies run in the same
execution port then they may both finish at the same time. This leads to a
conflict when both need the writeback port and the result bus at the same
time. Both Intel and AMD processors have this problem. The Sandy Bridge can
avoid most writeback conflicts by fixing execution latencies to standard
values, by allowing writeback to different execution domains simultaneously,
and by delaying writeback when there is a conflict.
- Floating point underflow and denormal numbers
- Denormal numbers are floating point numbers that are coded in a special
way, defined by the IEEE 754 standard, for values close to
underflow. Most processors are unable to handle floating point
underflow, denormal numbers, and other special cases in the general floating
point execution units. These special cases are typically handled by
microcode exceptions at the cost of 150 - 200 clocks per instruction. The
Sandy Bridge can handle many of these special cases in hardware without any
penalty. In my tests, the cases of underflow and denormal numbers were
handled just as fast as normal floating point numbers for addition, but not for multiplication.
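A small C sketch showing how a denormal operand arises (my own illustration; it assumes default rounding and no flush-to-zero compiler flags):

```c
#include <float.h>

/* Halving the smallest normal double yields a denormal (subnormal)
   number. On earlier processors, adding such numbers triggers a
   microcode assist costing on the order of 150 - 200 clocks; Sandy
   Bridge handles denormal addition at full speed, while multiplication
   still takes the slow path. */
double make_denormal(void) {
    return DBL_MIN / 2.0;   /* below the normal range, but nonzero */
}
```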
My conclusion is that the Sandy Bridge processor has many significant
improvements over previous processors. The most serious bottlenecks and
weaknesses of previous processors have been removed. The micro-op cache turns
out to be an important improvement for relatively small loops. Unfortunately, the poor performance of the decoders has not been improved. This remains a likely bottleneck for code that doesn't fit into the micro-op cache.
The decoding of instruction lengths has been a problem in Intel processors
for many years. They tried to fix the problem with the trace cache in the
Pentium 4, which turned out to be a dead end, and now with the apparently more
successful micro-op cache in the Sandy Bridge. AMD have solved the problem of
detecting instruction lengths in their processors by marking instruction
boundaries in the code cache. Intel did the same in the Pentium MMX back in
1996, and it is a mystery to me why they are not using this solution today.
There would hardly be a need for the micro-op cache if they had instruction
boundaries marked in the code cache.
Whenever the narrowest bottleneck of a system is removed, the next bottleneck
becomes visible. This is also the case here. As the memory read bandwidth is
doubled, the risk of cache bank conflicts increases. Cache conflicts were
actually the limiting factor in some of my tests.
It has struck me that the new Sandy Bridge design is actually under-hyped. I
would expect a new processor design with so many improvements to be advertised
aggressively, but the new design doesn't even have an official brand name. The
name Sandy Bridge is only an unofficial code name. In Intel documents it is
variously referred to as "second generation Intel Core processors",
"2xxx series", and "Intel microarchitecture code name Sandy
Bridge". I have never understood what happens in Intel's marketing
department. They keep changing their nomenclature, and they use the same brand names
for radically different technical designs. In this case they have no reason to
obscure technical differences. How can they cash in on the good reputation of
the Sandy Bridge design when it doesn't even have an official name?
[Corrected on June 08, 2011, and Mar 2, 2012].