Agner's CPU blog


Thread index:
  Test results for Intel's Sandy Bridge processor - Agner - 2011-01-30
    Test results for Intel's Sandy Bridge processor - PaulR - 2011-02-15
    AVX2 - phis - 2011-06-23
      AVX2 - Agner - 2011-06-23
    Test results for Intel's Sandy Bridge processor - anon - 2013-08-01
      Test results for Intel's Sandy Bridge processor - Agner - 2013-08-06
      Test results for Intel's Sandy Bridge processor - anon - 2013-08-07
      Test results for Intel's Sandy Bridge processor - Agner - 2013-08-07
      Test results for Intel's Sandy Bridge processor - anon - 2013-08-07
      Test results for Intel's Sandy Bridge processor - Agner - 2013-08-08
      Test results for Intel's Sandy Bridge processor - anon - 2013-08-08
      Test results for Intel's Sandy Bridge processor - Agner - 2013-08-09
      Test results for Intel's Sandy Bridge processor - anon - 2013-08-09
      Test results for Intel's Sandy Bridge processor - Agner - 2013-08-10
      Test results for Intel's Sandy Bridge processor - Agner - 2013-08-10
    Test results for Intel's Sandy Bridge processor - John D. McCalpin - 2013-10-09
      Test results for Intel's Sandy Bridge processor - Agner - 2013-10-10
      Test results for Intel's Sandy Bridge processor - John D. McCalpin - 2013-10-11
      SB's L1D banks - Tacit Murky - 2013-11-03
        SB's L1D banks - John D. McCalpin - 2013-11-07
Test results for Intel's Sandy Bridge processor
Author: Agner Date: 2011-01-30 11:15

I have now got an opportunity to test the new Sandy Bridge processor from Intel, and the results are very interesting. There are many improvements - and few drawbacks. I have updated my manuals with the details, but let me just summarize the main findings here:

New micro-op cache
The decoders translate the CISC style instructions to RISC style micro-operations. The Sandy Bridge has a new cache for storing decoded micro-operations after the decoders, but the traditional code cache before the decoders is still there. The micro-op cache turns out to be very efficient in my tests. It is easy to obtain a throughput of 4, or even 5, instructions per clock cycle as long as the code fits into the micro-op cache.  
While the throughput is improved quite a lot for code that fits into the micro-op cache, it is not improved in situations where the critical code is too big for the micro-op cache (but not too big for the level-1 code cache). The decoders in the Sandy Bridge are almost identical to the design in previous processors with the same limitation of 16 bytes per clock cycle. The maximum throughput of 4 or 5 instructions per clock cycle is rarely obtained. The difference in performance between code that fits into the micro-op cache and code that doesn't makes the micro-op cache a precious resource. It is so important to economize the use of the micro-op cache that I would give the advice never to unroll loops.
There is one improvement in the decoders, though. It is possible to fuse two instructions into one micro-op in more cases than before. For example, an ADD or SUB instruction can be fused with a conditional jump into one micro-op. This makes it possible to make a loop where the overhead of the loop counter and exit condition is just one micro-op.
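As an illustration (my own sketch, not from the post), a counted loop in which a SUB instruction fuses with the conditional jump could look like this:

```nasm
; Sketch: on Sandy Bridge, SUB + JNZ can fuse into a single micro-op,
; so the entire loop overhead costs just one micro-op per iteration.
        mov   ecx, 1000          ; iteration count
LoopTop:
        addps xmm0, [rsi]        ; example loop body
        add   rsi, 16
        sub   ecx, 1             ; fuses with the following branch
        jnz   LoopTop            ; SUB + JNZ issue as one fused micro-op
```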
Branch prediction
The branch predictor has bigger history buffers than in previous processors, but the special loop predictor is no longer there. The misprediction penalty is somewhat shorter for code that resides in the micro-op cache.
AVX instruction set
The new AVX instruction set extends the vector registers from 128 bits to 256 bits. The floating point execution units have full 256-bit bandwidth. This means that you can do calculations on vectors of eight single-precision or four double-precision numbers with a throughput of one vector addition and one vector multiplication per clock cycle. I found that this doubled throughput is obtained only after a warm-up period of several hundred floating point operations. In the "cold" state, the throughput is only half this value, and the latencies are one or two clocks longer. My guess is that the Sandy Bridge is saving power by turning off the most expensive execution units when they are not needed, and it turns on the full execution power only when the load is heavy. This is my guess only - I have found no official mention of this warm-up effect.
Another advantage of the AVX instruction set is that all vector instructions now have a non-destructive version with three operands where the destination is stored in a separate register. Instead of A = A + B, we now have C = A + B, so that the value of A is not overwritten by the result. This saves a lot of register moves.
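A sketch of the difference between the destructive two-operand SSE form and the non-destructive three-operand AVX form:

```nasm
; SSE, two operands: the destination is also a source
movaps xmm2, xmm0          ; extra move needed to preserve xmm0
addps  xmm2, xmm1          ; xmm2 = xmm0 + xmm1

; AVX, three operands: no preserving move needed
vaddps xmm2, xmm0, xmm1    ; xmm2 = xmm0 + xmm1, xmm0 unchanged
```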
A disadvantage of the AVX instruction set is that all vector instructions now have two versions, a non-destructive AVX version and a two-operand non-AVX version, and you are not supposed to mix these two versions. If the programmer inadvertently mixes AVX and non-AVX vector instructions in the same code then there is a penalty of 70 clock cycles for each transition between the two forms. I bet that this will be a very common programming error in the future - and an error that is quite difficult to detect because the code still works, albeit slower.
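The usual way to avoid the transition penalty is to execute VZEROUPPER after AVX code, before any legacy (non-VEX) vector code can run; a minimal sketch:

```nasm
vaddps ymm0, ymm1, ymm2    ; 256-bit AVX code
vzeroupper                 ; zero the upper halves of all YMM registers
                           ; before any non-AVX vector code executes
addps  xmm3, xmm4          ; legacy SSE code, now without the penalty
```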
More memory ports
The Sandy Bridge has two memory read ports where previous Intel processors have only one. The maximum throughput is now 256 bits read and 128 bits write per clock cycle. The flipside of this coin is that the risk of contentions in the data cache increases when there are more memory operations per clock cycle. In my tests, it was quite difficult to maintain the maximum read and write throughput without being delayed by cache bank conflicts.
Misaligned memory operands handled efficiently
On the Sandy Bridge, there is no performance penalty for reading or writing misaligned memory operands, except for the fact that it uses more cache banks so that the risk of cache conflicts is higher when the operand is misaligned. Store-to-load forwarding also works with misaligned operands in most cases.
Register read ports
Previous Intel processors have a serious - and often neglected - bottleneck in the register read ports. Ever since the Pentium Pro processor back in 1995, the Intel family 6 processors have had a limitation of 2 or 3 reads from the permanent register file per clock cycle. This bottleneck has finally been removed in the Sandy Bridge.
Zeroing instructions
An instruction that subtracts a register from itself will always give zero, regardless of the previous value of the register. This is traditionally a common way of setting a register to zero. Many modern processors recognize that this instruction doesn't have to wait for the previous value of the register. What is new in the Sandy Bridge is that it doesn't even execute this instruction. The register allocator simply allocates a new empty register for the result without even sending it to the execution units. This means that you can do four zeroing instructions per clock cycle without using any execution resources. NOPs are treated in the same efficient way without using any execution unit.
This technique is not new, actually. It has been used for many years with the FXCH instruction (exchange floating point registers). There are special reasons for resolving the FXCH instruction in the register allocator/renamer, but it is funny that this technique hasn't been extended to other uses until now. It would be obvious to use this technique for register-to-register moves too, but so far we have not seen such an application.
Data transport delay
Most modern processors have different execution unit clusters or domains for different types of data or different types of registers, e.g. integer and floating point. Many processors have a delay of one or two clocks for moving data from one such domain to another. These delays are diminished in the Sandy Bridge and in some cases completely removed. I found that it is possible to move data between integer registers and vector registers without any delay.
Writeback conflicts
When two micro-operations with different latencies run in the same execution port then they may both finish at the same time. This leads to a conflict when both need the writeback port and the result bus at the same time. Both Intel and AMD processors have this problem. The Sandy Bridge can avoid most writeback conflicts by fixing execution latencies to standard values, by allowing writeback to different execution domains simultaneously, and by delaying writeback when there is a conflict.
Floating point underflow and denormal numbers
Denormal numbers are floating point numbers coded in a non-normal way, used when the value is close to underflow, as specified in the IEEE 754 standard. Most processors are unable to handle floating point underflow, denormal numbers, and other special cases in the general floating point execution units. These special cases are typically handled by microcode exceptions at the cost of 150 - 200 clocks per instruction. The Sandy Bridge can handle many of these special cases in hardware without any penalty. In my tests, the cases of underflow and denormal numbers were handled just as fast as normal floating point numbers for addition, but not for multiplication.

My conclusion is that the Sandy Bridge processor has many significant improvements over previous processors. The most serious bottlenecks and weaknesses of previous processors have been removed. The micro-op cache turns out to be an important improvement for relatively small loops. Unfortunately, the poor performance of the decoders has not been improved. This remains a likely bottleneck for code that doesn't fit into the micro-op cache.

The decoding of instruction lengths has been a problem in Intel processors for many years. They tried to fix the problem with the trace cache in the Pentium 4, which turned out to be a dead end, and now with the apparently more successful micro-op cache in the Sandy Bridge. AMD have solved the problem of detecting instruction lengths in their processors by marking instruction boundaries in the code cache. Intel did the same in the Pentium MMX back in 1996, and it is a mystery to me why they are not using this solution today. There would hardly be a need for the micro-op cache if they had instruction boundaries marked in the code cache.

Whenever the narrowest bottleneck of a system is removed then the next less narrow bottleneck becomes visible. This is also the case here. As the memory read bandwidth is doubled, the risk of cache bank conflicts is increased. Cache conflicts were actually the limiting factor in some of my tests.

It has struck me that the new Sandy Bridge design is actually under-hyped. I would expect a new processor design with so many improvements to be advertised aggressively, but the new design doesn't even have an official brand name. The name Sandy Bridge is only an unofficial code name. In Intel documents it is variously referred to as "second generation Intel Core processors", "2xxx series", and "Intel microarchitecture code name Sandy Bridge". I have never understood what happens in Intel's marketing department. They keep changing their nomenclature, and they use the same brand names for radically different technical designs. In this case they have no reason to obscure technical differences. How can they cash in on the good reputation of the Sandy Bridge design when it doesn't even have an official name?

[Corrected on June 08, 2011, and Mar 2, 2012].

Test results for Intel's Sandy Bridge processor
Author: PaulR Date: 2011-02-15 11:09
Hi Agner - have you noticed that Turbo Boost is much more effective on Sandy Bridge than on Nehalem ? On a 4 core 3.4 GHz SB with all cores working flat out I'm seeing the clock speed staying at 4.3 GHz for 20+ minutes. This seems to suggest that there is 25% extra performance to be had for free, so long as you have sufficient cooling.
Author: phis Date: 2011-06-23 01:13
Thanks for your detailed analysis, this is very useful indeed.

Have you seen the updated Intel Advanced Vector Extensions Programming Reference (June 2011)? There are interesting things in there, including AVX2 (256-bit integer AVX instructions) and some VEX-encoded general-purpose instructions for bit manipulation et al.

Author: Agner Date: 2011-06-23 11:35
Thanks for the reference. I always expected that there would be an AVX2 with 256 bit integer vector instructions.

The most surprising extension is the VGATHER.. instructions that allow vectorized table-lookup. Lookup tables have always been an obstacle to vectorization. I wonder how efficient it will be, though. The performance will still be limited by the number of address-generation units and read ports in the CPU.

The physical random number generator instruction (RDRAND) has been announced previously. It is strongly needed for cryptographic and security applications. The VIA processors have had such an instruction for years now.

I will update my "objconv" disassembler with the new instructions when I get the time.

Test results for Intel's Sandy Bridge processor
Author: anon Date: 2013-08-01 06:26
Thank you very much for the good analysis.

There is one restriction that isn't mentioned in your document. In Sandy Bridge and later processors, instructions to which macro-op fusion can be applied (add, sub, and, cmp, test, inc, dec) seem to be decoded only by the simple decoders (3 of the 4). This restriction does not exist in Nehalem or earlier processors.

Of course there is the decoded uop cache, and the OoO backend executes these instructions at a throughput of 3 per cycle, so it should have little impact on real-world performance. But it might be a different story on Haswell, which has wider execution ports.

Test results for Intel's Sandy Bridge processor
Author: Agner Date: 2013-08-06 01:47
How do you know, Anon?
Please don't post unverified claims anonymously.
Test results for Intel's Sandy Bridge processor
Author: anon Date: 2013-08-07 07:19
500 iterations of this code sequence (4,000 instructions, does not fit into the uop cache):

or rax, 1
or rdx, 1
or rsi, 1
or rdi, 1
or r8, 1
or r9, 1
movaps xmm0, [r10]
movaps xmm1, [r11]

runs at 2 clocks / 8 instructions (as expected). But if we change the 6 ORs into AND (or other macro-fusable instructions), it drops to 2.5 clocks / 8 instructions.

It means that the decoder cannot handle four macro-fusable instructions at the same clock cycle.

Test results for Intel's Sandy Bridge processor
Author: Agner Date: 2013-08-07 11:11
anon wrote:
But if we change 6 ORs into AND(or other macro-fusable instructions), it drops to 2.5 clocks / 8 instructions. It means that the decoder cannot handle four macro-fusable instructions at the same clock cycle.
I get 2.45 clock on an Ivy Bridge. I get the same for NOT and NEG, which are not fusable. There is nothing the instructions can actually fuse with, though.
Test results for Intel's Sandy Bridge processor
Author: anon Date: 2013-08-07 11:49
Agner wrote:
I get the same for NOT and NEG, which are not fusable.
Repeating 6 not/neg (2 or 3 bytes x 6) will be affected by the predecoder's limitation. To avoid that, this code sequence will be helpful:

not rax
not rdx
not rsi
not rdi
or r8, 1
or r9, 1
movaps xmm0, [r10]
movaps xmm1, [r11]

This runs at 2 clocks / 8 insts. But

and rax, rax
and rdx, rdx
and rsi, rsi
and rdi, rdi
or r8, 1
or r9, 1
movaps xmm0, [r10]
movaps xmm1, [r11]

this doesn't.

Test results for Intel's Sandy Bridge processor
Author: Agner Date: 2013-08-08 01:38
anon wrote:
Repeating 6 not/neg (2 or 3 bytes x 6) will be affected by the predecoder's limitation.
Is there a limitation on decoding short instructions? Is this documented anywhere?
I have observed on the Haswell that conditional move instructions, which generate 2 microops, decode at two per clock only when I add prefixes to make the instructions 4 bytes long. This applies also when the microop cache is used.
Test results for Intel's Sandy Bridge processor
Author: anon Date: 2013-08-08 04:56
Agner wrote:
Is there a limitation on decoding short instructions? Is this documented anywhere?
I'm not sure if it really is the predecoder's limitation. For example,

or reg, reg
or reg, reg
or reg, reg
mov reg, [reg]

This code sequence should ideally run at 1 clock / 4 instructions. When I change the instruction length from 2 to 4 bytes using these variants:

or r32, r32 : 2B OR
or r64, r64 : 3B OR
or r64, 1 : 4B OR
mov r32, [reg] : 2B MOV
mov r64, [reg] : 3B MOV
mov r64, [reg+8] : 4B MOV

The results are:

inst.    clock/4insts.
pattern  $miss $hit
-------- ----- -----
2+2+2+2  1.0   1.0
3+2+2+2  1.13  1.13
3+3+2+2  1.25  1.19
3+3+3+2  1.31  1.0
3+3+3+3  1.21  1.15
4+3+3+3  1.16  1.0
4+4+3+3  1.0   1.10
4+4+4+3  1.0   1.16
4+4+4+4  1.0   1.0

So it seems there are some limitations regarding instruction count in a 16B (or larger) code block, for both the legacy decoder and the uop cache.
Test results for Intel's Sandy Bridge processor
Author: Agner Date: 2013-08-09 01:53
This looks like an alignment issue. The code is fetched in 16-bytes blocks. Instructions that cross a 16-bytes boundary (or 32-bytes boundary?) are decoded less efficiently. The µop cache is coupled to the instruction cache with a maximum of three 6-µop entries per 32 bytes block of code. How this translates to inefficiency when instructions with certain lengths execute out of the µop cache, I don't really understand.

I have done some experiments to test your claim that fuseable instructions decode less efficiently:

xchg r8,r9    ; 3 µops. Decodes alone
or eax,eax    ; 1 µop, D0
or ebx,ebx    ; 1 µop, D1
or ecx,ecx    ; 1 µop, D2
or edx,edx    ; 1 µop, D3
This decodes in 2 clocks. If the last OR is changed to an AND, it decodes in 3 clocks. It will not put a fuseable arithmetic/logic instruction in decoder D3 because then it can't check in the same clock cycle if the next instruction is a branch. There is no effect when this executes out of the µop cache.
Test results for Intel's Sandy Bridge processor
Author: anon Date: 2013-08-09 04:50
Interesting. So it sounds like the odd rule also exists in the uop cache territory?

Here is another example:

or rax, 1
or rdx, 1
or rsi, 1
movaps xmm0, [r10]
or rdi, 1
or r8, 1
movaps xmm1, [r11]
or r9, 1

This runs at 2 clocks / 8 instructions regardless of uop cache hit/miss. But if all ORs are changed into AND, it drops to 2.45 clocks / 8 instructions when the code doesn't fit into the uop cache.

Of course,

and rax, 1
and rdx, 1
and rsi, 1
movaps xmm0, [r10]
and rdi, 1
and r8, 1
and r9, 1
movaps xmm1, [r11]

This runs at 2 clocks / 8 instructions without problem.

The result means not only that the decode throughput of the AND instruction is limited to 3 per cycle, but also that the 4-1-1-1 pattern rule applies to it. This makes me believe that macro-fuseable instructions are only handled by the simple decoders.

Test results for Intel's Sandy Bridge processor
Author: Agner Date: 2013-08-10 01:44
There are at least two different issues here. One is, as you suggested, that the fuseable instructions don't go into the last decoder. The other is that short instructions don't go into the µop cache if they generate a total of more than 18 µops per 32 bytes of code. Maybe there is also an alignment issue. We will have to do some more experiments to test this. You can easily make instructions longer (up to 15 bytes) by adding dummy segment prefixes ( db 3EH ).
Test results for Intel's Sandy Bridge processor
Author: Agner Date: 2013-08-10 05:52
Now I have done some tests of the alignment effects. This explains the weird results I have seen earlier where the performance was improved when some instructions were made longer.
mov ebp, 100
align 32
LL:
%rep 100 
                     ;  uops  Bytes
cmove eax,eax        ;    2      3
cmove ebx,ebx        ;    2      3
xchg  r8,r9          ;    3      3
nop7                 ;    1      7
nop7                 ;    1      7
nop8                 ;    1      8
nop                  ;    1      1
;                Total:  11     32
%endrep
dec ebp
jnz LL

This takes almost 4 clocks. When I add a nop after align 32 to change the alignment by one byte, it takes only 3 clocks. The explanation is this. Each µop cache line can take 6 µops. The first two instructions take one µop cache line. The xchg instruction cannot cross a cache line so it starts in a new cache line. The next three instructions go in the same line, and the last nop takes a third line. Then there is a 32-bytes boundary and we start a new cache line. In total we need 300 cache lines, and there are only 256 lines in the µop cache. The loop doesn't fit into the µop cache, so the decoders become the bottleneck. When the alignment is changed, the last nop goes together with the two cmove instructions in the next iteration, and we need only 200 cache lines. Now it fits into the µop cache and the speed goes up. The same can be obtained by lowering the repeat count.

Test results for Intel's Sandy Bridge processor
Author: John D. McCalpin Date: 2013-10-09 13:14
First -- thank you very much for your performance work -- this is by far the most comprehensive and accurate compilation of microarchitecture and performance data that I have been able to find since I left the AMD processor design team at the end of 2008, and it has been very helpful in my ongoing studies of core performance issues. (Most of my prior work has been on memory systems and coherence protocols, but now I am trying to learn more about core microarchitecture and performance and power.)

This note concerns the L1 Data Cache Banking on Intel's Sandy Bridge (and presumably Ivy Bridge) processors.

Intel's Performance Optimization reference manual (document 248966-028, July 2013) says that Sandy Bridge cores will have an L1 Data Cache bank conflict if two loads that are ready to issue in the same cycle target two different cache lines in the same cache set and have matching address bits 5:2.
Actually they say bits 4:2 in one section (page 2-20) and bits 5:2 in another (page 3-43), but an Intel employee confirmed in a forum post that the latter is correct.

This seems odd, since 5:2 is four bits and they are clear in reporting that there are only 8 banks. In the forum posts, the Intel employees were clearly not being permitted to disclose the full details, so my curiosity was aroused.

The example code that they provide (Example 3-37) attempts to load two 32-bit items from the same offset within two different cache lines mapping to the same cache set. This does demonstrate bank conflicts, but not very many. (The loads can dual-issue after the first cycle -- so the code takes 5 cycles to perform the 8 loads instead of 4 cycles.) Repeating the loop a million times and using performance counter event BFh, umask 05h: L1D_BLOCKS.BANK_CONFLICT_CYCLES confirmed the stalls.

Unfortunately the "corrected" version that they provide does not demonstrate that a difference in bits 5:2 will avoid a bank conflict.
Instead, it demonstrates that loading two adjacent 32 bit values from the *same* cache line results in no conflicts -- which is no surprise at all.

So I built code similar to their example, except that all 8 loads were to the same offset in 8 different cache lines that mapped to the same cache set. This gave a measured bank conflict rate close to my estimate of 7/8 (since no stall is counted for the first of the 8 loads and the conflict continues for all loads after the first).

Then I modified the offsets so that the 8 loads were to consecutive 32-bit locations in 8 different cache lines that mapped to the same set. I.e., a stride of 17 32-bit words instead of 16 32-bit words. This gave zero conflicts and directly confirms that a difference in address bit 2 is enough to prevent a bank conflict (at least for 32-bit loads). That is quite an interesting result because it does not fit easily into the model of a cache having 64-bit wide or 128-bit wide banks (as you suggest in section 9.13 of your microarchitecture reference guide).

My current hypothesis is that the cache has 8 banks that are each 32 bits wide, but run at twice the processor core frequency -- giving an effective width of 64 bits, but a granularity of access of 32 bits -- almost the same as having 16 banks. The main idea is that each bank can accept two addresses per cycle and deliver two 32-bit results from different lines, but with the critical limitation that it can only deliver the low-order 32-bits in the first half-cycle and can only deliver the high-order 32-bits in the second half-cycle. This combination of features is the only mechanism I could think of that retains the bank conflict seen when bits 5:2 match but which allows dual issue when bits 5:2 differ.

Technologically, a double-speed cache appears possible -- experiments with the CACTI cache simulator suggest that a 32 KiB cache of similar configuration should be able to run at up to about 7.5 GHz in a 32 nm process technology, with an area similar to what I estimate from the Sandy Bridge die photos.

I have reviewed many of the other possible combinations of alignment for a pair of loads and my hypothesis appears to provide a plausible explanation of the observed behavior. There are some problematic cases with combinations that include a 128-bit load on a 32-bit boundary where my model suggests a bank conflict even when bits 5:2 differ, but I am not sure that Intel's claims about the ability to dual-issue are intended to cover all such misalignments (and I have not coded any of these cases to see if they actually generate bank conflicts).

This is part of work that I am doing to develop a set of microbenchmarks that can be used to document the behavior of hardware performance counters so that I can have some hope of using them to understand application characteristics. I have not had time to review your latency and throughput test codes yet, but I hope that with some modification (mostly controlling where the data is located when data motion instructions are executed) they will be useful in illuminating the specifics of what the performance counters are actually counting....

Test results for Intel's Sandy Bridge processor
Author: Agner Date: 2013-10-10 10:34
Thank you for your comments John.

I think it is unlikely that the cache could be running at double clock frequency. It is too big for that. Some previous models have run the cache at half clock frequency. Maybe your observations have something to do with the fact that the Sandy Bridge has two read ports?

Have you tried on Haswell? It should have no cache bank conflicts.

Test results for Intel's Sandy Bridge processor
Author: John D. McCalpin Date: 2013-10-11 19:11
I am not an SRAM expert, but my experiments with CACTI suggest that a double-speed 32 KiB cache is possible in a 32 nm process. It is certainly possible that I am misunderstanding the results.

It seems to me that the available documentation leaves a lot of uncertainty about how the cache ports relate to the cache SRAM banks. Intel's comments that a 16-Byte read may access as many as three banks strongly implies that the banks are 8 Bytes wide. Another possibility is that the addresses are "swizzled" somehow, but I have been unable to come up with a swizzling scheme that matches Intel's descriptions or my observations. Still a possibility, of course. I did not work on the core microarchitecture when I was at AMD, but my impression was that aligned 16 Byte loads in the Family 10h processors were serviced by a single bank.

We don't have any Haswell systems at TACC --- I think that it will be the second half of next year before the two-socket Haswell-based servers are available. We will probably have access a bit earlier. I also read Intel's comments about the absence of bank conflicts in Haswell, and am looking forward to testing the new technology.

SB's L1D banks
Author: Tacit Murky Date: 2013-11-03 03:29
Hello, John.

In our low-level tests we have confirmed that the L1D has 8-byte banks (that was also confirmed by an SB architecture team engineer), selected by address bits 5:3. Solving the 4-byte access case is easy: the OoO memory access logic (Intel term: MD) will reorder reads to issue them to different banks A+0 & A+8, then A+4 & A+12, then the same for the next 4 reads and 2 banks, etc. (A = line's address). Also, by delaying the 1st access (generating a "conflict" event for the PMC), it's possible to issue all other loads without reordering, still hitting different banks: A+0 & (none), A+4 & A+8, A+12 & A+16.

DDR for cache bit-lines is possible, but it removes the possibility of (and practically the need for) precharge. Without precharge, the bit-lines would have to swing 0<=>1 and back up to twice per clock. That requires fast (HT) transistors with high parameter uniformity (a big problem at 45 nm and below) and, most importantly, would ruin the performance/watt metric for such a cache. Both Intel and AMD are avoiding this at all costs, e.g. converting 6T bit-cells to 8T (for L1s and L2s) just to save power.

But I'm still curious how Intel resolved bank conflicts in Haswell. The naive solution is to make all banks 3-ported (2R+W), which would require 10T cells. But early die shots show only a slightly larger L1D area compared to IB, with the same aspect ratio. Hm?

While we're at it, can I ask why AMD's memory controllers are so slow, especially on writes? They can never achieve even 50% of theoretical peak throughput; Intel can do more. See the AIDA64 cache & memory benchmark results.

SB's L1D banks
Author:  Date: 2013-11-07 16:40
Thanks to Tacit Murky for the comments. I like the reordering trick, but it only works if you have accesses to different banks that can be re-ordered. In my original analysis I did not make this assumption. Consider, for example, performing a dot product on vectors of 32-bit values, each with a stride of 64B, and with a modulo-64 offset of 4 Bytes. Every load will access the same bank, so I think this case will have lots of conflicts, but every pair of loads differs in bit 2, so the pairs do not match in bits 5:2 and therefore (according to the wording of the optimization reference manual section, page 3-43) should *not* experience bank conflicts.

I had intended to test this particular case, but now that I look at my code I see that my code with offsets does roll over all of the banks (using a stride of 68 Bytes), so the reordering trick is sufficient to explain the observed drop in bank conflicts.

Concerning the write bandwidth on the AMD processors: Recent Intel processors (Nehalem & newer) have 10 "Line Fill Buffers" per core, and use these for streaming stores. In contrast, the AMD Family 10h processors have 8 "Miss Address Buffers" that are used for cacheable L1 misses (load or store) and 4 separate "Write Combining Buffers" that are used for streaming stores. This gives the AMD Family 10h processor significantly less potential concurrency for stores. Unfortunately it is quite difficult to estimate the amount of time that a buffer needs to be occupied for a streaming store operation, so it is not obvious how to determine whether the streaming store performance is concurrency-limited. In both AMD and Intel systems, the buffers used by the cores to handle streaming stores will hand off the data to the memory controller at some point, so they will probably have shorter occupancy than what is required for reads (since the buffers have to track reads for the full round trip), but the specifics of the hand-off are going to be implementation dependent and I don't see any obvious methodology for estimating occupancy. Once the streaming stores have been handed off to memory controller queues things are even less clear, since the number of buffers in the memory controller does not appear to be documented, and the occupancy in those buffers will depend on details of the coherence protocol that are unlikely to be discussed in public.

A brief look at the BIOS and Kernel Developer's Guide for the AMD Family 15h processors suggests that the cache miss buffer architecture has been changed significantly, but I have not worked through the details. I did find a note in AMD's Software Optimization Guide for Family 15h Processors (publication 47414, revision 3.06, January 2012) that says that Family 15h processors have about the same speed as Family 10h processors when writing a single write-combining stream, but may be slower when writing more than one write-combining stream. I have a few Family 15h boxes in my benchmarking cluster, but since our production systems are all currently Intel-based, I have not had much motivation to research the confusing bandwidth numbers that I obtained in my initial testing.