Agner's CPU blog


 
Test results for Intel's Sandy Bridge processor - Agner - 2011-01-30
Test results for Intel's Sandy Bridge processor - PaulR - 2011-02-15
AVX2 - phis - 2011-06-23
AVX2 - Agner - 2011-06-23
Test results for Intel's Sandy Bridge processor - anon - 2013-08-01
Test results for Intel's Sandy Bridge processor - Agner - 2013-08-06
Test results for Intel's Sandy Bridge processor - anon - 2013-08-07
Test results for Intel's Sandy Bridge processor - Agner - 2013-08-07
Test results for Intel's Sandy Bridge processor - anon - 2013-08-07
Test results for Intel's Sandy Bridge processor - Agner - 2013-08-08
Test results for Intel's Sandy Bridge processor - anon - 2013-08-08
Test results for Intel's Sandy Bridge processor - Agner - 2013-08-09
Test results for Intel's Sandy Bridge processor - anon - 2013-08-09
Test results for Intel's Sandy Bridge processor - Agner - 2013-08-10
Test results for Intel's Sandy Bridge processor - Agner - 2013-08-10
Test results for Intel's Sandy Bridge processor - John D. McCalpin - 2013-10-09
Test results for Intel's Sandy Bridge processor - Agner - 2013-10-10
Test results for Intel's Sandy Bridge processor - John D. McCalpin - 2013-10-11
SB's L1D banks - Tacit Murky - 2013-11-03
SB's L1D banks - John D. McCalpin - 2013-11-07
Test results for Intel's Sandy Bridge processor - John D. McCalpin - 2015-08-18
Test results for Intel's Sandy Bridge processor - Agner - 2015-08-18
Test results for Intel's Sandy Bridge processor - John D. McCalpin - 2015-08-24
Test results for Intel's Sandy Bridge processor - Agner - 2015-08-25
Test results for Intel's Sandy Bridge processor - John D. McCalpin - 2015-08-25
Haswell upper128 power gating - Peter Cordes - 2015-08-28
Haswell upper128 power gating - Agner - 2016-01-16
Haswell upper128 power gating - John D. McCalpin - 2016-01-29
Haswell upper128 power gating - Agner - 2016-01-30
Test results for Intel's Sandy Bridge processor - Agner - 2015-12-20
Test results for Intel's Sandy Bridge processor - John D. McCalpin - 2015-12-21
Test results for Intel's Sandy Bridge processor - Agner - 2015-12-22
Test results for Intel's Sandy Bridge processor - Robert - 2015-12-24
Test results for Intel's Sandy Bridge processor - Just_Coder - 2015-12-25
Test results for Intel's Sandy Bridge processor - Agner - 2015-12-26
Test results for Intel's Sandy Bridge processor - Just_Coder - 2015-08-23
Test results for Intel's Sandy Bridge processor - Agner - 2015-08-25
 
Test results for Intel's Sandy Bridge processor
Author: Agner Date: 2011-01-30 11:15

I have now got an opportunity to test the new Sandy Bridge processor from Intel, and the results are very interesting. There are many improvements - and few drawbacks. I have updated my manuals with the details, but let me just summarize the main findings here:

New micro-op cache
The decoders translate the CISC style instructions to RISC style micro-operations. The Sandy Bridge has a new cache for storing decoded micro-operations after the decoders, but the traditional code cache before the decoders is still there. The micro-op cache turns out to be very efficient in my tests. It is easy to obtain a throughput of 4, or even 5, instructions per clock cycle as long as the code fits into the micro-op cache.  
  
Decoders
While the throughput is improved quite a lot for code that fits into the micro-op cache, it is not improved in situations where the critical code is too big for the micro-op cache (but not too big for the level-1 code cache). The decoders in the Sandy Bridge are almost identical to the design in previous processors with the same limitation of 16 bytes per clock cycle. The maximum throughput of 4 or 5 instructions per clock cycle is rarely obtained. The difference in performance between code that fits into the micro-op cache and code that doesn't makes the micro-op cache a precious resource. It is so important to economize the use of the micro-op cache that I would give the advice never to unroll loops.
  
Macro-fusion
There is one improvement in the decoders, though. It is possible to fuse two instructions into one micro-op in more cases than before. For example, an ADD or SUB instruction can be fused with a conditional jump into one micro-op. This makes it possible to make a loop where the overhead of the loop counter and exit condition is just one micro-op.
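For example, a counted loop can be written so that the entire counting overhead decodes to a single fused micro-op (a minimal sketch in NASM syntax; the label and count are arbitrary):

mov ecx, 1000        ; loop counter
LOOPTOP:
; ... loop body ...
sub ecx, 1           ; SUB can fuse with the following conditional jump
jnz LOOPTOP          ; SUB + JNZ issue as one fused micro-op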
  
Branch prediction
The branch predictor has bigger history buffers than in previous processors, but the special loop predictor is no longer there. The misprediction penalty is somewhat shorter for code that resides in the micro-op cache.
  
AVX instruction set
The new AVX instruction set extends the vector registers from 128 bits to 256 bits. The floating point execution units have full 256-bit bandwidth. This means that you can do calculations on vectors of eight single-precision or four double-precision numbers with a throughput of one vector addition and one vector multiplication per clock cycle. I found that this doubled throughput is obtained only after a warm-up period of several hundred floating point operations. In the "cold" state, the throughput is only half this value, and the latencies are one or two clocks longer. My guess is that the Sandy Bridge is saving power by turning off the most expensive execution units when they are not needed, and it turns on the full execution power only when the load is heavy. This is my guess only - I have found no official mention of this warm-up effect.
  
Another advantage of the AVX instruction set is that all vector instructions now have a non-destructive version with three operands where the destination is stored in a separate register. Instead of A = A + B, we now have C = A + B, so that the value of A is not overwritten by the result. This saves a lot of register moves.
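A minimal sketch of the difference (NASM syntax):

movaps xmm2, xmm0        ; legacy SSE: copy A first...
addps  xmm2, xmm1        ; ...because A = A + B would destroy A
vaddps xmm2, xmm0, xmm1  ; AVX: C = A + B in one instruction, A preserved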
  
A disadvantage of the AVX instruction set is that all vector instructions now have two versions, a non-destructive AVX version and a two-operand non-AVX version, and you are not supposed to mix these two versions. If the programmer inadvertently mixes AVX and non-AVX vector instructions in the same code then there is a penalty of 70 clock cycles for each transition between the two forms. I bet that this will be a very common programming error in the future - and an error that is quite difficult to detect because the code still works, albeit slower.
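The usual way to avoid the penalty is to clear the upper halves of the YMM registers with VZEROUPPER before any non-VEX code can run (a sketch; legacy_sse_routine is a hypothetical routine compiled without VEX encoding):

vaddps ymm0, ymm1, ymm2  ; VEX-encoded 256-bit AVX code
vzeroupper               ; zero the upper 128 bits of all YMM registers
call legacy_sse_routine  ; hypothetical non-VEX SSE code, no transition penalty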
  
More memory ports
The Sandy Bridge has two memory read ports where previous Intel processors have only one. The maximum throughput is now 256 bits read and 128 bits write per clock cycle. The flipside of this coin is that the risk of contentions in the data cache increases when there are more memory operations per clock cycle. In my tests, it was quite difficult to maintain the maximum read and write throughput without being delayed by cache bank conflicts.
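A sketch of the instruction mix at the peak rate, assuming the three addresses hit different cache banks (register choices arbitrary):

movaps xmm0, [rsi]   ; 128-bit read
movaps xmm1, [rdi]   ; 128-bit read, issued in the same clock cycle
movaps [rdx], xmm2   ; 128-bit write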
  
Misaligned memory operands handled efficiently
On the Sandy Bridge, there is no performance penalty for reading or writing misaligned memory operands, except for the fact that it uses more cache banks so that the risk of cache conflicts is higher when the operand is misaligned. Store-to-load forwarding also works with misaligned operands in most cases.
  
Register read ports
Previous Intel processors have a serious - and often neglected - bottleneck in the register read ports. Ever since the Pentium Pro processor back in 1995, the Intel family 6 processors have had a limitation of 2 or 3 reads from the permanent register file per clock cycle. This bottleneck has finally been removed in the Sandy Bridge.
  
Zeroing instructions
An instruction that subtracts a register from itself will always give zero, regardless of the previous value of the register. This is traditionally a common way of setting a register to zero. Many modern processors recognize that this instruction doesn't have to wait for the previous value of the register. What is new in the Sandy Bridge is that it doesn't even execute this instruction. The register allocator simply allocates a new empty register for the result without even sending it to the execution units. This means that you can do four zeroing instructions per clock cycle without using any execution resources. NOPs are treated in the same efficient way without using any execution unit.
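Some common zeroing idioms that are handled this way (a short illustration):

xor  eax, eax    ; recognized as zeroing; resolved by the register allocator
sub  ebx, ebx    ; same: no execution unit used, no dependence on the old ebx
pxor xmm0, xmm0  ; vector registers are zeroed the same way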
 
This technique is not new, actually. It has been used for many years with the FXCH instruction (exchange floating point registers). There are special reasons for resolving the FXCH instruction in the register allocator/renamer, but it is funny that this technique hasn't been extended to other uses until now. It would be obvious to use this technique for register-to-register moves too, but so far we have not seen such an application.
  
Data transport delay
Most modern processors have different execution unit clusters or domains for different types of data or different types of registers, e.g. integer and floating point. Many processors have a delay of one or two clocks for moving data from one such domain to another. These delays are diminished in the Sandy Bridge and in some cases completely removed. I found that it is possible to move data between integer registers and vector registers without any delay.
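For example (a sketch), a round trip between the two domains:

movd xmm0, eax   ; integer register to vector register
movd ebx, xmm0   ; and back, with no extra bypass delay in my measurements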
  
Writeback conflicts
When two micro-operations with different latencies run in the same execution port then they may both finish at the same time. This leads to a conflict when both need the writeback port and the result bus at the same time. Both Intel and AMD processors have this problem. The Sandy Bridge can avoid most writeback conflicts by fixing execution latencies to standard values, by allowing writeback to different execution domains simultaneously, and by delaying writeback when there is a conflict.
  
Floating point underflow and denormal numbers
Denormal numbers are floating point numbers that are coded in a special way, defined by the IEEE 754 standard for values close to underflow. Most processors are unable to handle floating point underflow, denormal numbers, and other special cases in the general floating point execution units. These special cases are typically handled by microcode exceptions at a cost of 150-200 clocks per instruction. The Sandy Bridge can handle many of these special cases in hardware without any penalty. In my tests, the cases of underflow and denormal numbers were handled just as fast as normal floating point numbers for addition, but not for multiplication.
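A sketch of a probe for this behavior; denorm is assumed to be a memory location holding a denormal double (e.g. the bit pattern 0x0000000000000001):

movsd xmm0, [denorm]   ; load a denormal operand
addsd xmm1, xmm0       ; addition: full speed in my tests
mulsd xmm2, xmm0       ; multiplication: may still take a slow microcode assist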

My conclusion is that the Sandy Bridge processor has many significant improvements over previous processors. The most serious bottlenecks and weaknesses of previous processors have been removed. The micro-op cache turns out to be an important improvement for relatively small loops. Unfortunately, the poor performance of the decoders has not been improved. This remains a likely bottleneck for code that doesn't fit into the micro-op cache.

The decoding of instruction lengths has been a problem in Intel processors for many years. They tried to fix the problem with the trace cache in the Pentium 4, which turned out to be a dead end, and now with the apparently more successful micro-op cache in the Sandy Bridge. AMD have solved the problem of detecting instruction lengths in their processors by marking instruction boundaries in the code cache. Intel did the same in the Pentium MMX back in 1996, and it is a mystery to me why they are not using this solution today. There would hardly be a need for the micro-op cache if they had instruction boundaries marked in the code cache.

Whenever the narrowest bottleneck of a system is removed, the next-narrowest bottleneck becomes visible. This is also the case here. As the memory read bandwidth is doubled, the risk of cache bank conflicts increases. Cache bank conflicts were actually the limiting factor in some of my tests.

It has struck me that the new Sandy Bridge design is actually under-hyped. I would expect a new processor design with so many improvements to be advertised aggressively, but the new design doesn't even have an official brand name. The name Sandy Bridge is only an unofficial code name. In Intel documents it is variously referred to as "second generation Intel Core processors", "2xxx series", and "Intel microarchitecture code name Sandy Bridge". I have never understood what happens in Intel's marketing department. They keep changing their nomenclature, and they use the same brand names for radically different technical designs. In this case they have no reason to obscure technical differences. How can they cash in on the good reputation of the Sandy Bridge design when it doesn't even have an official name?

[Corrected on June 8, 2011, and March 2, 2012].

   
Test results for Intel's Sandy Bridge processor
Author: PaulR Date: 2011-02-15 11:09
Hi Agner - have you noticed that Turbo Boost is much more effective on Sandy Bridge than on Nehalem? On a 4-core 3.4 GHz SB with all cores working flat out I'm seeing the clock speed staying at 4.3 GHz for 20+ minutes. This seems to suggest that there is 25% extra performance to be had for free, as long as you have sufficient cooling.
   
AVX2
Author: phis Date: 2011-06-23 01:13
Thanks for your detailed analysis, this is very useful indeed.

Have you seen the updated Intel Advanced Vector Extensions Programming Reference (June 2011)? There are interesting things in there, including AVX2 (256-bit integer AVX instructions) and some VEX-encoded general-purpose instructions for bit manipulation and more.

   
AVX2
Author: Agner Date: 2011-06-23 11:35
Thanks for the reference. I always expected that there would be an AVX2 with 256 bit integer vector instructions.

The most surprising extension is the VGATHER.. instructions that allow vectorized table lookup. Lookup tables have always been an obstacle to vectorization. I wonder how efficient it will be, though. The performance will still be limited by the number of address-generation units and read ports in the CPU.

The physical random number generator instruction (RDRAND) has been announced previously. It is strongly needed for cryptographic and security applications. The VIA processors have had such an instruction for years now.

I will update my "objconv" disassembler with the new instructions when I get the time.

   
Test results for Intel's Sandy Bridge processor
Author: anon Date: 2013-08-01 06:26
Thank you very much for the good analysis.

There is one restriction that isn't mentioned in your document. In Sandy Bridge and later processors, the instructions to which macro-op fusion can be applied (add, sub, and, cmp, test, inc, dec) seem to be decoded only by the simple decoders (3 of the 4). This restriction does not exist in Nehalem or earlier processors.

Admittedly there is the decoded uop cache, and the OoO backend executes these instructions at a throughput of 3 per cycle, so it would have little impact on real-world performance. But it might be a somewhat different story on Haswell, which has more execution ports.

   
Test results for Intel's Sandy Bridge processor
Author: Agner Date: 2013-08-06 01:47
How do you know, Anon?
Please don't post unverified claims anonymously.
   
Test results for Intel's Sandy Bridge processor
Author: anon Date: 2013-08-07 07:19
500 iterations of this code sequence (4,000 instructions, which does not fit into the uop cache):

or rax, 1
or rdx, 1
or rsi, 1
or rdi, 1
or r8, 1
or r9, 1
movaps xmm0, [r10]
movaps xmm1, [r11]

runs at 2 clocks / 8 instructions (as expected). But if we change the 6 ORs into ANDs (or other macro-fusable instructions), it drops to 2.5 clocks / 8 instructions.

It means that the decoder cannot handle four macro-fusable instructions in the same clock cycle.

   
Test results for Intel's Sandy Bridge processor
Author: Agner Date: 2013-08-07 11:11
anon wrote:
But if we change the 6 ORs into ANDs (or other macro-fusable instructions), it drops to 2.5 clocks / 8 instructions. It means that the decoder cannot handle four macro-fusable instructions in the same clock cycle.
I get 2.45 clocks on an Ivy Bridge. I get the same for NOT and NEG, which are not fusable. There is nothing the instructions can actually fuse with, though.
   
Test results for Intel's Sandy Bridge processor
Author: anon Date: 2013-08-07 11:49
Agner wrote:
I get the same for NOT and NEG, which are not fusable.
Repeating 6 not/neg instructions (2 or 3 bytes x 6) will be affected by the predecoder's limitation. To avoid that, this code sequence is helpful:

not rax
not rdx
not rsi
not rdi
or r8, 1
or r9, 1
movaps xmm0, [r10]
movaps xmm1, [r11]

This runs at 2 clocks / 8 insts. But

and rax, rax
and rdx, rdx
and rsi, rsi
and rdi, rdi
or r8, 1
or r9, 1
movaps xmm0, [r10]
movaps xmm1, [r11]

this doesn't.

   
Test results for Intel's Sandy Bridge processor
Author: Agner Date: 2013-08-08 01:38
anon wrote:
Repeating 6 not/neg instructions (2 or 3 bytes x 6) will be affected by the predecoder's limitation.
Is there a limitation on decoding short instructions? Is this documented anywhere?
I have observed on the Haswell that conditional move instructions, which generate 2 microops, decode at two per clock only when I add prefixes to make the instructions 4 bytes long. This applies also when the microop cache is used.
   
Test results for Intel's Sandy Bridge processor
Author: anon Date: 2013-08-08 04:56
Agner wrote:
Is there a limitation on decoding short instructions? Is this documented anywhere?
I'm not sure if it really is the predecoder's limitation. For example,

or reg, reg
or reg, reg
or reg, reg
mov reg, [reg]

This code sequence should ideally run at 1 clock / 4 instructions. I changed the instruction lengths from 2 to 4 bytes using these variants:

or r32, r32 : 2B OR
or r64, r64 : 3B OR
or r64, 1 : 4B OR
mov r32, [reg] : 2B MOV
mov r64, [reg] : 3B MOV
mov r64, [reg+8] : 4B MOV

The results are:

inst.     clocks/4 insts.
pattern   $miss  $hit
--------  -----  -----
2+2+2+2   1.0    1.0
3+2+2+2   1.13   1.13
3+3+2+2   1.25   1.19
3+3+3+2   1.31   1.0
3+3+3+3   1.21   1.15
4+3+3+3   1.16   1.0
4+4+3+3   1.0    1.10
4+4+4+3   1.0    1.16
4+4+4+4   1.0    1.0

So it seems there are some limitations on the instruction count within a 16B (or larger) code block, for both the legacy decoder and the uop cache.
   
Test results for Intel's Sandy Bridge processor
Author: Agner Date: 2013-08-09 01:53
This looks like an alignment issue. The code is fetched in 16-byte blocks. Instructions that cross a 16-byte boundary (or 32-byte boundary?) are decoded less efficiently. The µop cache is coupled to the instruction cache with a maximum of three 6-µop entries per 32-byte block of code. How this translates into inefficiency when instructions of certain lengths execute out of the µop cache, I don't really understand.

I have done some experiments to test your claim that fuseable instructions decode less efficiently:

xchg r8,r9    ; 3 µops. Decodes alone
or eax,eax    ; 1 µop, D0
or ebx,ebx    ; 1 µop, D1
or ecx,ecx    ; 1 µop, D2
or edx,edx    ; 1 µop, D3
This decodes in 2 clocks. If the last OR is changed to an AND, it decodes in 3 clocks. The decoder will not put a fuseable arithmetic/logic instruction in decoder D3, because then it couldn't check in the same clock cycle whether the next instruction is a branch. There is no effect when this executes out of the µop cache.
   
Test results for Intel's Sandy Bridge processor
Author: anon Date: 2013-08-09 04:50
Interesting. So it sounds like the odd rule also applies in uop cache territory?

Here is another example:

or rax, 1
or rdx, 1
or rsi, 1
movaps xmm0, [r10]
or rdi, 1
or r8, 1
movaps xmm1, [r11]
or r9, 1

This runs at 2 clocks / 8 instructions regardless of uop cache hit/miss. But if all the ORs are changed into ANDs, it drops to 2.45 clocks / 8 instructions when the code doesn't fit into the uop cache.

Of course,

and rax, 1
and rdx, 1
and rsi, 1
movaps xmm0, [r10]
and rdi, 1
and r8, 1
and r9, 1
movaps xmm1, [r11]

This runs at 2 clocks / 8 instructions without problem.

The result means not only that the decode throughput of the AND instruction is limited to 3 per cycle, but also that the 4-1-1-1 pattern rule applies to it. This makes me believe that macro-fuseable instructions are only handled by the simple decoders.

   
Test results for Intel's Sandy Bridge processor
Author: Agner Date: 2013-08-10 01:44
There are at least two different issues here. One is, as you suggested, that the fuseable instructions don't go into the last decoder. The other is that short instructions don't go into the µop cache if they generate a total of more than 18 µops per 32 bytes of code. Maybe there is also an alignment issue. We will have to do some more experiments to test this. You can easily make instructions longer (up to 15 bytes) by adding dummy segment prefixes (db 3EH).
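For example (a sketch in NASM syntax):

db 3EH     ; dummy DS segment prefix attached to the next instruction
or rax, 1  ; same instruction as before, now one byte longer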
   
Test results for Intel's Sandy Bridge processor
Author: Agner Date: 2013-08-10 05:52
Now I have done some tests of the alignment effects. This explains the weird results I have seen earlier, where performance improved when some instructions were made longer.
mov ebp, 100
align 32
LL:
%rep 100 
                     ;  uops  Bytes
cmove eax,eax        ;    2      3
cmove ebx,ebx        ;    2      3
xchg  r8,r9          ;    3      3
nop7                 ;    1      7
nop7                 ;    1      7
nop8                 ;    1      8
nop                  ;    1      1
;                Total:  11     32
%endrep
dec ebp
jnz LL

This takes almost 4 clocks per iteration. When I add a nop after align 32 to change the alignment by one byte, it takes only 3 clocks. The explanation is this. Each µop cache line can take 6 µops. The first two instructions take one µop cache line. The xchg instruction cannot cross a cache line, so it starts in a new cache line. The next three instructions go in the same line, and the last nop takes a third line. Then there is a 32-byte boundary and we start a new cache line. In total we need 300 cache lines, and there are only 256 lines in the µop cache. The loop doesn't fit into the µop cache, so the decoders become the bottleneck. When the alignment is changed, the last nop goes together with the two cmove instructions of the next iteration, and we need only 200 cache lines. Now it fits into the µop cache and the speed goes up. The same can be obtained by lowering the repeat count.

   
Test results for Intel's Sandy Bridge processor
Author: John D. McCalpin Date: 2013-10-09 13:14
First -- thank you very much for your performance work -- this is by far the most comprehensive and accurate compilation of microarchitecture and performance data that I have been able to find since I left the AMD processor design team at the end of 2008, and it has been very helpful in my ongoing studies of core performance issues. (Most of my prior work has been on memory systems and coherence protocols -- e.g., www.cs.virginia.edu/stream/ -- but now I am trying to learn more about core microarchitecture and performance and power).

This note concerns the L1 Data Cache Banking on Intel's Sandy Bridge (and presumably Ivy Bridge) processors.

Intel's Performance Optimization reference manual (document 248966-028, July 2013) says that Sandy Bridge cores will have an L1 Data Cache bank conflict if two loads ready for issue in the same cycle to two different cache lines in the same cache set have matching bits 5:2.
Actually they say bits 4:2 in section 2.2.5.2 (page 2-20) and bits 5:2 in section 3.6.1.3 (page 3-43), but an Intel employee confirmed that the latter was correct in a forum post at software.intel.com/en-us/forums/topic/280663

This seems odd, since 5:2 is four bits and they are clear in reporting that there are only 8 banks. In the forum posts, the Intel employees were clearly not being permitted to disclose the full details, so my curiosity was aroused.

The example code that they provide in section 3.6.1.3 (example 3-37) attempts to load two 32-bit items from the same offset within two different cache lines mapping to the same cache set. This does demonstrate bank conflicts, but not very many. (The loads can dual-issue after the first cycle -- so the code takes 5 cycles to perform the 8 loads instead of 4 cycles.) Repeating the loop a million times and using performance counter event BFh, umask 05h: L1D_BLOCKS.BANK_CONFLICT_CYCLES confirmed the stalls.

Unfortunately the "corrected" version that they provide does not demonstrate that a difference in bits 5:2 will avoid a bank conflict.
Instead, it demonstrates that loading two adjacent 32 bit values from the *same* cache line results in no conflicts -- which is no surprise at all.

So I built code similar to their example, except that all 8 loads were to the same offset in 8 different cache lines that mapped to the same cache set. This gave a measured bank conflict rate close to my estimate of 7/8 (since no stall is counted for the first of the 8 loads, and the conflict continues for all loads after the first).

Then I modified the offsets so that the 8 loads were to consecutive 32-bit locations in 8 different cache lines that mapped to the same set. I.e., a stride of 17 32-bit words instead of 16 32-bit words. This gave zero conflicts and directly confirms that a difference in address bit 2 is enough to prevent a bank conflict (at least for 32-bit loads). That is quite an interesting result because it does not fit easily into the model of a cache having 64-bit wide or 128-bit wide banks (as you suggest in section 9.13 of your microarchitecture reference guide).
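In outline, the distinction looks like this (a simplified sketch, not my exact test code; rsi is assumed to point into the test buffer):

mov eax, [rsi]      ; reference load
mov ebx, [rsi+64]   ; different line, bits 5:2 match: can bank-conflict
mov ecx, [rsi+68]   ; different line, bit 2 differs: no conflict observed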

My current hypothesis is that the cache has 8 banks that are each 32 bits wide, but run at twice the processor core frequency -- giving an effective width of 64 bits, but a granularity of access of 32 bits -- almost the same as having 16 banks. The main idea is that each bank can accept two addresses per cycle and deliver two 32-bit results from different lines, but with the critical limitation that it can only deliver the low-order 32-bits in the first half-cycle and can only deliver the high-order 32-bits in the second half-cycle. This combination of features is the only mechanism I could think of that retains the bank conflict seen when bits 5:2 match but which allows dual issue when bits 5:2 differ.

Technologically, a double-speed cache appears possible -- experiments with the CACTI cache simulator (http://www.hpl.hp.com/research/cacti/) suggest that a 32 KiB cache of similar configuration should be able to run at up to about 7.5 GHz in a 32nm process technology, with an area similar to what I estimate from the Sandy Bridge die photos.

I have reviewed many of the other possible combinations of alignment for a pair of loads and my hypothesis appears to provide a plausible explanation of the observed behavior. There are some problematic cases with combinations that include a 128-bit load on a 32-bit boundary where my model suggests a bank conflict even when bits 5:2 differ, but I am not sure that Intel's claims about the ability to dual-issue are intended to cover all such misalignments (and I have not coded any of these cases to see if they actually generate bank conflicts).

This is part of work that I am doing to develop a set of microbenchmarks that can be used to document the behavior of hardware performance counters so that I can have some hope of using them to understand application characteristics. I have not had time to review your latency and throughput test codes yet, but I hope that with some modification (mostly controlling where the data is located when data motion instructions are executed) they will be useful in illuminating the specifics of what the performance counters are actually counting....

   
Test results for Intel's Sandy Bridge processor
Author: Agner Date: 2013-10-10 10:34
Thank you for your comments John.

I think it is unlikely that the cache could be running at double clock frequency. It is too big for that. Some previous models have run the cache at half clock frequency. Maybe your observations have something to do with the fact that the Sandy Bridge has two read ports?

Have you tried on Haswell? It should have no cache bank conflicts.

   
Test results for Intel's Sandy Bridge processor
Author: John D. McCalpin Date: 2013-10-11 19:11
I am not an SRAM expert, but my experiments with CACTI suggest that a double-speed 32 KiB cache is possible in a 32 nm process. It is certainly possible that I am misunderstanding the results.

It seems to me that the available documentation leaves a lot of uncertainty about how the cache ports relate to the cache SRAM banks. Intel's comment that a 16-Byte read may access as many as three banks strongly implies that the banks are 8 Bytes wide. Another possibility is that the addresses are "swizzled" somehow, but I have been unable to come up with a swizzling scheme that matches Intel's descriptions or my observations. Still a possibility, of course. I did not work on the core microarchitecture when I was at AMD, but my impression was that aligned 16 Byte loads in the Family 10h processors were serviced by a single bank.

We don't have any Haswell systems at TACC --- I think that it will be the second half of next year before the two-socket Haswell-based servers are available. We will probably have access a bit earlier. I also read Intel's comments about the absence of bank conflicts in Haswell, and am looking forward to testing the new technology.

   
SB's L1D banks
Author: Tacit Murky Date: 2013-11-03 03:29
Hello, John.

In our (ixbt.com) low-level tests we have confirmed that the L1D has 8-byte banks, selected by address bits 5:3 (this was also confirmed by an SB architecture team engineer). Solving the 4-byte access case is easy: the OoO memory access logic (Intel term: MD) will reorder reads to issue them to different banks: A+0 & A+8, then A+4 & A+12, then the same for the next 4 reads and 2 banks, etc. (A = the line's address). Also, by delaying the 1st access (producing a «conflict» event for the PMC), it is possible to issue all the other loads without reordering, still hitting different banks: A+0 & (none), A+4 & A+8, A+12 & A+16…

DDR for cache bit-lines is possible, but it removes the possibility of (practically, the need for) precharge. Without precharge, the bit-lines would have to swing 0<=>1 and back up to twice per clock. That requires fast (HT) transistors with high parameter uniformity (a big problem at 45 nm and below) and, most importantly, it would ruin the performance/watt metric of such a cache. Both Intel and AMD avoid this at all costs, e.g. converting 6T bit-cells to 8T (for L1s and L2s) just to save power.

But I'm still curious how Intel resolved bank conflicts in Haswell. The naive solution is to make all banks 3-ported (2R+1W), which would require 10T cells. But early die shots show only a slightly larger L1D area compared to IB, with the same aspect ratio. Hm?…

While we're at it, can I ask why AMD's memory controllers are so slow, especially on writes? They can never achieve even 50% of theoretical peak throughput. Intel can do more. See the AIDA64 «cache & memory benchmark» results, like this: www.easycom.com.ua/data/nouts/1302101905/img/38_aida64_memory-cache.jpg

   
SB's L1D banks
Author: John D. McCalpin Date: 2013-11-07 16:40
Thanks to Tacit Murky for the comments. I like the reordering trick, but it only works if you have accesses to different banks that can be re-ordered. In my original analysis I did not make this assumption. Consider, for example, performing a dot product on vectors of 32-bit values, each with a stride of 64B, and with a modulo-64 offset of 4 Bytes. Every load will access the same bank, so I think this case will have lots of conflicts, but every pair of loads differs in bit 2, so the pairs do not match in bits 5:2 and therefore (according to the wording of the optimization reference manual section 3.6.1.3, page 3-43) should *not* experience bank conflicts.

I had intended to test this particular case, but now that I look at my code I see that my code with offsets does roll over all of the banks (using a stride of 68 Bytes), so the reordering trick is sufficient to explain the observed drop in bank conflicts.

Concerning the write bandwidth on the AMD processors: Recent Intel processors (Nehalem & newer) have 10 "Line Fill Buffers" per core, and use these for streaming stores. In contrast, the AMD Family 10h processors have 8 "Miss Address Buffers" that are used for cacheable L1 misses (load or store) and 4 separate "Write Combining Buffers" that are used for streaming stores. This gives the AMD Family 10h processor significantly less potential concurrency for stores. Unfortunately it is quite difficult to estimate the amount of time that a buffer needs to be occupied for a streaming store operation, so it is not obvious how to determine whether the streaming store performance is concurrency-limited. In both AMD and Intel systems, the buffers used by the cores to handle streaming stores will hand off the data to the memory controller at some point, so they will probably have shorter occupancy than what is required for reads (since the buffers have to track reads for the full round trip), but the specifics of the hand-off are going to be implementation dependent and I don't see any obvious methodology for estimating occupancy. Once the streaming stores have been handed off to memory controller queues things are even less clear, since the number of buffers in the memory controller does not appear to be documented, and the occupancy in those buffers will depend on details of the coherence protocol that are unlikely to be discussed in public.

A brief look at the BIOS and Kernel Developer's Guide for the AMD Family 15h processors suggests that the cache miss buffer architecture has been changed significantly, but I have not worked through the details. I did find a note in AMD's Software Optimization Guide for Family 15h Processors (publication 47414, revision 3.06, January 2012) that says that Family 15h processors have about the same speed as Family 10h processors when writing a single write-combining stream, but may be slower when writing more than one write-combining stream. I have a few Family 15h boxes in my benchmarking cluster, but since our production systems are all currently Intel-based, I have not had much motivation to research the confusing bandwidth numbers that I obtained in my initial testing.

   
Test results for Intel's Sandy Bridge processor
Author: John D. McCalpin Date: 2015-08-18 09:45
Hi Agner,
When I was doing some very fine-grained performance testing on Haswell (Xeon E5-2667 v3), I saw some anomalies that reminded me of your comments on the AVX "warm-up" period on Sandy Bridge. The test code is an L1-contained summation of a single vector. For N=2048 and 256-bit VADDPD instructions, it should take 512 cycles (plus some overhead).
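In outline, the inner kernel looks like this (a simplified sketch, not the exact test code; rsi is assumed to point to the 2048-element double array, with four accumulators to hide the VADDPD latency):

vxorpd ymm0, ymm0, ymm0   ; four independent accumulators
vxorpd ymm1, ymm1, ymm1
vxorpd ymm2, ymm2, ymm2
vxorpd ymm3, ymm3, ymm3
xor rcx, rcx
SUMLOOP:
vaddpd ymm0, ymm0, [rsi+rcx]
vaddpd ymm1, ymm1, [rsi+rcx+32]
vaddpd ymm2, ymm2, [rsi+rcx+64]
vaddpd ymm3, ymm3, [rsi+rcx+96]
add rcx, 128
cmp rcx, 16384            ; 2048 doubles * 8 bytes
jb SUMLOOP                ; 128 iterations * 4 = 512 VADDPD total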
What I observed was
(1) an initial "emulation" period of 4-7 iterations that took ~2200 cycles each,
(2) a "transition" iteration that took over 31,000 cycles -- about 25,500 halted, and about 5500 active,
(3) "normal" behavior of 512 or 516 cycles for the rest of the iterations (after subtracting the approximate overhead).

I added an outer loop with a (non-256-bit) "spinner" to see how long it takes for the processor to revert to initial behavior. If the spinner between outer loop iterations was less than 1 millisecond, the subsequent inner iterations ran at full speed. If the spinner between outer loop iterations was more than 1 millisecond, the subsequent inner iterations showed the behavior above.

This behavior occurs even if the core frequency is bound to any of the available frequencies (except perhaps the lowest frequency -- I need to go back and double-check those results). Performance counters showed that the core was running at the requested frequency in each case (comparing actual and reference cycles gave the expected ratio).
There are no kernel cycles, even during the transition.
The performance counters for micro-ops dispatched to the various ports show only very minor differences between the "warm-up" and "normal" cycles.
I could not find *any* performance counters (other than cycles) that could distinguish between the 1/4-speed "warm-up" and "normal" operations (but I have not tried all of them).

So this looks like a very low-level emulation of the 256-bit pipeline by forcing everything through the bottom 128-bit pipe, with a remarkably slow transition when the upper 128-bit pipe is enabled. Perhaps the current draw is so large that the chip has to wait for the voltages to settle, even with no frequency change?

I did not look for evidence of the overhead of the transition in the other direction -- I assume it will be much quicker to turn off the upper 128-bit FP pipe than to turn it on.

   
Test results for Intel's Sandy Bridge processor
Author: Agner Date: 2015-08-18 10:52
John D. McCalpin wrote:
So this looks like a very low-level emulation of the 256-bit pipeline by forcing everything through the bottom 128-bit pipe, with a remarkably slow transition when the upper 128-bit pipe is enabled.
Thank you for sharing your findings. I wonder if it is possible to distinguish between running at reduced speed and running in the lower 128-bit lane.
   
Test results for Intel's Sandy Bridge processor
Author: John D. McCalpin Date: 2015-08-24 11:22
My test code measured the elapsed TSC time (using RDTSCP) and used the performance counters to measure Unhalted Core Cycles and Unhalted Reference Cycles (with inline RDPMC instructions). The difference between TSC cycles and Unhalted Reference Cycles gives the number of halted cycles, and the ratio of Unhalted Core Cycles to Unhalted Reference Cycles gives the average core frequency while not halted (relative to the nominal frequency). This data is sufficient to clearly distinguish low-frequency operation from operation in a degraded performance mode. It might not be enough to identify operation with T-state throttling -- I have never tried to use that feature....
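For reference, the measurement primitives look like this in outline (a sketch; it assumes the kernel has programmed the fixed-function counters and permits user-mode RDPMC):

rdtscp              ; EDX:EAX = TSC, ECX = TSC_AUX (core ID)
mov ecx, 40000001h  ; select fixed counter 1: unhalted core cycles
rdpmc               ; EDX:EAX = counter value
mov ecx, 40000002h  ; select fixed counter 2: unhalted reference cycles
rdpmc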
   
Test results for Intel's Sandy Bridge processor
Author: Agner Date: 2015-08-25 00:28
John D. McCalpin wrote:
This data is sufficient to clearly distinguish low-frequency operation from operation in a degraded performance mode.
I think it runs at reduced frequency or with idle clocks in between. If it were running 256-bit instructions through the lower 128-bit unit, you would probably see half speed, not quarter speed, and twice the number of retired µops.
   
Test results for Intel's Sandy Bridge processor
Author: John D. McCalpin Date: 2015-08-25 11:58
If it were only a matter of arithmetic I would also expect the code to run at 1/2 speed when using only the lower 128-bit pipe. However on Haswell the transfer of data from the upper 128 bits of the AVX registers to the lower 128 bits has a 3-cycle latency, and although this can be fully pipelined in software (at 1 op/cycle), it is easy to believe that a hardware emulation mode that is only intended to be run for a minute fraction of the total cycles might not fully pipeline the "cross-lane" transfer across multiple instructions.

The uop count is a matter of how the engineers chose to implement the feature. If the implementation is internal to the functional unit, then it would not require extra uops, and I do not see any significant change in uop counts between the "slow" and "normal" phases. (The uop counts are elevated for the iteration that includes the transition, but it is not at all clear what is happening in that step.)

I think I forgot to mention that, just as you noticed on Sandy Bridge, there are no "warm-up" effects when using scalar AVX operations or 128-bit SSE operations. (I did not check 128-bit AVX, but there is not a lot of reason for that to be different from 128-bit SSE.) My assumption that the data is running through the "lower" 128-bit pipe is based in part on the observation that the 128-bit pipeline is available at full speed at all times. From an implementation perspective, running the 256-bit operations at a lower frequency does not make a lot of sense when there is a full-speed 128-bit pipeline ready to use.

   
Haswell upper128 power gating
Author: Peter Cordes Date: 2015-08-28 22:58
John D. McCalpin wrote:
What I observed was
(1) an initial "emulation" period of 4-7 iterations that took ~2200 cycles each,
(2) a "transition" iteration that took over 31,000 cycles -- about 25,500 halted, and about 5500 active,
(3) "normal" behavior of 512 or 516 cycles for the rest of the iterations (after subtracting the approximate overhead).

The huge number of halted cycles in (2) makes it likely that this isn't just a timer interrupt hitting one of your iterations or something. Probably not migrating from one core to another, either.

I agree with your speculation that this is probably the core halting for voltages to settle after powering up the high half of the execution units. I would have thought it might be possible to keep doing (1) "emulation" while the upper half settled, but this shows that Haswell doesn't work that way.

I wouldn't be so quick to assume that powering down the upper 128 doesn't halt for a lot of cycles, too. There will be some capacitance, so the supply voltage won't go to zero instantly. Garbage signals coming out of the upper128 vector units as the charge dissipates could well be a problem. Clearly there isn't gating to protect the rest of the execution unit from this, or emulation mode could continue while the upper128 powered up, and you'd go from (1) to (3) without a slow transition iteration. (not *that* slow, anyway.)

If we're lucky, powering down the upper128 of the vector units won't slow down integer code that uses different execution units, even though the integer execution units are on the same ports as the vector execution units. So it would be useful to alternate xmm and ymm vector loops, and ymm with non-vector loops, to look for a difference in the number of halted cycles when the CPU decides to power down the upper128.

Maybe the CPU's internal power management won't power down the upper128 unless the core is halted for another reason? Your 1ms of spin-loop threshold seems to rule that out, though.

I assume the whole core halts, affecting both hyperthreads, because it's due to a physical process. I guess you could look for this effect by timing a loop repeatedly on the other hardware thread, and recording timestamps for anomalies. If the timestamp for an extra-slow iteration in one thread was close to the timestamp for the transition iteration in the 256b-vector loop, then you could conclude that the whole core halted.

John D. McCalpin wrote:
I added an outer loop with a (non-256-bit) "spinner" to see how long it takes for the processor to revert to initial behavior. If the spinner between outer loop iterations was less than 1 millisecond, the subsequent inner iterations ran at full speed. If the spinner between outer loop iterations was more than 1 millisecond, the subsequent inner iterations showed the behavior above.

This behavior occurs even if the core frequency is bound any of the available frequencies (except perhaps the lowest frequency -- I need to go back and double-check those results). Performance counters showed that the core was running at the requested frequency in each case (comparing actual and reference cycles gave the expected ratio).
There are no kernel cycles, even during the transition.
The performance counters for micro-ops dispatched to the various ports show only very minor differences between the "warm-up" and "normal" cycles.
I could not find *any* performance counters (other than cycles) that could distinguish between the 1/4-speed "warm-up" and "normal" operations (but I have not tried all of them).

I'm not surprised that this is unrelated to frequency. Power-gating the upper128 of the vector units is a win at any frequency. Saving power at max frequency allows you to stay at max turbo longer. (Not to mention battery life.)

I think one compelling reason for doing it at a low level inside the execution units, rather than with special uops, is that Intel CPUs that support AVX also have a uop cache. You don't want to have to mark lines in the uop cache as "decoded for 128b-emulation" vs. "decoded for 256b vector units", and then potentially re-decode after powering up / down the upper128.

OTOH, extra uops could be generated on the fly in the scheduler that follows the ROB (re-order buffer), when uops are converted from fused-domain to unfused domain. If that's how it works, these uops could be flagged as "internally generated" so the perf counters don't count them. They may need to be flagged this way anyway, for things to work correctly. I doubt Intel would add extra complexity just for perf-counter bookkeeping to hide the internals. You did look at all the different uop issue / execute / retire counters, some of which count in the fused domain, and some of which count unfused uops, right? You said you looked at uops dispatched to ports, so I guess that should cover the unfused domain.

As you point out, it's a bit surprising that perf is worse than half. Pentium M had 64b execution units, and took longer for 128b vector ops, but only about twice as long. In that case, though, 128b vector ops decoded to 2 uops, instead of having shuffling within the execution unit. Maybe this emulation mode isn't fully pipelined, or the unusual latency creates write-back conflicts?

If emulation mode was fairly efficient, the upper128 might never need to power on for 256b code that was limited by memory bandwidth, frontend (ROB not filling up), or insn latency rather than throughput. Even 1/4 perf might still be efficient enough for some cases.

Maybe it would have taken more transistors to make emulation mode faster, and they decided it wasn't worth it to speed up the slow mode and be less aggressive in powering up the upper128.

   
Haswell upper128 power gating
Author: Agner Date: 2016-01-16 03:23
Session SPCS001 at the Intel Developer Forum 2015 reveals that Skylake can power down the upper 128-bit half of the 256-bit execution engine when it is not used: myeventagenda.com/sessions/0B9F4191-1C29-408A-8B61-65D7520025A8/7/5

This is presented as an innovation in Skylake. What John McCalpin has observed in previous processors is perhaps a different power-saving mechanism?

   
Haswell upper128 power gating
Author: John D. McCalpin Date: 2016-01-29 13:51
The IDF Skylake presentation seems to be saying something quite different from powering down the upper 128-bit lanes. The slide says the AVX2 infrastructure is powered down when not in use -- it says nothing about lanes or about 128 bits -- and the presenter was pretty clear, saying that the whole AVX2 "area" was powered off. This does lead to some problems of interpretation, since it is not clear whether this means only the AVX2 extensions (and not AVX v1, which is also 256 bits wide) or whether the processor keeps (at least) one 64-bit FP pipeline powered up. One can imagine that the number of applications that use no AVX2 instructions is quite large, and the number of applications that use no 256-bit registers is quite large, but the number of applications that use no floating-point at all is not nearly as large. Of course no hint is provided about the cost of the transition.

So it looks like Sandy Bridge, Haswell, and Skylake (client) all turn off the upper 128 bits of the SIMD pipelines, but only Haswell pays the ~10 microsecond stall when the upper lanes are turned on. It may not be a coincidence that of these processors, only Haswell uses in-package voltage regulators. One might speculate that the smaller in-package voltage regulators are unable to hold the voltage steady under large load increases, so powering up the upper 128-bit SIMD lanes requires a stall while the voltage recovers. The 10 microsecond stall is similar in magnitude to the stalls on p-state changes on earlier processors. I don't think that I have seen good measurements of the overhead of p-state changes in Xeon E5 v3 processors.

Of course we know nothing about the nature of the power-saving modes in either Sandy Bridge or Haswell. For example, one might speculate that turning off the clocks to the upper 128-bit SIMD lanes (but leaving the power on) would produce less power saving, but also less voltage drop when the clocks are re-enabled.

There are still some anomalies. Re-reading Agner's comments leaves me with the impression that he has not seen the ~10 microsecond stall on any processors tested. Is this correct? If so, which Haswell models were tested?
On 2015-12-24, Robert noted that he sees the ~10 microsecond stall on Core i7-4770K and Core i7-4700MQ (both Haswell "client" parts), but not on the Core i7-5820K (a "Haswell E" part -- basically a Haswell EP server part in a client configuration). Although experimental error is always a possibility, it is conceivable that some products (perhaps specifically those without all cores enabled) might not suffer enough voltage drop to require this stall?

I imagine a 10 microsecond stall could be very upsetting to some people working in real-time signal processing, so it would be nice to know which processors show this behavior and which do not. One would guess that Skylake will also experience a large stall when it needs to enable the AVX2 "area", but it is not clear how Intel is managing this transition. Looking forward, one would imagine that the power implications of enabling/disabling the 512-bit SIMD units in AVX-512 could lead to even larger disruptions?

   
Haswell upper128 power gating
Author: Agner Date: 2016-01-30 01:23
John D. McCalpin wrote:
the presenter was pretty clear, saying that the whole AVX2 "area" was powered off.
I don't think there is a special "area" just for AVX2. The execution units are divided by functionality, not by instruction set. The commercial presentation was just simplifying things.
Re-reading Agner's comments leaves me with the impression that he has not seen the ~10 microsecond stall on any processors tested. Is this correct? If so, which Haswell models were tested?
It was not a stall, but a 14 µs period of reduced throughput for all 256-bit instructions, both AVX and AVX2. I have seen this only on Skylake. The processors tested are listed in my instruction tables (Haswell family 6 model 3C).
   
Test results for Intel's Sandy Bridge processor
Author: Agner Date: 2015-12-20 05:56
John D. McCalpin wrote:
Hi Agner,
When I was doing some very fine-grained performance testing on Haswell (Xeon E5-2667 v3), I saw some anomalies that reminded me of your comments on the AVX "warm-up" period on Sandy Bridge.
I am testing the Skylake processor now, and it has a warm-up period for 256-bit vector operations of approximately 65,000 clock cycles. During this period, all 256-bit instructions take ~4.5 times as many clock cycles as normal. After the warm-up period, the 256-bit instructions have the same latency and throughput as similar 128-bit instructions.

I am not seeing the same phenomenon on any of the previous processors. Maybe you have a later version of Haswell. What is the CPUID and stepping number?

   
Test results for Intel's Sandy Bridge processor
Author: John D. McCalpin Date: 2015-12-21 16:38
Most of my testing was on various Xeon E5-2660 v3 processors. The Xeon E5 v3 specification update says that these are CPUID 0x306F2, Stepping M1.
I have seen the same behavior as recently as this morning on a system with Xeon E5-2680 v3 processors (same CPUID and stepping).

Note that only the Xeon E5 v3 parts have the differentiated frequencies for 256-bit operation, so I would not be surprised if the "client" Haswell parts did not show this behavior.

I repeated these tests on a Sandy Bridge (Xeon E5-2680) and found the same "half-speed" operation that you reported earlier. In my experiments the "half-speed" operation lasted for up to a few thousand cycles, but the transition to full speed operation incurred no stall cycles. It also appears that the "full speed" mode of operation is not retained as long -- even short (much less than 1 millisecond) periods of not using the 256-bit registers resulted in switching back to the slower mode.

   
Test results for Intel's Sandy Bridge processor
Author: Agner Date: 2015-12-22 01:42
This is interesting. I can see this warm-up behavior on an Intel Skylake i7-6700. It seems to use the lower 128-bit lane twice for 256-bit instructions during a warm-up period of 14 µs before the 256-bit instructions can run at full speed. It goes back to the cold state after 675 µs of no 256-bit instructions.

I have never seen this behavior on any other processor. It would be interesting to know which processors have this behavior and which ones have not.

   
Test results for Intel's Sandy Bridge processor
Author: Robert Date: 2015-12-24 04:04
Hi all, I have three Haswell machines, so I decided to test this phenomenon on all of them. I'm not doing exact measurements of cycles using MSRs; I used my old program that I created around two years ago to test an FMA implementation of Bresenham's line algorithm. I calculate the pixels of a rasterized line and measure the calculation duration using RDTSC, but the results tell something about this warm-up effect. I calculate the same line in 100 consecutive iterations; under ideal conditions, a single iteration should take around 600 cycles. Now the results:

Machine 1: this is my personal desktop PC with a Core i7-4770K; I bought it right when the Haswell CPUs were released, in June 2013.
The first iteration takes about 4000 cycles, the second 2000 cycles, and then the third iteration takes more than 30000 cycles. All subsequent iterations take 600 cycles. The long third iteration is not caused by a context switch or similar, since the results are consistent between multiple runs and when pinning to a single core.
Machine 2: work laptop with a Core i7-4700MQ, bought in autumn 2013.
The results are consistent with the first machine, with slight differences: first iteration about 5000 cycles, second 2000 cycles, third 22000 cycles, subsequent iterations 600 cycles.
Both machines are running Windows 10.
The results in Linux are only slightly different: there the first iteration takes about 32000 cycles and subsequent iterations 600 cycles (maybe the difference is caused by the Intel P-state driver in Linux?).

Now, I don't remember the exact results of my experiments when I created this program two years ago, but I think that this "third iteration slowdown" wasn't happening back then; there were only a few longer iterations at the start, and then the rest of the iterations were fast.
The only change during the last two years was that I updated the BIOS on both machines, which updated the CPU microcode. The current microcode revision of both CPUs is 0x1E. (CPUID revision 0x306C3, stepping C0; the platform ID of the first is 0x3E/0x02, of the second 0x27/0x10.)

Machine 3: this is a new powerful workstation that we bought at work, with a Core i7-5820K. Here the results are quite different:
the first iteration takes about 6000 cycles, then there are 5 iterations that take 1800 cycles, and all subsequent iterations take 600 cycles. There is no long transition iteration of over 20000 cycles.
This machine is running Windows 8.1; the microcode revision is 0x29, CPUID revision 0x306F2, platform ID 0x49/0x04, stepping R2.

So, my uneducated guess is that this behavior might also be caused by the different versions of microcode. I might try downgrading the BIOS of my desktop and repeating these tests with the older microcode.

   
Test results for Intel's Sandy Bridge processor
Author: Just_Coder Date: 2015-12-25 15:10
Why microseconds? It would be more precise to measure the difference at different frequencies, as most likely it is measured internally in cycles: something like an internal counter to shut down or power up the unofficial ports (which raises the question of how many of them there actually are). The switching time is most likely limited by the time necessary to flush the internal compiled code (as confirmed by some polymorphic code tests) and probably the interpreter (decoder); should they decrease the timeout, there might be a decent performance drop.
   
Test results for Intel's Sandy Bridge processor
Author: Agner Date: 2015-12-26 01:01
Just_Coder wrote:
Switching time is most likely limited by the time necessary to flush the internal compiled code
56,000 clock cycles at 4 GHz. This clock count is too high to be explained by a pipeline flush. More likely it is the time needed to power up the circuits and charge some internal capacitors.
   
Test results for Intel's Sandy Bridge processor
Author: Just_Coder Date: 2015-08-23 00:35
From recent testing there are some uncertainties: do you think partial decoding (instruction lengths and such) takes place at the stage of filling the cache (L1i)?
   
Test results for Intel's Sandy Bridge processor
Author: Agner Date: 2015-08-25 00:12
Just_Coder wrote:
From recent testing there are some uncertainties: do you think partial decoding (instruction lengths and such) takes place at the stage of filling the cache (L1i)?
Instruction boundaries are marked in the instruction cache on AMD processors, but not on most Intel processors.