Agner's CPU blog

Test results for Intel's Sandy Bridge processor
Author: John D. McCalpin  Date: 2013-10-09 13:14
First -- thank you very much for your performance work -- this is by far the most comprehensive and accurate compilation of microarchitecture and performance data that I have been able to find since I left the AMD processor design team at the end of 2008, and it has been very helpful in my ongoing studies of core performance issues. (Most of my prior work has been on memory systems and coherence protocols -- e.g., www.cs.virginia.edu/stream/ -- but now I am trying to learn more about core microarchitecture, performance, and power.)

This note concerns the L1 Data Cache Banking on Intel's Sandy Bridge (and presumably Ivy Bridge) processors.

Intel's Performance Optimization reference manual (document 248966-028, July 2013) says that a Sandy Bridge core incurs an L1 Data Cache bank conflict when two loads that are ready to issue in the same cycle target two different cache lines in the same cache set and their addresses have matching bits 5:2.
Actually they say bits 4:2 in section 2.2.5.2 (page 2-20) and bits 5:2 in section 3.6.1.3 (page 3-43), but an Intel employee confirmed that the latter was correct in a forum post at software.intel.com/en-us/forums/topic/280663

This seems odd, since 5:2 is four bits and they are clear in reporting that there are only 8 banks. In the forum posts, the Intel employees were clearly not being permitted to disclose the full details, so my curiosity was aroused.
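To make the mismatch concrete, here is a minimal Python sketch of the documented rule as paraphrased above. The function names are my own, and the 64-byte cache line size is an assumption of the sketch. The cache-set condition is deliberately not modeled; only the cache line number and address bits 5:2 are compared.

```python
LINE_BYTES = 64  # assumed cache line size for this sketch


def bits_5_2(addr: int) -> int:
    """Return address bits 5:2 (a 4-bit field)."""
    return (addr >> 2) & 0xF


def cache_line(addr: int) -> int:
    """Return the cache line number containing addr."""
    return addr // LINE_BYTES


def documented_conflict(a: int, b: int) -> bool:
    """Paraphrased rule: different lines, matching bits 5:2.

    The cache-set condition is NOT modeled here.
    """
    return cache_line(a) != cache_line(b) and bits_5_2(a) == bits_5_2(b)


# Bits 5:2 select one of 16 aligned 32-bit words within a 64-byte line,
# which is twice the 8 banks that Intel reports.
distinct_fields = {bits_5_2(a) for a in range(0, LINE_BYTES, 4)}
print(len(distinct_fields))
```

The point of the sketch is only that a 4-bit field distinguishes 16 positions, so a rule keyed on bits 5:2 behaves like 16 independent slots rather than 8.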

The example code that they provide in section 3.6.1.3 (example 3-37) attempts to load two 32-bit items from the same offset within two different cache lines mapping to the same cache set. This does demonstrate bank conflicts, but not very many. (The loads can dual-issue after the first cycle -- so the code takes 5 cycles to perform the 8 loads instead of 4 cycles.) Repeating the loop a million times and using performance counter event BFh, umask 05h: L1D_BLOCKS.BANK_CONFLICT_CYCLES confirmed the stalls.

Unfortunately the "corrected" version that they provide does not demonstrate that a difference in bits 5:2 will avoid a bank conflict.
Instead, it demonstrates that loading two adjacent 32-bit values from the *same* cache line results in no conflicts -- which is no surprise at all.

So I wrote a test similar to their example, except that all 8 loads were to the same offset in 8 different cache lines that mapped to the same cache set. This gave a measured bank conflict rate close to my estimate of 7/8 (no stall is counted for the first of the 8 loads, and the conflict continues for every load after the first).

Then I modified the offsets so that the 8 loads were to consecutive 32-bit locations in 8 different cache lines that mapped to the same set, i.e., a stride of 17 32-bit words instead of 16. This gave zero conflicts, which directly confirms that a difference in address bit 2 is enough to prevent a bank conflict (at least for 32-bit loads). That is quite an interesting result because it does not fit easily into the model of a cache having 64-bit wide or 128-bit wide banks (as you suggest in section 9.13 of your microarchitecture reference guide).
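The two experiments can be sketched as a small address calculation in Python. This is an illustration under simplifying assumptions, not the actual test code: the function name is hypothetical, the base address is taken to be line-aligned, each load after the first is compared only with its predecessor, and cache-set mapping is not modeled. The strides are the word strides quoted in the text.

```python
WORD_BYTES = 4   # 32-bit loads
LINE_BYTES = 64  # assumed cache line size


def conflict_fraction(stride_words: int, n_loads: int = 8, base: int = 0) -> float:
    """Fraction of loads predicted to conflict with the preceding load.

    A pair conflicts if the two addresses fall in different cache lines
    but have matching address bits 5:2. The first load never counts.
    """
    addrs = [base + i * stride_words * WORD_BYTES for i in range(n_loads)]
    conflicts = 0
    for prev, cur in zip(addrs, addrs[1:]):
        different_line = (prev // LINE_BYTES) != (cur // LINE_BYTES)
        same_bits_5_2 = ((prev >> 2) & 0xF) == ((cur >> 2) & 0xF)
        if different_line and same_bits_5_2:
            conflicts += 1
    return conflicts / n_loads


print(conflict_fraction(16))  # same offset in each line
print(conflict_fraction(17))  # consecutive 32-bit offsets in successive lines
```

With a stride of 16 words every load after the first matches its predecessor in bits 5:2, giving 7 of 8; with a stride of 17 words bits 5:2 advance by one on each load, so no pair matches.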

My current hypothesis is that the cache has 8 banks that are each 32 bits wide, but run at twice the processor core frequency -- giving an effective width of 64 bits, but a granularity of access of 32 bits -- almost the same as having 16 banks. The main idea is that each bank can accept two addresses per cycle and deliver two 32-bit results from different lines, but with the critical limitation that it can only deliver the low-order 32 bits in the first half-cycle and can only deliver the high-order 32 bits in the second half-cycle. This combination of features is the only mechanism I could think of that retains the bank conflict seen when bits 5:2 match but which allows dual issue when bits 5:2 differ.
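A toy Python model of this hypothesis, for 32-bit loads only, might look like the following. The mapping of address bits 5:3 to the bank number and bit 2 to the half-cycle slot is my reading of the description above and is an assumption, as are all the function names and the 64-byte line size. The final loop checks that the model agrees with the "bits 5:2 match" rule over a small range of aligned addresses.

```python
LINE_BYTES = 64  # assumed cache line size


def bank(addr: int) -> int:
    """Assumed bank index: address bits 5:3 (8 banks)."""
    return (addr >> 3) & 0x7


def half_cycle(addr: int) -> int:
    """Assumed half-cycle slot: bit 2 (0 = low 32 bits, 1 = high 32 bits)."""
    return (addr >> 2) & 0x1


def hypothesis_conflict(a: int, b: int) -> bool:
    """Two 32-bit loads from different lines collide only if they need
    the same bank in the same half-cycle."""
    different_line = (a // LINE_BYTES) != (b // LINE_BYTES)
    return different_line and bank(a) == bank(b) and half_cycle(a) == half_cycle(b)


def bits_5_2_conflict(a: int, b: int) -> bool:
    """Observed rule: different lines, matching address bits 5:2."""
    different_line = (a // LINE_BYTES) != (b // LINE_BYTES)
    return different_line and ((a >> 2) & 0xF) == ((b >> 2) & 0xF)


# Compare the two predicates over all pairs of 4-byte-aligned addresses
# in four consecutive lines.
addrs = range(0, 4 * LINE_BYTES, 4)
mismatches = [
    (a, b) for a in addrs for b in addrs
    if hypothesis_conflict(a, b) != bits_5_2_conflict(a, b)
]
print(len(mismatches))
```

For aligned 32-bit loads the two predicates coincide by construction, which is the sense in which the double-pumped design behaves almost like 16 banks. Wider or misaligned loads are not covered by this sketch.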

Technologically, a double-speed cache appears possible -- experiments with the CACTI cache simulator (http://www.hpl.hp.com/research/cacti/) suggest that a 32 KiB cache of similar configuration should be able to run at up to about 7.5 GHz in a 32 nm process technology, with an area similar to what I estimate from the Sandy Bridge die photos.

I have reviewed many of the other possible combinations of alignment for a pair of loads and my hypothesis appears to provide a plausible explanation of the observed behavior. There are some problematic cases with combinations that include a 128-bit load on a 32-bit boundary where my model suggests a bank conflict even when bits 5:2 differ, but I am not sure that Intel's claims about the ability to dual-issue are intended to cover all such misalignments (and I have not coded any of these cases to see if they actually generate bank conflicts).

This is part of work that I am doing to develop a set of microbenchmarks that can be used to document the behavior of hardware performance counters so that I can have some hope of using them to understand application characteristics. I have not had time to review your latency and throughput test codes yet, but I hope that with some modification (mostly controlling where the data is located when data motion instructions are executed) they will be useful in illuminating the specifics of what the performance counters are actually counting.

 
replythread Test results for Intel's Sandy Bridge processor - John D. McCalpin - 2013-10-09