Agner's CPU blog

Test results for Broadwell and Skylake
Author: John D. McCalpin  Date: 2016-01-12 13:54
I just ran across some performance counter bugs on Haswell that may influence one's interpretation of instruction retirement rates and may bias measurements of uops per instruction.

I put performance counters around 100 (outer) iterations of a simple 10-instruction loop that executed 1000 times. According to Agner's instruction tables this loop should have 12 uops. Both the fixed-function "instructions retired" counter and the programmable "INST_RETIRED.ANY_P" event reported 12 instructions per loop iteration (not 10), while the programmable UOPS_RETIRED.ALL event reported 14 uops per loop iteration (not 12). While I could be misinterpreting the uop counts, there is no way that I could have miscounted the instructions: it took all of my fingers, but did not generate an overflow condition. ;-)
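
To put numbers on the size of the discrepancy, here is a minimal Python sketch of the arithmetic. The raw counter deltas are not given above, so the totals below are hypothetical values chosen to match the per-iteration figures in the post (measurement overhead is ignored), and the helper names are my own.

```python
def per_iteration(delta, outer_iters, inner_iters):
    """Average event count per loop iteration from a raw counter delta."""
    return delta / (outer_iters * inner_iters)


def relative_error(observed, expected):
    """Fractional overcount (positive) or undercount (negative)."""
    return (observed - expected) / expected


OUTER, INNER = 100, 1000          # iteration counts from the post
EXPECTED_INST = 10                # instructions in the loop body
EXPECTED_UOPS = 12                # uops according to the instruction tables

# Hypothetical raw deltas, back-computed from the reported per-iteration
# values (12 instructions, 14 uops); not measured data.
inst_delta = 12 * OUTER * INNER
uops_delta = 14 * OUTER * INNER

inst_per_iter = per_iteration(inst_delta, OUTER, INNER)
uops_per_iter = per_iteration(uops_delta, OUTER, INNER)
inst_err = relative_error(inst_per_iter, EXPECTED_INST)
uops_err = relative_error(uops_per_iter, EXPECTED_UOPS)

print(f"instructions/iter: {inst_per_iter:.2f} ({inst_err:+.1%} vs expected)")
print(f"uops/iter:         {uops_per_iter:.2f} ({uops_err:+.1%} vs expected)")
```

Under these assumptions the instruction count is high by 2 per iteration (one fifth of the expected value) and the uop count is high by 2 per iteration (one sixth of the expected value).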

It turns out that there are a number of errata for both the instructions retired events and the uops retired event on all Intel Haswell processors. Somewhat perversely, the different Haswell products have different errata listed, even though they have the same DISPLAYFAMILY_DISPLAYMODEL designation, but all of them that I checked (Xeon E5 v3 (HSE71 in doc 330785), Xeon E3 v3 (HSW141 in doc 328908), and 4th Generation Core Desktop (HSD140 in doc 328899)) include an erratum to the effect that the "instructions retired" counts may overcount or undercount. This erratum is also listed for the 5th Generation Core (Broadwell) processors (BDM61 in doc 330836), but is not listed in the "specification update" document for the Skylake processors (doc 332689).

For this particular loop the counts are completely stable with respect to variations in loop length (e.g., from 500 to 11000 shows no effect other than asymptotically decreasing overhead). The machine is running with HyperThreading enabled, but there are no other users or non-OS tasks and this job was pinned to (local) core 4 on socket 1, so there is no way that interference with another thread (mentioned in several other errata) could account for seeing identical behavior over several hundred trials.
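
The "asymptotically decreasing overhead" can be pictured with a simple model: a fixed number of extra events per measurement is amortized over the loop length. This is only a sketch of that model in Python; the overhead value is a made-up placeholder, not something I measured, and the function name is my own.

```python
def measured_per_iteration(true_per_iter, fixed_overhead, loop_length):
    """Apparent events per iteration when a fixed overhead is spread over the loop."""
    return (true_per_iter * loop_length + fixed_overhead) / loop_length


TRUE_PER_ITER = 12     # steady-state instructions per iteration reported above
FIXED_OVERHEAD = 50    # hypothetical events added by the counter reads and loop setup

loop_lengths = [500, 1000, 5000, 11000]
apparent = [measured_per_iteration(TRUE_PER_ITER, FIXED_OVERHEAD, n)
            for n in loop_lengths]

for n, value in zip(loop_lengths, apparent):
    print(f"loop length {n:>5}: {value:.4f} events per iteration")
```

In this model the apparent count approaches the steady-state value from above as the loop gets longer, while the steady-state value itself does not move, which is the behavior described above.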

Reading between the lines, the language that Intel uses in the description of this performance counter erratum seems consistent with the language used in other cases for which the errors are not "large" (not approaching 100%), but are also not "small" (not limited to single-digit percentages). It is very hard to decide whether I want to take the time to try to characterize or bound this particular performance counter error. It may end up having an easy story, or it may end up being completely inexplicable without inspection of the processor RTL.

 