Agner's CPU blog


Test results for Knights Landing
Author: John McCalpin  Date: 2016-12-08 12:21
I have done testing with random permutations and with the hardware prefetchers disabled (and with both at the same time), and the simple stride results with no HW PF match the permuted results with HW PF enabled once the permutation block gets big enough.
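For readers unfamiliar with this kind of test, here is a minimal sketch of one common way to build a block-permuted pointer chain. It is not the code used for the measurements above; the function names, the element granularity, and the choice to visit blocks in address order are all assumptions made for illustration. A real latency test would run the chase loop in C or assembly over cache-line-sized elements; Python is used here only to show the index structure.

```python
import random

def build_blocked_chain(n_elems, block_elems, seed=0):
    """Build a pointer-chasing chain that is shuffled within each block.

    Hypothetical helper: elements are grouped into consecutive blocks of
    block_elems, the visiting order is randomized inside each block, and
    blocks are visited in address order. Returns (next_index, start).
    """
    if n_elems <= 0 or block_elems <= 0 or n_elems % block_elems != 0:
        raise ValueError("n_elems must be a positive multiple of block_elems")
    rng = random.Random(seed)
    order = []
    for first in range(0, n_elems, block_elems):
        block = list(range(first, first + block_elems))
        rng.shuffle(block)  # random permutation inside this block only
        order.extend(block)
    # Link each element to the one visited after it, closing the loop.
    next_index = [0] * n_elems
    for pos, elem in enumerate(order):
        next_index[elem] = order[(pos + 1) % n_elems]
    return next_index, order[0]

def walk(next_index, start, steps):
    """Follow the chain for a number of steps and return the visited indices."""
    idx = start
    visited = []
    for _ in range(steps):
        visited.append(idx)
        idx = next_index[idx]
    return visited
```

With a small block the accesses stay close together and a hardware prefetcher can still help; as `block_elems` grows, each load becomes harder to predict, which is the regime where the permuted results would be expected to converge on the prefetcher-disabled results.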
I did these tests back in July, and we have changed a number of aspects of the system configuration since then, but I think that Transparent Huge Pages were enabled when I did these tests. I don't recall if this was before or after we disabled some of the C states. The "untile" frequency may also make a difference -- it automatically ramps up to full speed when running bandwidth tests, but when running latency tests the Power Control Unit may not think that the "untile" is busy enough to justify ramping up the frequency.
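Since the Transparent Huge Pages setting is easy to lose track of, it may help to record it alongside each run. The following is a small sketch, assuming a Linux system that exposes the usual sysfs file `/sys/kernel/mm/transparent_hugepage/enabled`, where the active mode is the one shown in square brackets. The function names are hypothetical.

```python
def parse_thp_mode(text):
    """Return the active mode, i.e. the token shown in square brackets."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no active mode found in: %r" % text)

def read_thp_mode(path="/sys/kernel/mm/transparent_hugepage/enabled"):
    """Read the current THP mode from sysfs (Linux only; path is assumed)."""
    with open(path) as f:
        return parse_thp_mode(f.read())
```

Calling `read_thp_mode()` at the start of a benchmark and logging the result avoids having to reconstruct the setting from memory later.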

Without knowledge of the tag directory hash, the processor placement, the MCDRAM hash, etc., it is challenging to make much sense of the results. On KNC the RDTSC instruction had about a 5 cycle latency, so I was able to do a lot more with timing individual loads, and the single-ring topology made the analysis easier.

There are more performance counters in the "untile" on KNL, but there is no documentation on where the various boxes are located on the mesh. There is some evidence that the CHA box numbers are re-mapped -- on a 32-tile/64-core Xeon Phi 7210 all 38 CHAs are active, but the six CHAs with anomalous readings are numbered 32-37. The missing core APIC IDs are not bunched up in this way.

The stacked memory modules have slightly higher latency because they are typically run in "closed page" mode, and because there is an extra set of chip-to-chip crossings. HMC (and Intel's MCDRAM) have an extra SERDES step between the memory stack and the processor chip. There are many different approaches to error checking on SERDES links, but it is probably safe to expect that error checking will add at least some latency.

 
Test results for Knights Landing - Agner - 2016-11-26
Test results for Knights Landing - Nathan Kurz - 2016-11-26
Test results for Knights Landing - Tom Forsyth - 2016-11-27
Test results for Knights Landing - Søren Egmose - 2016-11-27
Test results for Knights Landing - Agner - 2016-11-30
Test results for Knights Landing - Joe Duarte - 2016-12-03
Test results for Knights Landing - Agner - 2016-12-04
Test results for Knights Landing - Constantinos Evangelinos - 2016-12-05
Test results for Knights Landing - John McCalpin - 2016-12-06
Test results for Knights Landing - Agner - 2016-12-06
Test results for Knights Landing - John McCalpin - 2016-12-08
Test results for Knights Landing - Joe Duarte - 2016-12-07
Test results for Knights Landing - zboson - 2016-12-28
VZEROUPPER - Agner - 2016-12-28
Test results for Knights Landing - Ioan Hadade - 2017-07-13
Test results for Knights Landing - Agner - 2017-07-13
INC/DEC throughput - Peter Cordes - 2017-10-09
INC/DEC throughput - Agner - 2017-10-10