Agner's CPU blog

Test results for Knights Landing
Author: Nathan Kurz  Date: 2016-11-26 19:07
Thanks for publishing this! Some comments and questions:

> The Knights Landing has full out-of-order capabilities

It may be worth mentioning that the Memory Execution Cluster is still in order. The Intel Architectures Optimization Reference Manual says "The MEC has limited capability in executing uops out-of-order. Specifically, memory uops are dispatched from the scheduler in-order, but can complete in any order." More specifically, they mean "in order with regard to other memory operations": unlike on other modern Xeons, a memory operation with an unfulfilled dependency will block subsequent memory operations from being dispatched, even if their dependencies are ready.
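
A toy model may make the difference concrete. This is only a sketch under my own assumptions, not something taken from the manual: the function name `dispatch_cycles` is made up, each memory uop is described only by the cycle in which its inputs become ready, the dispatch width of two is borrowed from the manual's description of the MEC reservation station, and the 12-entry capacity is ignored.

```python
def dispatch_cycles(ready, in_order, width=2):
    """Return the cycle in which each memory uop is dispatched.

    ready[i] is the first cycle in which uop i has all of its inputs.
    width is the number of uops that may be dispatched per cycle (assumed 2).
    With in_order=True, an older uop that is not ready blocks all younger ones.
    """
    n = len(ready)
    dispatched = [None] * n
    remaining = n
    cycle = 0
    while remaining:
        slots = width
        for i in range(n):  # scan from oldest to youngest
            if slots == 0:
                break
            if dispatched[i] is not None:
                continue
            if ready[i] <= cycle:
                dispatched[i] = cycle
                slots -= 1
                remaining -= 1
            elif in_order:
                break  # an unready older uop stalls everything behind it
        cycle += 1
    return dispatched


# The second uop waits on a slow dependency until cycle 10.
print(dispatch_cycles([0, 10, 0, 0], in_order=True))   # [0, 10, 10, 11]
print(dispatch_cycles([0, 10, 0, 0], in_order=False))  # [0, 10, 0, 1]
```

In the in-order case the two independent uops behind the stalled one wait the full ten cycles, which is the behavior I understand the manual to describe.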

> The reservation stations have 2x12 entries for the integer lines, 2x20 entries
> for the floating point and vector lines, and 12 for the memory lines.

The manual says "The single MEC reservation station has 12 entries, and dispatches up to 2 uops per cycle." I think this means that both 'memory lines' share a single 12-slot queue that is used for both scalar and vector operations? If so, it might be good to be more explicit.

> The Knights Landing has two decoders. The maximum throughput is two
> instructions or 16 bytes per clock cycle.

The manual says "The front end can fetch 16 bytes of instructions per cycle. The decoders can decode up to two instructions of not more than 24 bytes in a cycle.", and then later says "The total length of the instruction bytes that can be decoded each cycle is at most 16 bytes per cycle with instructions not more than 8 bytes in length. For instruction length exceeding 8 bytes, only one instruction per cycle is decoded on decoder 0." I haven't figured out when the 24-byte limit would apply. Have you? Also, is it correct that these limits are not affected by alignment?
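
For what it's worth, here is how I read the second quoted rule, written as a small counting model. It is a sketch under assumptions: the function name `decode_cycles` is hypothetical, alignment and fetch boundaries are ignored (which is exactly the open question above), and the 24-byte figure is left out because I do not know when it applies.

```python
def decode_cycles(lengths, max_bytes=16, long_threshold=8):
    """Count decode cycles for a straight-line sequence of instructions.

    lengths is a list of instruction lengths in bytes.
    Assumed rule: up to two instructions decode per cycle, but an
    instruction longer than long_threshold bytes decodes alone.
    """
    cycles = 0
    i = 0
    n = len(lengths)
    while i < n:
        cycles += 1
        first = lengths[i]
        i += 1
        if first > long_threshold:
            continue  # long instruction: decoder 0 only, nothing pairs with it
        if i < n:
            second = lengths[i]
            # With the default limits the byte check is always true for two
            # short instructions; it is kept so the limits can be varied.
            if second <= long_threshold and first + second <= max_bytes:
                i += 1
    return cycles


print(decode_cycles([4, 4, 4, 4]))  # 2
print(decode_cycles([4, 9, 4]))     # 3
```

Under this reading a single long instruction in the middle of a pair costs a full extra cycle, since neither of its neighbors can share a cycle with it.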

> The throughput is limited to two instructions per clock cycle
> in the decode and register rename stages.
> the average throughput is limited to two μops per clock cycle.
> Read-modify, and read-modify-write instructions generate a single μop
> from the decoders, which is sent to both the memory unit and the execution unit.

By emphasizing that it's a 'single µop', do you mean that the same µop is first sent to the memory unit and then, when the data is ready, (Figure 16-2) sent to the Integer or FP rename buffer as appropriate? And thus, since the renamer can handle only two instructions per cycle, this means that, unlike on other Xeons, the use of 'read-modify' (aka 'load-op') instructions does not usually help to increase µop throughput?

> There is no penalty for the 2- and 3-byte VEX prefixes and 4-byte EVEX prefixes
> unless there are additional prefixes before these.

It's fairly easy to check that all the instructions are encoded in no more than 8 bytes, but I'm not sure I know how to correctly count the prefixes. Do you have any examples of common cases where this would be a problem?

> The Knights Landing has no loop buffer, unlike the Silvermont. This means
> that the decoding of instructions is a very likely bottleneck, even for small loops.

The corollary to this is that loop unrolling can be a win on KNL even when it would be counterproductive on a Xeon with a loop buffer.
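
A back-of-the-envelope calculation illustrates why. The sketch below assumes a decode limit of two instructions per cycle, a fixed loop overhead of two instructions (counter update plus branch), and no other bottleneck; the function name and the overhead figure are my own assumptions, and the taken-branch delay is ignored.

```python
import math


def decode_cycles_per_element(body, overhead=2, unroll=1, width=2):
    """Decode cycles spent per loop element when decoding is the only limit.

    body is the number of instructions of useful work per element,
    overhead is the number of loop-control instructions per iteration,
    unroll is the unroll factor, width is instructions decoded per cycle.
    """
    instructions = unroll * body + overhead
    return math.ceil(instructions / width) / unroll


for unroll in (1, 4, 8):
    print(unroll, decode_cycles_per_element(body=2, unroll=unroll))
# 1 2.0
# 4 1.25
# 8 1.125
```

With two useful instructions per element, the loop-control instructions double the decode cost in the rolled loop, and unrolling amortizes them toward the one-cycle floor.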

> These are forked after the register allocate and renaming stages into two the integer unit

I think there is a missing word after 'two'?

> A 64-bit register can be cleared by xor'ing the corresponding 32-bit register with itself.

Hmm, interesting find. Do you see any reason they would choose to recognize the 32-bit idiom but not the 64-bit? Or just oversight?

> The processor can do two memory reads per clock cycle or one read and
> one write with vector registers of up to 512 bits. It cannot do two reads with
> general purpose registers in the same clock cycle, but it can do one
> read and one write.

I can't tell from this (or the Intel manual) which "vector plus scalar" memory operations are possible in the same cycle. Do you know if it can read a vector and a scalar in the same cycle? Read one and write the other?

> The latency from the mask register, k1, to the destination is 2 clock cycles.
> The latency from zmm1 input to zmm1 output is 2 clock cycles.

I'm not understanding what you are measuring here. Are you saying that if you have an instruction that changes the mask or the input, immediately followed by a masked read, you will have an effective load latency of 2 + 5? Or something else?

> The {z} option makes the elements in the destination zero, rather than
> unchanged, when the corresponding mask bit is zero. It is recommended
> to use the zeroing option when possible to avoid the dependence on the previous value.

Does this dependency occur only when explicitly using a mask, or does it matter in other cases too? That is, does {z} make a difference without a mask?
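
To pin down what I am asking, here is a scalar emulation of the two masking modes as I understand their architectural semantics. It is only an illustration: the function name `masked_op` is made up, vectors are plain lists, and it says nothing about how the hardware actually tracks the dependency.

```python
def masked_op(dst, src, mask, zeroing):
    """Emulate a masked vector move, element by element.

    dst is the previous content of the destination register,
    src is the new data, and bit i of mask selects element i.
    zeroing=True models {z}; zeroing=False models merge masking.
    """
    out = []
    for i, (old, new) in enumerate(zip(dst, src)):
        if (mask >> i) & 1:
            out.append(new)
        elif zeroing:
            out.append(0)    # {z}: the old destination value is never read
        else:
            out.append(old)  # merging: the old destination is an input
    return out


old = [1, 2, 3, 4]
new = [10, 20, 30, 40]
print(masked_op(old, new, 0b0101, zeroing=False))  # [10, 2, 30, 4]
print(masked_op(old, new, 0b0101, zeroing=True))   # [10, 0, 30, 0]
print(masked_op(old, new, 0b1111, zeroing=False))  # [10, 20, 30, 40]
```

With an all-ones mask the result does not depend on the old destination in either mode, so the question is whether the hardware still treats it as an input.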

> The best solution in such cases may be to turn off hyper-threading in the BIOS setup.

Do you think there is a difference between turning off hyper-threading in the BIOS and telling the OS to disable some of the cores at runtime? The manual says "Hard partitioning of resources changes as logical processors wake up and go to sleep", which makes me think that doing it at the OS level would be fine, although I'm not sure exactly what it means by 'sleep'.

> 15.11 Bottlenecks in Knights Landing

As shown in your Instruction Table, the extreme cost of some of the "byte oriented" vector instructions might be worth calling out. The ones that stood out for me were VPSHUFB y,y,y with latency 23 and reciprocal throughput 12 (versus 1 and 1 for Skylake) and PMOVMSKB r32,y with latency 26 and reciprocal throughput 12 (versus 2-3 and 1 for Skylake). Specific mention of some of these might be helpful, since while these are still technically supported, it's unlikely that you'd want to use an algorithm that depends on them.

There are a couple of passages in the manual that I haven't been able to decipher. Do you perhaps understand this one, which (among other things) pertains to shuffles and permutes? "Some execution units in the VPU may incur scheduling delay if a sequence of dependent uop flow needs to use these execution units, these outlier units are indicated by the footnote of Table 16-2. When this happens, it will have an additional cost of a 2-cycle bubble."

The other line that scares me in the manual is this one: "The decoder will also have a small delay if a taken branch is encountered." Did you happen to figure out how long this delay is?

Thanks again for making this great research available! I'm sure a lot of people will greatly appreciate it.

thread Test results for Knights Landing new - Agner - 2016-11-26
reply Test results for Knights Landing - Nathan Kurz - 2016-11-26
replythread Test results for Knights Landing new - Tom Forsyth - 2016-11-27
reply Test results for Knights Landing new - Søren Egmose - 2016-11-27
last reply Test results for Knights Landing new - Agner - 2016-11-30
replythread Test results for Knights Landing new - Joe Duarte - 2016-12-03
replythread Test results for Knights Landing new - Agner - 2016-12-04
last reply Test results for Knights Landing new - Constantinos Evangelinos - 2016-12-05
last replythread Test results for Knights Landing new - John McCalpin - 2016-12-06
replythread Test results for Knights Landing new - Agner - 2016-12-06
last reply Test results for Knights Landing new - John McCalpin - 2016-12-08
last reply Test results for Knights Landing new - Joe Duarte - 2016-12-07
replythread Test results for Knights Landing new - zboson - 2016-12-28
last reply VZEROUPPER new - Agner - 2016-12-28
replythread Test results for Knights Landing new - Ioan Hadade - 2017-07-13
last reply Test results for Knights Landing new - Agner - 2017-07-13
last replythread INC/DEC throughput new - Peter Cordes - 2017-10-09
last reply INC/DEC throughput new - Agner - 2017-10-10