Agner's CPU blog

Test results for Knights Landing

Author: Agner

Date: 2016-11-30 04:33

Tom Forsyth wrote:

Looks like a typo in the KNL recip-throughput number for FMA - it's currently 3. KNF and KNC get 1, and this chip is a real FMA machine - it's designed around that unit. Pretty sure the correct number for KNL is 0.5 (like VADDPS and VMULPS).

You are right. My mistake. Fused multiply-and-add has a throughput of two instructions per clock.
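To make the corrected number concrete: a reciprocal throughput of 0.5 clock cycles is the same thing as two instructions per clock. The small Python sketch below converts between the two and estimates the resulting peak single-precision rate per core. It assumes 512-bit vectors (16 floats) and counts each fused multiply-and-add as two floating-point operations. The function names are only illustrative.

```python
def instructions_per_clock(reciprocal_throughput):
    """Convert reciprocal throughput (clocks per instruction) to instructions per clock."""
    return 1.0 / reciprocal_throughput


def peak_flops_per_clock(reciprocal_throughput, vector_bits=512, element_bits=32):
    """Peak floating-point operations per clock per core for FMA.

    Assumption: every lane of the vector is used, and one FMA counts
    as two operations (one multiply plus one add).
    """
    lanes = vector_bits // element_bits
    return instructions_per_clock(reciprocal_throughput) * lanes * 2


print(instructions_per_clock(0.5))  # 2.0
print(peak_flops_per_clock(0.5))    # 64.0
```

With the mistaken value of 3 the same calculation would give one FMA every three clocks, which is why the number looked wrong for a chip designed around its FMA units.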

As for why the chip has 4 threads per core - I'm the guy that persuaded KNF (and thus KNC) to have 4 threads, and the reason is first to hide memory misses, second branch mispredicts, and third instruction latencies. Those are all huge bottlenecks in real-world performance. Yes, you can also hide them with huge OOO machines, wide decoders, and long pipelines, but when flops/watt is your efficiency metric, those aren't the first choice.

Thanks for clarifying the reason. Running 4 threads in an in-order core makes sense. Do you think that 4 threads is still useful in the out-of-order KNL?

Nathan Kurz wrote:

The Knights Landing has full out-of-order capabilities
It may be worth mentioning that the Memory Execution Cluster is still in order. The Intel Architectures Optimization Reference Manual says "The MEC has limited capability in executing uops out-of-order. Specifically, memory uops are dispatched from the scheduler in-order, but can complete in any order." More specifically, they mean "in order with regard to other memory operations":

Memory operations are scheduled in order but executed out of order, as I understand it.

The manual says "The front end can fetch 16 bytes of instructions per cycle. The decoders can decode up to two instructions of not more than 24 bytes in a cycle.", and then later says "The total length of the instruction bytes that can be decoded each cycle is at most 16 bytes per cycle with instructions not more than 8 bytes in length. For instruction length exceeding 8 bytes, only one instruction per cycle is decoded on decoder 0." I haven't figured out when the 24B limit would apply. Have you? Also, is it correct that these limits are not affected by alignment?

I just tried. A block of two instructions of 24 bytes total can decode in a single clock cycle, but you cannot have consecutive blocks exceeding 16 bytes, and the average cannot exceed 16 bytes per clock. And yes, alignment matters. Decoding is most efficient when aligned by 16. There is probably a double buffer of 2x16 bytes.

Read-modify, and read-modify-write instructions generate a single μop from the decoders, which is sent to both the memory unit and the execution unit.
By emphasizing that it's a 'single μop', do you mean that the same μop is first sent to the memory unit and then, when the data is ready, sent to the Integer or FP rename buffer as appropriate?

Yes, that's how I understand it.

There is no penalty for the 2- and 3-byte VEX prefixes and 4-byte EVEX prefixes unless there are additional prefixes before these.
It's fairly easy to check that all the instructions are encoded to less than 8B, but I'm not sure I know how to correctly count the prefixes. Do you have any examples of common cases where this would be a problem?

You do not normally need any additional prefixes in front of VEX and EVEX prefixes. The only case is FS and GS segment prefixes for the thread environment blocks, and these blocks are usually accessed with integer instructions. But instructions without VEX and EVEX can be a mess of prefixes. Legacy SSSE3 instructions without VEX prefix all have a meaningless 66H prefix and a 2-byte escape code (0FH, 38H). An additional REX prefix is needed if register r8-r15 or xmm8-xmm15 is used. This gives a total of 4 prefix and escape bytes, which is more than the decoder can handle in a single clock cycle. Another example is the ADCX and ADOX instructions with 64-bit registers.
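As a rough way of counting, the Python sketch below walks the start of a non-VEX instruction encoding in 64-bit mode and counts legacy prefixes, an optional REX prefix, and the escape bytes. It is a simplified illustration and not a full decoder: it does not handle VEX or EVEX encodings, and the limit of three bytes is an assumption inferred from the statement above that four is too many. The encodings used are PSHUFB (66 0F 38 00 /r) and ADCX (66 REX.W 0F 38 F6 /r).

```python
# Legacy prefixes: operand size, address size, LOCK, REPNE, REP, segment overrides
LEGACY_PREFIXES = {0x66, 0x67, 0xF0, 0xF2, 0xF3,
                   0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65}

# Assumption: inferred from "4 ... is more than the decoder can handle"
ASSUMED_FAST_LIMIT = 3


def count_prefix_and_escape_bytes(code):
    """Count prefix and escape bytes of a non-VEX instruction in 64-bit mode."""
    i = 0
    while i < len(code) and code[i] in LEGACY_PREFIXES:
        i += 1
    if i < len(code) and 0x40 <= code[i] <= 0x4F:   # REX prefix
        i += 1
    if i < len(code) and code[i] == 0x0F:           # first escape byte
        i += 1
        if i < len(code) and code[i] in (0x38, 0x3A):  # second escape byte
            i += 1
    return i


examples = {
    "pshufb xmm0, xmm1": bytes.fromhex("66 0F 38 00 C1"),
    "pshufb xmm8, xmm9": bytes.fromhex("66 45 0F 38 00 C1"),
    "adcx rax, rbx":     bytes.fromhex("66 48 0F 38 F6 C3"),
}
for name, code in examples.items():
    n = count_prefix_and_escape_bytes(code)
    verdict = "ok" if n <= ASSUMED_FAST_LIMIT else "too many"
    print(f"{name}: {n} prefix and escape bytes ({verdict})")
```

The first example gives 3, while the other two give 4 because of the extra REX byte.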

A 64-bit register can be cleared by xor'ing the corresponding 32-bit register with itself.
Hmm, interesting find. Do you see any reason they would choose to recognize the 32-bit idiom but not the 64-bit? Or just oversight?

An optimizing compiler would use xor eax,eax rather than xor rax,rax because the former is shorter. But it may be an oversight. There is no difference in length between xor r8,r8 and xor r8d,r8d.
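The length argument can be checked by encoding the instructions by hand. The Python sketch below builds the machine code for xor reg,reg using opcode 31 /r. The helper name is made up for this illustration, and it covers only the register-to-same-register form.

```python
def encode_xor_self(reg, wide):
    """Encode `xor reg, reg` for register number 0-15 (0 = eax/rax, 8 = r8d/r8).

    wide=True selects the 64-bit form, wide=False the 32-bit form.
    """
    rex = 0x40
    if wide:
        rex |= 0x08          # REX.W: 64-bit operand size
    if reg >= 8:
        rex |= 0x05          # REX.R and REX.B: both operands are r8-r15
    modrm = 0xC0 | ((reg & 7) << 3) | (reg & 7)   # register-direct, reg == rm
    prefix = bytes([rex]) if rex != 0x40 else b""  # omit a REX prefix with no bits set
    return prefix + bytes([0x31, modrm])


for name, reg, wide in [("xor eax,eax", 0, False), ("xor rax,rax", 0, True),
                        ("xor r8d,r8d", 8, False), ("xor r8,r8", 8, True)]:
    code = encode_xor_self(reg, wide)
    print(f"{name}: {code.hex(' ')} ({len(code)} bytes)")
```

This gives 2 bytes for xor eax,eax and 3 bytes for each of the other three, because a REX prefix is needed either for the 64-bit operand size or for the high register number.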

I can't tell from this (or the Intel manual) which "vector plus scalar" memory operations are possible in the same cycle. Do you know if it can read a vector and a scalar in the same cycle?

Yes, it can do a vector read and a g.p. register read in the same clock cycle. It can also do any combination of a read and a write.

The latency from the mask register, k1, to the destination is 2 clock cycles. The latency from zmm1 input to zmm1 output is 2 clock cycles.
I'm not understanding what you are measuring here. Are you saying that if you have an instruction that changes the mask or the input, immediately followed by a masked read, you will have an effective load latency of 2 + 5? Or something else?

The latencies come in parallel, not in series, so the latency from a mask or destination register will be 2 if the memory operand is ready.
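A simplified way to picture "parallel, not in series" is that the result is ready when the slowest of the input paths has finished. The Python sketch below models this with a max over the paths. The load latency of 5 is only the number from the question above and is used as an assumption, and the function is a toy model, not a description of the actual scheduler.

```python
def result_ready(mem_ready, mask_ready, dest_ready,
                 load_latency=5, reg_latency=2):
    """Clock cycle at which a masked load with merging produces its result.

    mem_ready:  cycle at which the load can start (address available)
    mask_ready: cycle at which the mask register k1 is available
    dest_ready: cycle at which the old value of the destination is available
    The latencies 5 and 2 are assumptions for illustration.
    """
    return max(mem_ready + load_latency,
               mask_ready + reg_latency,
               dest_ready + reg_latency)


# Everything available at cycle 0: the load path dominates.
print(result_ready(0, 0, 0))    # 5

# Mask arrives late at cycle 10, memory operand already loaded by then:
# the result comes 2 cycles after the mask, not 2 + 5.
print(result_ready(0, 10, 0))   # 12
```

In the second case a serial model would have predicted cycle 17, whereas the parallel model gives 12.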

The {z} option makes the elements in the destination zero, rather than unchanged, when the corresponding mask bit is zero. It is recommended to use the zeroing option when possible to avoid the dependence on the previous value.
Does this dependency occur only when explicitly using a mask, or does it matter in other cases too? That is, does {z} make a difference without a mask?

A masked instruction has a dependency on the destination register if there is a mask and no {z}. There is no dependency on the destination register if there is no mask or if there is a mask with a {z}. It doesn't make sense to put {z} when there is no mask.
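The reason for the dependency can be seen from what the two masking modes have to compute. The Python sketch below models a masked vector addition lane by lane. It is only a model of the semantics with an invented function name, not real intrinsics. With merging, the old destination value is an input to the result. With zeroing it is never read.

```python
def masked_add(a, b, mask, old_dest=None):
    """Lane-by-lane model of a masked vector add.

    mask:     integer, bit i controls lane i
    old_dest: previous destination contents for merge masking,
              or None to model the zeroing option {z}
    """
    out = []
    for i, (x, y) in enumerate(zip(a, b)):
        if (mask >> i) & 1:
            out.append(x + y)          # lane enabled: normal result
        elif old_dest is None:
            out.append(0)              # {z}: disabled lane becomes zero
        else:
            out.append(old_dest[i])    # merging: old value is needed as input
    return out


a, b = [1, 2, 3, 4], [10, 20, 30, 40]
print(masked_add(a, b, 0b0101, old_dest=[9, 9, 9, 9]))  # [11, 9, 33, 9]
print(masked_add(a, b, 0b0101))                          # [11, 0, 33, 0]
```

Without a mask every lane is written, which corresponds to a mask of all ones here, so the old destination value is never needed in that case either.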

The best solution in such cases may be to turn off hyper-threading in the BIOS setup.
Do you think there is a difference between turning off hyper-threading in BIOS, rather than telling the OS to disable some of the cores at runtime?

There is no difference. The problem is that current operating systems are not handling hyper-threading optimally. In fact, the O.S. lacks the necessary information. The CPUID instruction can tell how many threads are sharing the same L1 or L2 cache, but it does not tell which threads are sharing a cache, and it does not tell if threads are sharing decoders, execution units, and other resources. The proper solution would be to make the CPUID instruction tell which resources are shared between which threads, and make the operating system give high-priority threads an unshared core of their own. But still, the O.S. does not know what the bottleneck is in each thread. It may be OK to put two threads in the same core if the bottleneck is memory access, but not if the bottleneck is instruction decoding. The application programmer or the end user cannot control this either if programs are running in a multiuser system. The hardware designers have given an impossible task to the software developers.

Do you perhaps understand this one, which (among other things) pertains to shuffles and permutes? "Some execution units in the VPU may incur scheduling delay if a sequence of dependent uop flow needs to use these execution units, these outlier units are indicated by the footnote of Table 16-2. When this happens, it will have an additional cost of a 2-cycle bubble"

There is an extra latency of 1-2 clock cycles when, for example, the output of an addition instruction goes to the input of a shuffle instruction. The data have to travel a longer distance to these units that they call outliers. There is less latency when the output of a shuffle instruction goes to the input of another shuffle instruction.