Agner`s CPU blog

Test of Skylake-X and Goldmont

Author: Agner

Date: 2018-04-25 11:51

The Skylake-X processor is a Skylake processor with added support for the new instruction set AVX512, including AVX512BW, AVX512DQ, AVX512VL, and AVX512CD.

The performance is identical to earlier Skylakes except for the new AVX512 instructions, different cache sizes, and different number of CPU cores.

The AVX512 instruction set doubles the number of vector registers to 32 and doubles the size of these vector registers to 512 bits. The bigger vector registers let you have sixteen single precision floats in one vector register. AVX512 also allows conditional execution of selected elements in a vector by the use of seven new mask registers. This is useful if the code contains branches.

The new Skylake variants have three vector execution units. Two of these units are 256 bits and the last is 512 bits. The two 256-bit units are combined when executing a 512 bit vector. This means that you can do two 512-bit vector calculations per clock cycle.

Floating point addition, multiplication, and fused multiply-and-add (FMA) instructions have a throughput of two 512-bit vectors on some versions and one on other versions. Early Skylake-X processors with less than ten cores can do one 512-bit floating point instruction per clock, while Skylake-X with 10+ cores and newer Skylakes can do two. They can all do two 256-bit floating point vector instructions per clock. These processors are otherwise identical and they all have the same CPUID family number so they are difficult to distinguish by software. I guess Intel has given high priority to FMA instructions on the biggest processors so that they can boast of a high FLOPS measure.

Historically, the first microprocessor version to support a new bigger vector size has usually had poor performance because it simply used half-size execution units twice when processing the big vector. With this history in mind, the Skylake-X is actually better than expected. Many 512-bit vector instructions have the same latency and throughput as the corresponding 256-bit instructions. Simple integer vector instructions, such as addition, has a throughput of three instructions per clock for 256 bits and two instructions per clock for 512 bits. But I think it is rare that you will have three vector instructions issued simultaneously anyway because there are likely to be narrower bottlenecks elsewhere, such as cache access or instruction decoding.

So the conclusion is that you can speed up CPU-intensive code by almost a factor two by using the new 512-bit vector registers if memory access and caching is not a bottleneck. There is not much software that supports AVX512 yet so you will have to compile the code yourself to get this performance boost. But don't be surprised if the result is disappointing. Instruction decoding and cache access are still very likely to be the limiting factors.

The Intel Goldmont is a successor of Atom and Silvermont. These are small low-power processors for less demanding applications. The Goldmont has full out-of-order execution and a maximum throughput of three instructions per clock cycle. This is certainly enough for a lot of applications, such as small portable computers and also for low traffic servers where power consumption is an issue. The Goldmont does not support the higher instruction set extensions. The vector registers are only 128 bits and the highest instruction set it supports is SSE4.2. There is no AVX, AVX2, or AVX512. It does have the encryption instructions AES and SHA, though.

All the details are in my microarchitecture manual and my instruction tables: agner.org/optimize/#manuals

[Edited: This posting originally said something wrong about Cannon Lake. Please disregard all comments about Cannon Lake in this thread]

Reply To This Message

Next Message