Agner's CPU blog


 
thread Future instruction set: AVX-512 - Agner - 2013-10-09
replythread Future instruction set: AVX-512 - Elhardt - 2013-10-25
last reply Future instruction set: AVX-512 - Agner - 2013-10-26
replythread Future instruction set: AVX-512 - Agner - 2014-10-08
replythread AVX512 Instruction Timing for Knights Landing - Jorcy Neto - 2016-06-21
last replythread AVX512 Instruction Timing for Knights Landing - Agner - 2016-06-22
last replythread AVX512 Instruction Timing for Knights Landing - Jorcy Neto - 2016-06-23
last reply AVX512 Instruction Timing for Knights Landing - Jorcy Neto - 2016-08-30
last replythread Future “vector+SIMD” extensions over AVX-512 - Jorcy Neto - 2016-11-18
last replythread Future “vector+SIMD” extensions over AVX-512 - Agner - 2016-11-18
last replythread Future “vector+SIMD” extensions over AVX-512 - Jorcy Neto - 2017-06-21
last replythread Future “vector+SIMD” extensions over AVX-512 - Jorcy Neto - 2017-06-26
last reply Future “vector+SIMD” extensions over AVX-512 - Jorcy Neto - 2017-08-24
last reply Future instruction set: AVX-512 - - - 2017-10-20
 
Future instruction set: AVX-512
Author: Agner Date: 2013-10-09 09:36

Intel have announced the next big instruction set extension, AVX-512, to be implemented in 2015 or 2016. The details are defined in the Intel Architecture Instruction Set Extensions Programming Reference. There are many interesting extensions:

  • The size of the vector registers is extended from 256 bits (YMM registers) to 512 bits (ZMM registers). There is room for further extensions to at least 1024 bits (what will they be called?).
  • The number of vector registers is doubled to 32 registers in 64-bit mode. There will still be only 8 vector registers in 32-bit mode.
  • Eight new mask registers k0 - k7 allow masked and conditional operations. Most vector instructions can be masked so that they operate only on selected vector elements while the remaining vector elements are unchanged or zeroed. This will replace the use of vector registers as masks.
  • Most vector instructions with a memory operand have an option for broadcasting a scalar operand.
  • Floating point vector instructions have options for specifying the rounding mode and for suppressing exceptions.
  • There is a new addressing mode called compressed displacement. Where instructions have a memory operand with a pointer and an 8-bit sign-extended displacement, the displacement is multiplied by the size of the operand. This makes it possible to address a larger interval with just a single byte displacement as long as the memory operands are properly aligned. This makes the instructions smaller in some cases to compensate for the longer prefix.
  • More than 100 new instructions.
  • The 512-bit registers can do vector operations on 32-bit and 64-bit signed and unsigned integers and single and double precision floats, but unfortunately not on 8-bit and 16-bit integers.

A year ago, Intel announced a similar instruction set with 512-bit registers in Intel Xeon Phi Coprocessor Instruction Set Architecture Reference Manual. The two instruction sets are very similar, both are backwards compatible, but they are not compatible with each other. The two instruction sets differ by a single prefix bit, even for otherwise identical instructions. I assume that the Knights Corner or Xeon Phi instruction set will have a short life and be replaced by AVX512.

The AVX512 instruction set uses a new 4-byte prefix named EVEX, which is similar to the 2- or 3-byte VEX prefix, but with 62 (hexadecimal) as the first byte. (Actually, I predicted several years ago that the 62 byte would be used for such a prefix because it was the only remaining byte that could be used in the same way as the VEX prefix bytes.) The extra bits in the EVEX prefix are used for doubling the number of registers, for specifying the vector size, and for the extra features of broadcasting, masking, zeroing, specifying the rounding mode, and suppressing floating point exceptions.

The calling conventions for the new registers are partially defined in a draft ABI, but it is still under discussion whether the new registers should have callee-save status; see the Gnu libc-alpha mailing list.

I have commented on the AVX512 instruction set and suggested various improvements at Intel's blog and Intel's forum.

The new instruction sets are supported by my objconv disassembler.

   
Future instruction set: AVX-512
Author: Elhardt Date: 2013-10-25 15:57
Hello Agner. You've mentioned most of the important improvements that AVX512 will bring us. However, you've missed an important one that I think should also have been mentioned. AVX512 will include reciprocal estimates that are accurate to 2^-28. That means that for single precision floating point, no time-consuming Newton-Raphson refinement needs to be done. This can be a major speed boost for division (and square roots also get the more accurate estimate). Intel's divisions have gotten a lot faster over the years, to the point where they appear to be faster than the reciprocal/Newton-Raphson method. But now it looks like using the new reciprocal estimation is a way to leap ahead of the divide instructions to gain more speed again.
   
Future instruction set: AVX-512
Author: Agner Date: 2013-10-26 01:47
Elhardt wrote:
AVX512 will include reciprocal estimates that are accurate to 2^-28.

AVX512 will have instructions for calculating reciprocals and reciprocal square roots with a precision of 2^-14. A subsequent AVX512ER extension has reciprocals and reciprocal square roots with a precision of 2^-28, and an exponential function with a precision of 2^-23.

   
Future instruction set: AVX-512
Author: Agner Date: 2014-10-08 11:02
Agner wrote:
The 512-bit registers can do vector operations on 32-bit and 64-bit signed and unsigned integers and single and double precision floats, but unfortunately not on 8-bit and 16-bit integers.
The latest update of Intel's manual specifies a future instruction set named AVX512BW which has vectors of 32 16-bit integers or 64 8-bit integers. See software.intel.com/en-us/intel-isa-extensions.

The AVX512 instruction set will be divided into several subsets: AVX512BW for vector instructions with 8-bit (Byte) and 16-bit (Word) granularity; AVX512DQ for 32-bit (Dword or float) and 64-bit (Qword or double) granularity; AVX512VL for the same instructions with 128 bit and 256 bit total vector length; and various other subsets.

The Skylake processor, planned for 2015, will probably support all these subsets, while the Knights Landing many-core processor will not support the BW subset, according to this announcement: software.intel.com/en-us/blogs/additional-avx-512-instructions.

A 512-bit vector with 8-bit granularity will have 64 elements and require 64-bit mask registers. The mask registers are officially 64-bit architectural registers, according to the manual. It is not clear what architectural means, but it usually means something that is guaranteed to be supported in future processors. This raises the question about the possibility of future extensions. If future extensions to 1024 or 2048 bit vectors will support 8-bit and 16-bit granularity then the mask registers must be bigger so that they can no longer communicate nicely with the 64-bit general purpose registers. If there will be future extensions of the vector size at all, either they will have only 32-bit and 64-bit granularity, or the mask registers will have to be redesigned.

   
AVX512 Instruction Timing for Knights Landing
Author: Jorcy Neto Date: 2016-06-21 13:23
Now that Knights Landing is officially within Intel's lineup, I'm feeling quite curious about the performance of the first AVX512 VPUs.
Maybe, since AVX512 behaves much like an alias for the original IMCI ISA, the timing measurements for KNL's VPUs (except for the doubled throughput) wouldn't differ much from those of the KNC VPU.

Ref: "Test-driving Intel Xeon Phi": https://research.spec.org/icpe_proceedings/2014/p137.pdf -> It mentions the use of a very similar method to the one used for the x86 instruction timing tables. One interesting thing to notice is that even the vector logical instructions have at least a 2-cycle latency, probably due to the vector mask stage.

   
AVX512 Instruction Timing for Knights Landing
Author: Agner Date: 2016-06-22 11:25
Jorcy Neto wrote:
Maybe, since AVX512 behaves much like an alias for the original IMCI ISA, the timing measurements for KNL's VPUs (except for the doubled throughput) wouldn't differ much from those of the KNC VPU.
Knights Landing is expected to be much better than Knights Corner. It has a very different microarchitecture.
   
AVX512 Instruction Timing for Knights Landing
Author: Jorcy Neto Date: 2016-06-23 05:50
Agner wrote:
Jorcy Neto wrote:
Maybe, since AVX512 behaves much like an alias for the original IMCI ISA, the timing measurements for KNL's VPUs (except for the doubled throughput) wouldn't differ much from those of the KNC VPU.
Knights Landing is expected to be much better than Knights Corner. It has a very different microarchitecture.
Indeed, the tighter integration of the VPUs with the Silvermont core frontend will improve the overall pipeline latency; however, few details were disclosed about the VPU itself. I was also wondering about legacy SIMD performance, as only a single VPU is used, as opposed to the typical Port 0 + Port 1 configuration on mainstream Xeon processors (NHM/SNB/HSW/SKL).

www.hotchips.org/wp-content/uploads/hc_archives/hc27/HC27.25-Tuesday-Epub/HC27.25.70-Processors-Epub/HC27.25.710-Knights-Landing-Sodani-Intel.pdf

   
AVX512 Instruction Timing for Knights Landing
Author: Jorcy Neto Date: 2016-08-30 06:46
Well, I've found that James Reinders, formerly of Intel, has actually made available some of the most relevant (though not very detailed) KNL instruction timings in his latest book: "Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition".

https://books.google.com.br/books?id=DDpUCwAAQBAJ&pg=PA124&lpg=PA124&dq=instruction+latency+tables&source=bl&ots=YiJvez2v8H&sig=GCy_nTol6YS_8EpsIhRO9yJW_Is&hl=pt-BR&sa=X&ved=0ahUKEwidy774henOAhVMGZAKHR9VAj8Q6AEIZDAI#v=onepage&q=instruction%20latency%20tables&f=false

I've just bought my full copy from Elsevier and I haven't made it through the whole book yet, but Chapter 6 (uArch Optimization Advice) has caught most of my attention so far.

   
Future “vector+SIMD” extensions over AVX-512
Author: Jorcy Neto Date: 2016-11-18 11:06
Earlier this month, Professor McCalpin posted some reviews on his blog ( sites.utexas.edu/jdm4372/2016/11/05/intel-discloses-vectorsimd-instructions-for-future-processors/ ) about the Oct/2016 release of Intel’s Instruction Extensions Programming Reference ( https://software.intel.com/sites/default/files/managed/26/40/319433-026.pdf ), which has now disclosed a few new "vector+SIMD" instructions, as he calls them, since they can operate on consecutive SIMD registers, i.e. multiple operations that are both simultaneous (SIMD) and consecutive (vector).

An example of a DGEMM was given using the new V4FMADDPS instruction, which performs 4 consecutive multiply-accumulate operations with a single 512-bit accumulator register, four different (consecutively-numbered) 512-bit input registers, and four consecutive 32-bit values from memory.

   
Future “vector+SIMD” extensions over AVX-512
Author: Agner Date: 2016-11-18 11:42
McCalpin is surprised because this will "break the fundamental architectural paradigm", and so am I. It will be quite complicated to implement in hardware. I have wondered what the purpose of these instructions was, and McCalpin seems to have the answer. If these instructions are implemented in a successor of Knights Landing, they should be single micro-op because the Knights Landing has poor performance of microcode.

The new instructions are supported in my disassembler (objconv), but I am not sure about the assembly notation.

   
Future “vector+SIMD” extensions over AVX-512
Author: Jorcy Neto Date: 2017-06-21 12:50
Agner wrote:
McCalpin is surprised because this will "break the fundamental architectural paradigm", and so am I. It will be quite complicated to implement in hardware. I have wondered what the purpose of these instructions was, and McCalpin seems to have the answer. If these instructions are implemented in a successor of Knights Landing, they should be single micro-op because the Knights Landing has poor performance of microcode.

The new instructions are supported in my disassembler (objconv), but I am not sure about the assembly notation.

The AVX-512-4VNNIW (Vector Neural Network Instructions Word variable precision) subset, which will first appear on Knights Mill, seems also to extend VDPP(S/D) into the same vector+SIMD philosophy, although using words/doublewords (very DSP-like in its use of a higher precision accumulator) instead of single/double.

https://en.wikipedia.org/wiki/AVX-512#New_instructions_in_AVX-512_4FMAPS_and_4VNNIW

   
Future “vector+SIMD” extensions over AVX-512
Author: Jorcy Neto Date: 2017-06-26 09:24
The following presentation gives some further detail on QFMA and QVNNI, and also speculates about future Knights Mill performance, which is supposed to have a 2-cycle throughput for a single QFMA on each port (see page 10).

https://indico.cern.ch/event/595059/contributions/2499304/attachments/1430242/2196659/Intel_and_ML_Talk_HansPabst.pdf

   
Future “vector+SIMD” extensions over AVX-512
Author: Jorcy Neto Date: 2017-08-24 09:05
Guess for Knights Mill, we'll see an unusual 0.5x DP vs. 4x SP/INT32 "improvement" scheme due to QVNNI.

www.anandtech.com/show/11741/hot-chips-intel-knights-mill-live-blog-445pm-pt-1145pm-utc

   
Future instruction set: AVX-512
Author: - Date: 2017-10-20 05:54
Looks like Intel updated their manual to add a few more extensions for upcoming Icelake processors (probably releasing around 2019): https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf

Icelake will supposedly support:

* AVX512_VNNI: Vector Neural Network Instructions
* AVX512_VBMI2: Additional AVX512 Vector Bit Manipulation Instructions
* AVX512_VPOPCNTDQ: Vector POPCNT
* AVX512_BITALG: Support for VPOPCNT[B,W] and VPSHUFBITQMB
* AVX512+VAES: Vector AES
* AVX512+GFNI: Galois Field New Instructions
* AVX512+VPCLMULQDQ: Carry-Less Multiplication Quadword