Intel AVX10 & APX announcement

News and research about CPU microarchitecture and software optimization
Vladislav_152378
Posts: 5
Joined: 2022-09-02, 19:35:26

Intel AVX10 & APX announcement

Post by Vladislav_152378 » 2023-07-30, 2:50:37

Intel has announced several new x86-64 instruction set extensions: AVX10 (AVX10.1 and AVX10.2) and the completely new APX.
The converged AVX10 ISA will include "AVX-512 vector instructions with an AVX512VL feature flag, a maximum vector register length of 256 bits, as well as eight 32-bit mask registers and new versions of 256-bit instructions supporting embedded rounding," and this version will run on both P-cores and E-cores.

However, the E-cores will be limited to the converged AVX10's maximum 256-bit vector length, while the P-cores can use 512-bit vectors. This feels akin to Arm's support for variable vector widths with SVE.
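
To make the converged 256-bit level concrete, here is a minimal sketch of a loop written with 256-bit AVX-512VL intrinsics (build with -mavx512f -mavx512vl). Assuming AVX10/256 keeps the AVX512VL semantics described above, the same source should serve both core types; only the choice of 256-bit versus 512-bit width differs.

    #include <immintrin.h>
    #include <stddef.h>

    /* 256-bit AVX-512VL style loop: the tail is handled with a mask
       register instead of a scalar remainder loop. */
    void add_arrays(float *dst, const float *a, const float *b, size_t n)
    {
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
        }
        if (i < n) {
            __mmask8 k = (__mmask8)((1u << (n - i)) - 1);  /* mask for the remaining lanes */
            __m256 va = _mm256_maskz_loadu_ps(k, a + i);
            __m256 vb = _mm256_maskz_loadu_ps(k, b + i);
            _mm256_mask_storeu_ps(dst + i, k, _mm256_add_ps(va, vb));
        }
    }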

Chips arriving after Granite Rapids will support AVX10.2, which adds support for the converged 256-bit vector lengths and other new features, like new AI data types and conversions, data movement optimizations, and standards support. All future Xeon processors will continue fully supporting all AVX-512 instructions to ensure that legacy apps function normally. Intel will freeze the AVX-512 ISA when AVX10 debuts, and all future use of AVX-512 instructions will occur through the AVX10 ISA. To address developer feedback (obviously negative), Intel also plans to significantly simplify its AVX10 enumeration methods.

Intel also announced the new APX (Advanced Performance Extensions) today (not to be confused with the old-school iAPX 432). Intel claims APX-compiled code contains 10% fewer loads and 20% fewer stores than the same code compiled for an Intel 64 baseline. Intel also says that register accesses are both faster and consume significantly less dynamic power than complex load and store operations. Interestingly, APX repurposes, for its own XSAVE state, the 128-byte area that was left unused when Intel abandoned MPX back in 2019.

Intel claims it has implemented APX in such a way that it will not impact the silicon area or power consumption of the CPU core. Here are APX's top-level features:
  • 16 additional general-purpose registers (GPRs) R16–R31, also referred to as Extended GPRs (EGPRs) in Intel's documentation
  • Three-operand instruction formats with a new data destination (NDD) register for many integer instructions (illustrated in the sketch after this list)
  • Conditional ISA improvements: New conditional load, store and compare instructions, combined with an option for the compiler to suppress the status flags writes of common instructions
  • Optimized register state save/restore operations
  • A new 64-bit absolute direct jump instruction
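
As a hedged illustration of the NDD feature referenced above, here is a small C function with the corresponding instruction sequences in comments; the APX mnemonics and operand order are my guess from Intel's description, not verified assembler syntax.

    /* Today every integer ALU instruction is destructive, so the compiler
       often needs an extra register-to-register copy.  With APX the result
       can go to a new destination, possibly one of the extra registers. */
    long f(long a, long b)      /* a in rdi, b in rsi (SysV ABI) */
    {
        long t = a - b;
        return t & a;
        /* legacy x86-64:
               mov rax, rdi       ; copy, because sub is destructive
               sub rax, rsi
               and rax, rdi
           with an APX NDD form (syntax illustrative):
               sub r16, rdi, rsi  ; r16 = rdi - rsi, sources preserved
               and rax, r16, rdi                                        */
    }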

For anyone confused: AVX10.2 is code-compatible with AVX-512, but not binary-compatible, meaning developers will have to recompile AVX-512 code to run on AVX10.2 chips.

I have mixed feelings about it. On the one hand, it's good that Intel has found a way to keep supporting 512-bit vectors. On the other hand, AVX10 is just a way to bail out the Alder Lake chimera. Why would any other CPU vendor want to adopt this ISA extension? AMD already has AVX-512 up and running without any tricks. No one except Intel is interested in solving Intel's problems, and AVX10.2 lacks any innovations others would want to adopt.

Take the variable vector width comparison from Tom's Hardware, for example. In ARM's SVE a vector can vary between 128 and 2048 bits. In Intel's AVX10, all we are promised is that a 512-bit vector won't crash on a core with a 256-bit FPU, nothing more. We have no promise that in ten years a 1024-bit AVX1024 vector won't crash on a 512-bit FPU, and so we'll have to reinvent the wheel yet again.

APX arguably deserves more positive feedback. The claim about "no additional silicon area or power consumption" is particularly interesting. Up to 10% fewer loads and 20% fewer stores at the cost of zero transistors and one compiler flag? Sounds almost like magic. Even after reading Intel's paper, though, I didn't get a clear picture of how it will (or will not) overlap with all the existing AVX machinery we already have, so I have no estimate of how useful this extension will really be.

What do you guys think?

References
1. Intel's New AVX10 Brings AVX-512 Capabilities to E-Cores
https://www.tomshardware.com/news/intel ... to-e-cores
2. Intel Unveils AVX10 and APX Instruction Sets: Unifying AVX-512 For Hybrid Architectures
https://www.anandtech.com/show/18975/in ... hitectures
3. Introducing Intel® Advanced Performance Extensions (Intel® APX)
https://www.intel.com/content/www/us/en ... s-apx.html

agner
Site Admin
Posts: 75
Joined: 2019-12-27, 18:56:25

Re: Intel AVX10 & APX announcement

Post by agner » 2023-07-30, 6:41:13

Thanks for the links.

As far as I can see from the manuals, the future AVX10.2 processors will be binary compatible with existing AVX512 code. You only have to recompile the code if you want to use the extra registers and new instructions. The advantages of the new features are limited, so I don't expect many software developers to make an AVX10.2 version of their code.

The new registers and features are coded by using three previously unused bits in the EVEX prefix. This includes one bit that indicated instructions for the now obsolete first-generation Xeon Phi (Knights Corner, the predecessor of the many-core Knights Landing). This means that the Knights Corner instruction set is dead. Emulators and disassemblers will be unable to support it unless some feature flag is enabled.

The AVX10.2 extensions are adding yet another set of patches to the already very messy instruction set. Some bits are repurposed and have different meanings in different contexts. A new prefix named REX2 is added to give access to 16 extra general-purpose registers. REX2 uses the byte value of an obsolete instruction (named AAD) that is not available in 64-bit mode. There are many such instructions, so there is plenty of space for new prefixes in the future.
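
For illustration, here are some of the one-byte opcodes that are invalid in 64-bit mode and therefore available for reuse (values from the standard one-byte opcode map; 0xD5 is the AAD byte taken by REX2):

    /* One-byte opcodes that #UD in 64-bit mode and are thus free for reuse. */
    static const unsigned char freed_opcodes[] = {
        0x06, 0x07, 0x0E, 0x16, 0x17, 0x1E, 0x1F,  /* push/pop of segment registers  */
        0x27, 0x2F, 0x37, 0x3F,                    /* DAA, DAS, AAA, AAS             */
        0x60, 0x61,                                /* PUSHA, POPA                    */
        0x62,                                      /* BOUND, already reused as EVEX  */
        0xC4, 0xC5,                                /* LES, LDS, reused as VEX        */
        0xCE,                                      /* INTO                           */
        0xD4, 0xD5,                                /* AAM, AAD; 0xD5 becomes REX2    */
        0xD6                                       /* SALC (undocumented)            */
    };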

The EVEX prefix gets more messy because of poor foresight. When Intel added the extra vector registers ZMM16-ZMM31, they could have prepared for future extra general purpose registers coded in the same bits, but they used these bits for other purposes so that a new patch is needed today.

The decoding of instructions is already so complicated that it is a bottleneck in current processors. That's why they need a micro-operations cache to store decoded instructions. The new REX2 prefix only adds to the complexity.

...
Posts: 4
Joined: 2021-10-04, 11:30:57

Re: Intel AVX10 & APX announcement

Post by ... » 2023-07-30, 10:32:37

Responding to "What do you guys think?":
Vladislav_152378 wrote:
2023-07-30, 2:50:37
Why would any other CPU vendor want to adopt this ISA extension?
AFAIK AVX10.1 is just Sapphire Rapids' level AVX-512 renamed, with some new CPUID bits. I don't see why AMD (assuming they adopt FP16) would choose not to support it.

AVX10.2 doesn't seem to change all that much over AVX10.1, so it doesn't sound like much of a burden on AMD. They still get to benefit from supporting AVX10/512, whilst Intel E-cores will be stuck at AVX10/256.
Vladislav_152378 wrote:
2023-07-30, 2:50:37
We have no promise that in ten years a 1024-bit AVX1024 vector won't crash on a 512-bit FPU, and so we'll have to reinvent the wheel yet again.
Who knows what other innovations will arise in 10 years' time...
AVX brought three-operand syntax plus instructions defined to zero the untouched upper part of the vector register (allowing wider vectors later).
AVX-512 brought 16 more vector registers, mask registers, embedded rounding, etc.
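
For instance, mask registers and embedded rounding look like this in intrinsics form (a minimal AVX512F sketch):

    #include <immintrin.h>

    /* keep: one bit per lane; lanes with a 0 bit are zeroed in the result */
    __m512 masked_sum(__m512 a, __m512 b, __mmask16 keep)
    {
        /* the rounding mode is encoded in the instruction itself
           (embedded rounding) instead of being taken from MXCSR */
        __m512 s = _mm512_add_round_ps(a, b,
                        _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
        return _mm512_maskz_mov_ps(keep, s);
    }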

If Intel had decided to go with variable-length vectors when they came up with VEX, they still would've needed developers to 'rewrite' for EVEX, lest they abandon the new AVX-512 capabilities.
SVE/2 might support scaling vector width (at the expense of being more difficult to program), but that doesn't automatically allow existing SVE code to adopt new paradigms.
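
For reference, a vector-length-agnostic SVE loop looks roughly like this (a minimal sketch with ACLE intrinsics; the same binary runs at whatever vector width the hardware implements):

    #include <arm_sve.h>
    #include <stddef.h>
    #include <stdint.h>

    /* The predicate and the loop increment come from the hardware at run
       time, so the code never hard-codes a vector width. */
    void add_arrays_sve(float *dst, const float *a, const float *b, size_t n)
    {
        for (uint64_t i = 0; i < n; i += svcntw()) {        /* svcntw() = 32-bit lanes per vector */
            svbool_t    pg = svwhilelt_b32(i, (uint64_t)n); /* all-true, partial on the last pass */
            svfloat32_t va = svld1_f32(pg, a + i);
            svfloat32_t vb = svld1_f32(pg, b + i);
            svst1_f32(pg, dst + i, svadd_f32_x(pg, va, vb));
        }
    }
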
Vladislav_152378 wrote:
2023-07-30, 2:50:37
APX arguably deserves more positive feedback.
I feel it may have been more interesting if they didn't try so hard to shoehorn it into x86-64, and instead took a bolder step and focused more on an encoding to address x86's key shortcoming of decoder complexity.
Even if APX can achieve 10% better performance in typical scenarios (which sounds doubtful), I doubt many developers would go to great lengths for it, meaning that one would most likely just target a whole binary at APX (rather than trying to do function-level runtime detection), just as one targets a whole binary at x86-32 or x86-64 today.
APX does enable mixing with plain x86-64 libraries, allows a library to selectively use APX, and is likely useful for JITs, but I feel it's too incremental a step.
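
For the record, function-level dispatch does exist in today's compilers, though few projects bother with it outside a handful of hot kernels. A minimal sketch using GCC/Clang's target_clones, with AVX2 as a stand-in since there is no APX compiler target yet:

    #include <stddef.h>

    /* The compiler emits one clone per listed target plus a resolver that
       picks the best one at load time. */
    __attribute__((target_clones("avx2", "default")))
    void scale(float *x, size_t n, float s)
    {
        for (size_t i = 0; i < n; i++)
            x[i] *= s;          /* auto-vectorized differently per clone */
    }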

agner
Site Admin
Posts: 75
Joined: 2019-12-27, 18:56:25

Re: Intel AVX10 & APX announcement

Post by agner » 2023-07-30, 11:54:04

We have no promise that in ten years a 1024-bit AVX1024 vector won't crash on a 512-bit FPU, and so we'll have to reinvent the wheel yet again.
The EVEX prefix used by AVX-512 and AVX10 has space for extensions to 1024-bit vectors, but not 2048.
SVE/2 might support scaling vector width
ARM SVE/2 can vary vector length between 128 bits and 2048 bits, at 128-bit increments. I don't see any reason to adopt this in x86 since you can just mask off the unused part of a vector when saving it.
I feel it may have been more interesting if they didn't try so hard to shoehorn it into x86-64, and instead took a bolder step and focused more on an encoding to address x86's key shortcoming of decoder complexity.
Both Intel and AMD have a long history of making small additions to the instruction set for marketing reasons, giving customers a reason to buy the next microprocessor version. AVX10 looks like progress, but few software developers will care to use it because it is costly to maintain yet another branch of their software.

If Intel made a new architecture without backward compatibility they would have to make processors that support both instruction sets, or they would quickly be outcompeted by ARM. They have tried that before with the Itanium, and failed.

If you want a new architecture without all the disadvantages of x86, look at my experimental instruction set ForwardCom. It lets you extend vector lengths without recompiling.

Vladislav_152378
Posts: 5
Joined: 2022-09-02, 19:35:26

Re: Intel AVX10 & APX announcement

Post by Vladislav_152378 » 2023-07-30, 14:44:01

... wrote:
2023-07-30, 10:32:37
I don't see why AMD (assuming they adopt FP16) would choose not to support it. AVX10.2 doesn't seem to change all that much over AVX10.1, so it doesn't sound like much of a burden on AMD.
Agner mentioned decoder complexity issues. According to rumors, one of AMD's focuses in Zen 5 is a wider decoder (it is mentioned here and here, for example). So AVX10 could potentially conflict with the work AMD's engineers are doing on improving the decoder. It's not hard to see why AMD might prefer better performance over compatibility with Intel.
... wrote:
2023-07-30, 10:32:37
I feel it may have been more interesting if they didn't try so hard to shoehorn it into x86-64, and instead took a bolder step and focused more on an encoding to address x86's key shortcoming of decoder complexity.
How would they do that? Is it even possible without dropping legacy support? The only dubious idea that comes to my mind is having two decoders to simulate a fixed-length instruction architecture: the main decoder handles fixed-length instructions, a smaller decoder handles legacy variable-length instructions, both write to a shared uop cache, and a simple switch directs legacy and non-legacy instructions to the corresponding decoder. My assumption is that a fixed instruction length would boost decoding speed enough to be worth it. Yet I acknowledge that the very idea of two separate decoders is just wild.
agner wrote:
2023-07-30, 11:54:04
The EVEX prefix used by AVX-512 and AVX10 has space for extensions to 1024-bit vectors, but not 2048.
Does that mean they could have made a promise like "we guarantee that a 1024-bit vector won't crash on a 256/512-bit FPU"? I wouldn't care about 2048-bit vectors since a cache line is only 1024-bit. However, AVX1024 is something we could realistically get around 2030, which is not too far away to start preparing for.

...
Posts: 4
Joined: 2021-10-04, 11:30:57

Re: Intel AVX10 & APX announcement

Post by ... » 2023-07-31, 10:24:05

agner wrote:
2023-07-30, 11:54:04
ARM SVE/2 can vary vector length between 128 bits and 2048 bits, at 128-bit increments. I don't see any reason to adopt this in x86 since you can just mask off the unused part of a vector when saving it.
Might be worth pointing out that ARM has published an errata (see C215) which restricts SVE vector widths to a power of two (IMO supporting non-pow2 widths was kinda silly), so you can't have a 384-bit vector, for example.
I don't get what you mean by masking off unused parts of a vector on x86 - it's not really the same thing as what SVE does.
agner wrote:
2023-07-30, 11:54:04
They have tried that before with the Itanium, and failed.
Itanium isn't something I'm familiar with, but my understanding is that it's completely different to x86, and x86 code needed to be emulated. My (likely silly) idea would avoid emulation.
Vladislav_152378 wrote:
2023-07-30, 14:44:01
Agner mentioned decoder complexity issues. According to rumors, one of AMD's focuses in Zen 5 is a wider decoder.
AFAIK AVX10.1 requires no changes to the decoder. AVX10.2 should require minimal changes, which should have minimal impact on complexity (if any at all).
APX, on the other hand, does add decoder complexity.

I have a suspicion that there's confusion between AVX10 and APX. The two are entirely separate extensions that have little relation to each other.
Vladislav_152378 wrote:
2023-07-30, 14:44:01
How would they do that? Is it even possible without dropping legacy support? The only dubious idea that comes to my mind is having two decoders to simulate a fixed-length instruction architecture.
I'm not a hardware engineer, so I can't answer that, but I'll point out that many ARM CPUs can execute both AArch32 and AArch64, which are different instruction sets.
Presumably this is done by having two decode units, which can be switched for the respective execution modes.
Vladislav_152378 wrote:
2023-07-30, 14:44:01
since a cache line is only 1024-bit
Cache lines are typically 512-bit on current x86 CPUs.

agner
Site Admin
Posts: 75
Joined: 2019-12-27, 18:56:25

Re: Intel AVX10 & APX announcement

Post by agner » 2023-07-31, 13:14:35

APX, on the other hand, does add decoder complexity.
x86 up to and including AVX-512 already has 15-18 different prefixes, depending on how you count. APX adds just one more prefix (REX2) and extends the number of uses of an existing one (EVEX). This is just an incremental increase in complexity. It should be possible to squeeze this into existing hardware designs. The bottleneck is calculating the length of each instruction, because this is a fundamentally serial process. Decoding what the bits mean is not critical because this can be done in parallel.
Might be worth pointing out that ARM has published an errata (see C215) which restricts SVE vector widths to a power of two (IMO supporting non-pow2 widths was kinda silly)
That's good news. You need power-of-2 sizes for efficient alignment.
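
For example, rounding a buffer size up to a multiple of the vector width is a single add and AND when the width is a power of two:

    #include <stddef.h>

    /* Round n up to a multiple of vecsize; only valid when vecsize is a
       power of two. */
    static inline size_t round_up(size_t n, size_t vecsize)
    {
        return (n + vecsize - 1) & ~(vecsize - 1);
    }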

Vladislav_152378
Posts: 5
Joined: 2022-09-02, 19:35:26

Re: Intel AVX10 & APX announcement

Post by Vladislav_152378 » 2023-07-31, 14:31:51

... wrote:
2023-07-31, 10:24:05
Cache lines are typically 512-bit on current x86 CPUs.
Thanks for the correction. James Reinders from Intel mentioned that implementing AVX vectors wider than a cache line would be considerably more challenging. So 512 bits is the long-term boundary then, which means my initial criticism is less valid.

... wrote:
2023-07-31, 10:24:05
many ARM CPUs can execute both AArch32 and AArch64, which are different instruction sets. Presumably this is done by having two decode units
Yes, it sounds pretty similar to my (or, apparently, not my) idea. Very intriguing, and it doesn't require emulation either.
