Agner's CPU blog


 
thread Stop the instruction set war - Agner Fog - 2009-12-05
replythread Stop the instruction set war - Agner Fog - 2009-12-06
last reply The instruction set war's effect on virtualization - Yuhong Bao - 2009-12-28
reply Stop the instruction set war - Agner Fog - 2009-12-15
replythread Stop the instruction set war - Norman Yarvin - 2010-01-09
last replythread Stop the instruction set war - Agner Fog - 2010-01-10
last replythread Stop the instruction set war - bitRAKE - 2010-01-12
last replythread Stop the instruction set war - Agner Fog - 2010-01-13
last reply Pentium Appendix H - Yuhong Bao - 2010-02-10
last replythread Stop the instruction set war - Agner Fog - 2010-09-25
last reply Stop the instruction set war - Agner - 2011-08-28
 
Stop the instruction set war
Author: Agner Fog Date: 2009-12-05 10:43

There is an almost invisible war going on between Intel and AMD. It's the game of who is defining the new additions to the x86 instruction set. This war has been going on behind the scenes for years without being noticed by the majority of IT professionals. Most programmers don't care what is going on at the machine code level, so they can't see all the ridiculous consequences that this war has. Those working with virtualization may have noticed that Intel and AMD processors are incompatible when it comes to virtualization software, but this is only one of the more visible consequences of the conflict.

Some important battles

Traditionally, Intel has been the market leader, defining the instruction set for each new generation of microprocessors: 8086, 80186, 80286, 80386, etc. Each new instruction set is a superset of the previous one so that the backwards compatibility is maintained.

Intel's main competitor, AMD, has tried several times to gain the lead by defining their own extensions to the x86 instruction set. In 1998, AMD was the first to introduce floating-point Single-Instruction-Multiple-Data (SIMD) instructions in their so-called 3DNow instruction set. Intel never supported the 3DNow instructions. Instead, they introduced the SSE instruction set the following year. SSE does essentially the same thing as 3DNow, but with a larger register size. Clearly, Intel had won and AMD had to support SSE because it was better than 3DNow.

In 2001, Intel launched their first 64-bit processor named Itanium with a new parallel instruction set. Instead of accepting the new Itanium instruction set, AMD developed their own 64-bit instruction set which - unlike the Itanium - was backwards compatible with the x86 instruction set. The market favored the backwards compatibility so AMD won this time and Intel had to support the AMD64, or x86-64, instruction set in their next processor.

The next important battle is going on right now. It's about instructions with more than two operands. The industry has recognized a need for fused multiply-and-add instructions (e.g. D = A*B + C) and several other instructions with more than two operands. The current coding scheme supports only instructions with two operands, so a new coding scheme has to be invented in order to support instructions with more than two operands.

AMD came first with a proposal. In August 2007, AMD announced a future instruction set called SSE5 with a new coding scheme. The early disclosure of AMD's intentions was a break with the previous policy where both companies had kept their intentions secret as long as possible. Intel's reply came in April 2008 with an early (probably premature) disclosure of their planned AVX instruction set. Intel's AVX coding scheme was much more flexible and future-oriented than AMD's SSE5 scheme, as I argued in a public discussion forum. Most importantly, the AVX scheme has room for future extensions of the size of the SIMD vector registers, while the SSE5 scheme has little room for any future extensions.

It was pretty obvious that Intel had won this time, and thanks to the early disclosure of Intel's AVX instructions, it was not too late for AMD to change their plans. In May 2009, AMD published a revision of their plans where they modified the coding scheme for better compatibility with AVX. In addition to full support for AVX, the revised AMD plan contains most of the original SSE5 instructions under the new name XOP and with the new coding scheme. Unfortunately, Intel had changed their plans in the meantime! In December 2008, Intel published a revision of their plans which involved a change of the coding of the fused multiply-and-add (FMA) instructions. Now it was too late for AMD to change their design once more, so the first AMD processors with FMA will follow the premature Intel specification rather than Intel's later revision. It is difficult to obtain compatibility when you are following a moving target.

Can our software deal with incompatible CPUs?

Software programmers may expect the compilers and software libraries to take care of all the intricacies of instruction sets for them. And the obvious way to deal with incompatible instruction sets is to make multiple branches of the code. Ideally, you would have one branch of code optimized for the latest Intel instruction set, another branch for the latest AMD instruction set, and one or more branches for older CPUs with older instruction sets. The software should detect which CPU it is running on and then choose the appropriate version of the code. This is called CPU dispatching. If the compiler can put a CPU dispatching mechanism into your code then you don't have to care about incompatible instruction sets - or do you?
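The dispatching idea described above can be sketched in a few lines. This is only an illustration of the selection logic, not how any particular compiler does it: the function names, the feature names and the feature set passed in are all hypothetical, and the three branches are stand-ins that compute the same result. In a real program the feature set would come from the CPUID instruction and the branches would be compiled for different instruction sets.

```python
# Sketch of CPU dispatching: pick the best code branch for the features
# the CPU reports. All names here are hypothetical stand-ins.

def dot_generic(a, b):
    """Baseline branch: works on any CPU."""
    return sum(x * y for x, y in zip(a, b))

def dot_sse2(a, b):
    """Stand-in for a branch that would be compiled for SSE2."""
    return sum(x * y for x, y in zip(a, b))

def dot_avx(a, b):
    """Stand-in for a branch that would be compiled for AVX."""
    return sum(x * y for x, y in zip(a, b))

# Ordered from most to least demanding: (required features, branch).
BRANCHES = [
    ({"sse2", "avx"}, dot_avx),
    ({"sse2"}, dot_sse2),
    (set(), dot_generic),
]

def select_branch(cpu_features):
    """Return the first branch whose required features are all present."""
    for required, branch in BRANCHES:
        if required <= cpu_features:
            return branch
    return dot_generic  # not reached: the empty set always matches

# The dispatcher runs once; later calls go straight to the chosen branch.
dot = select_branch({"sse2", "sse3"})  # assumed feature set for illustration
print(dot.__name__, dot([1, 2, 3], [4, 5, 6]))
```

Note that this sketch looks only at the feature set and never at the vendor name, so any CPU that reports the required features gets the corresponding branch.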

The only compiler I have found that has such a feature for automatic CPU dispatching is Intel's compiler. The Intel compiler can put a CPU dispatcher into your code so that it checks which instruction set (SSE, SSE2, SSE3, etc.) is supported by the CPU and chooses a branch of code that is optimized for that instruction set - but only as long as it is running on an Intel CPU! It refuses to choose the optimal branch if the CPU doesn't have the "GenuineIntel" mark, even if the non-Intel CPU is fully compatible with the optimized code. And who would want to sell a software package that works poorly on AMD and VIA processors?

The situation is only slightly better when it comes to software libraries. Most compilers are equipped with libraries of standard functions, or you can use third party libraries. Some of the best optimized software libraries are published by Intel, but again they are optimized for Intel processors, and some of the functions work sub-optimally or not at all on non-Intel processors. AMD also publishes software libraries, and the AMD libraries work well on Intel processors, but of course the AMD libraries don't have a code branch that is optimized for instructions that are only available on Intel processors. There are many other libraries available, but they are typically less optimized and have little or no CPU dispatching. The GNU people are beginning to build a - long overdue - CPU dispatch mechanism into the GNU C library. The GNU library is open source, and of course it must support all x86 CPUs. But this work is done mostly by an Intel guy who has his natural focus on the latest Intel instruction sets and who has so far tested his improvements mainly on Intel processors. The best optimized code branches will work on AMD and VIA processors only with a few years' delay when AMD and VIA have copied the Intel instruction sets into their processors. I am not aware of any AMD people contributing to the GNU C library.

Of course, a programmer can make his own CPU dispatching, but this is a lot of work. The programmer would have to identify the most critical part of his program and divide it into multiple branches. There is no AMD compiler for Windows, so we would have to use assembly code or intrinsic functions to take advantage of AMD-specific instructions in Windows software. Each branch has to be tested separately on different computers. And the maintenance of the code will be a nightmare. Every change in the code has to be implemented in each branch separately and tested on a separate computer.

The disadvantages of CPU dispatching are clear. It makes the code bigger, and it is so costly in terms of development time and maintenance costs that it is almost never done in a way that adequately optimizes for all brands of CPUs.

The convoluted evolution of the x86 instruction set

Historically, AMD and other companies have copied almost all instructions that Intel have invented in order to maintain compatibility, but they have always lagged a few years behind because of the long development process. On the other side, Intel have never copied the instructions of other companies, except for the x86-64 instructions. For example, AMD were the first to make a prefetch instruction. When Intel made a prefetch instruction shortly after, they used a different code for essentially the same instruction, and AMD had to support the Intel code as well. Likewise, VIA/Centaur were first to make an x86 instruction for AES encryption. Several years later, Intel made a different instruction for the same purpose.

This asymmetry, which is due to Intel's market dominance, has forced software developers to use Intel instructions rather than AMD or VIA instructions when they want compatibility.

The current x86 instruction set is the result of a long evolution which has involved many short-sighted decisions and patches. An instruction is coded as one or more bytes of eight bits each. On the original 8086 processor, all instructions had a single byte indicating the type of instruction, possibly followed by one or more bytes indicating the operands (registers, memory operands, or constants). There are 2^8 = 256 possible single-byte codes, which soon turned out to be insufficient. When all 256 byte codes were used up, Intel had to discard a never-used instruction code (0F = POP CS) and use it as an escape code for 256 new two-byte codes of 0F followed by another byte. (A byte is written as two hexadecimal digits, i.e. 00 - FF.)

As you may already have predicted, this new space of 256 two-byte codes eventually became filled up too. The logical thing to do now would be to sacrifice another unused code to open up another page of 256 two-byte codes. In fact, there are three undocumented instruction codes that could have been sacrificed for this purpose, but this never happened. Instead they started to make three-byte codes. The problem with discarding the undocumented codes is that these codes actually do something. Not anything important that can't be done just as well with other codes, but at least it is possible to make a program that uses the undocumented instructions. From a technical point of view, it would have been perfectly acceptable to discard the undocumented codes. These codes are not supported by any compiler or assembler. If any programmer is stupid enough to use an undocumented code, which he has no good reason to do, then he cannot expect his program to work on future processors. But the marketing logic is different. If company X makes a CPU that doesn't support the undocumented instruction codes, then company Y could make an advertising campaign saying that Y CPUs are compatible with all legacy software, X CPUs are not. The incompatible software might be old, obscure and useless pieces of code written by reckless programmers with no respect for compatibility issues, but the marketing argument would still be theoretically true.

The problem with the overcrowded instruction code space has been dealt with from time to time by several workarounds and patches. Today, there are far more than a thousand different instruction codes, and many of them use complicated combinations of escape codes, prefix bytes, and postfix bytes to distinguish the different instructions. This makes instructions longer than necessary and, more importantly, it makes the decoding of the instructions complicated.

To understand why instruction decoding is critical, we have to look at how superscalar processors are working today. A modern microprocessor can execute several instructions simultaneously if it has enough execution units and if it can find enough logically independent instructions in the instruction queue. Executing three, four or five instructions simultaneously is not unusual. The limit is not the execution units, which we have plenty of, but the instruction decoder. The length of an instruction can be anywhere from one to fifteen bytes. If we want to decode several instructions simultaneously, then we have a serious problem. We have to know the length of the first instruction before we know where the second instruction begins. So we can't decode the second instruction before we have decoded the first instruction. The decoding is a serial process by nature, and it takes a lot of hardware to be able to decode multiple instructions per clock cycle. In other words, the decoding of instructions can be a serious bottleneck, and it becomes worse the more complicated the instruction codes are. The new VEX scheme makes the process a little simpler, but we still have to maintain compatibility with the complicated legacy code schemes with all their escape sequences and prefix bytes.
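The serial nature of the problem can be illustrated with a toy encoding. The length rule below is made up for the illustration and has nothing to do with the real x86 rules, which are far more complicated. The point is only that the start of each instruction cannot be known until the lengths of all preceding instructions have been determined.

```python
# Toy illustration of serial length decoding. The encoding is hypothetical:
# the low two bits of the first byte give the number of extra bytes.

def toy_length(first_byte):
    """Toy rule (not x86): instruction length is 1 to 4 bytes."""
    return 1 + (first_byte & 0x03)

def instruction_starts(code):
    """Walk the byte stream; each start depends on all previous lengths."""
    starts = []
    pos = 0
    while pos < len(code):
        starts.append(pos)
        pos += toy_length(code[pos])  # cannot be skipped or reordered
    return starts

code = bytes([0x01, 0xAA, 0x00, 0x03, 0x10, 0x20, 0x30, 0x02, 0x55, 0x66])
print(instruction_starts(code))  # prints [0, 2, 3, 7]
```

A hardware decoder that wants several instructions per clock cycle has to break this dependency chain somehow, for example by guessing boundaries at many byte positions at once and throwing away the wrong guesses, which costs silicon.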

Who owns the codes that are available for future instructions?

As explained above, there is a limited number of unused code bytes available for new instructions. Intel, AMD and VIA all want to use some of these codes for their new instructions. How is this conflict handled, and how are the vacant codes divided between the competing vendors? We may assume that there are negotiations going on about this, but no public information is available. We can only look at the results and try to guess what has been going on behind the scenes. Judging from which codes are actually used by each company, it looks like Intel has the upper hand in this conflict.

The 256 possible codes of the two-byte instruction code space (0F xx) are divided as follows between the three vendors:

| Number of codes | Value after 0F | Assigned to | Used for | Subdivided |
|---|---|---|---|---|
| 2 | 0D, 0E | AMD | 3DNow | |
| 1 | 0F | AMD | 3DNow | by suffix byte |
| 4 | 24, 25, 7A, 7B | AMD | SSE5 | by another escape byte |
| 2 | A6, A7 | VIA | Instructions | by reg bits |
| 2 | 38, 3A | Intel | SSSE3, SSE4 | by another escape byte |
| 2 | 39, 3B | Intel | for future use | by another escape byte |
| 6 | 19 - 1E | reserved | hint instructions | |
| 11 | 04, 0A, 0C, 26, 27, 36, 3C, 3D, 3E, 3F, FF | | unused | |
| 226 | All other | Intel | used | |

As you can see, only a small fraction of the code space is used for instructions introduced by AMD and VIA.
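To put numbers on "a small fraction", the counts in the table can be added up per vendor. The short script below does only that; the counts are copied from the table, and the grouping of the "reserved" and "unused" rows under their own labels is my reading of the table layout.

```python
# Tally of the two-byte (0F xx) code space, using the counts from the table.
rows = [
    (2, "AMD"), (1, "AMD"), (4, "AMD"),
    (2, "VIA"),
    (2, "Intel"), (2, "Intel"),
    (6, "reserved"),
    (11, "unused"),
    (226, "Intel"),
]

totals = {}
for count, owner in rows:
    totals[owner] = totals.get(owner, 0) + count

# The rows must cover the whole 0F xx space.
assert sum(totals.values()) == 256

for owner, count in totals.items():
    print(f"{owner:8s} {count:3d} codes  {100 * count / 256:5.1f}%")
```

With these counts, AMD and VIA together hold 9 of the 256 codes, which is about 3.5 percent, while Intel holds 230.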

It gets worse when we look at the code space defined by the VEX coding scheme. This scheme has room for 2^16 = 65536 instructions, so there is plenty of room for future instructions without adding extra prefix or suffix bytes. Yet, AMD has not used any of this code space for their new XOP instruction set. Instead, they have made another coding scheme which is very similar to the VEX scheme, but beginning with the byte 8F, where the VEX code begins with C4 or C5. We can only speculate whether the AMD engineers have asked Intel for permission to use part of the huge VEX space and got a no, or whether they have given up beforehand. All we know is that there are disadvantages to using a different coding scheme.

The bytes that follow after C4 or C5 in the VEX scheme are coded in a special ingenious way in order to avoid clashing with existing instructions. It is not possible to use exactly the same method with the XOP scheme beginning with 8F, hence there are small differences between the XOP scheme and the VEX scheme. It would have been possible to make the two schemes identical if AMD had used the initial byte 62 instead of 8F for the XOP scheme, but perhaps Intel have reserved the 62 code for future use. Arguably, it would be possible to use the codes D4 and D5 as well, though with some extra complications.

The small differences between Intel's VEX scheme and AMD's XOP scheme add an extra complication to the instruction decoder in the CPU. This reduces the likelihood that Intel will copy any of the XOP instructions. If it turns out that some of AMD's XOP instructions are so useful that the software industry will ask Intel to copy them, then we may fear that Intel will choose a VEX encoding for these instructions rather than making their code compatible with AMD's.

The free competition

The x86 instruction set reflects a mechanism that is typical for technical evolution in a free market. One company makes one solution, another company makes another solution, and the market forces decide which solution will be most popular. A de facto standard evolves when one solution goes out of the market and everybody adopts the other solution.

So far, so good. But the "market" for x86 instructions differs from other technical markets by the fact that all inventions are irreversible. We have seen that the microprocessor vendors keep supporting even the oldest obsolete or undocumented instructions for marketing reasons, even when the technical advantage of backwards compatibility is negligible compared to the costs. Intel keeps supporting the old undocumented instructions of the original 8086 processor, and AMD keeps supporting the 3DNow instructions that hardly any programmer uses because the market forces have replaced them with the better SSE instructions.

The cost of supporting obsolete instructions is not negligible. You need large execution units to support a large number of instructions. This means more silicon space, longer data paths, more power consumption, and slower execution.

The total number of x86 instructions is well above one thousand. One may ask whether there is a technical need for such a large number of instructions or if some instructions have been added more for marketing reasons than for technical utility.

We need an open standardization process

The free competition on the microprocessor market has certainly been good for the price and performance of CPUs, but it has not been good for the compatibility. We are in a situation where different companies are competing to invent new instructions and keeping their ideas secret from each other and from their customers as long as possible. It is clear that the problems discussed above cannot be solved optimally without some kind of regulation and coordination. We need an open standardization committee or at least some form of public deliberation to define new instructions and decide how they are coded.

The current situation with unregulated competition and secret development fails to address the following issues:

  1. Unfair competition. The market often favors Intel instructions rather than AMD or VIA instructions for compatibility reasons. The latter companies can only copy new Intel instructions with a delay of a few years.
        

  2. AMD does not have access to a fair share of the opcode space to use for their innovations. Historically, AMD has used small corners of the opcode space to avoid the risk that Intel might assign another instruction to the same code. There is no part of the huge VEX opcode space that AMD can safely use without permission from Intel.
      

  3. Technical incompatibility. AMD, Intel and VIA are assigning different codes to identical or equivalent instructions because each keeps its innovations secret for as long as possible. It is so expensive for the software industry to make multiple versions of their software that hardly anybody does so.
       

  4. Short-sighted solutions. The history of the evolution of the x86 instruction set is full of short-sighted decisions that are sub-optimal in a long-term perspective. For example, when the vector registers were extended from MMX to XMM, there was no plan for how to handle the predictable future extension to YMM. If such a plan had been made then we wouldn't need the complexity today of having two versions of every XMM instruction (one that zero-extends into the YMM register and one that leaves the upper part of the register unchanged). A standardization committee or public forum would be more likely to include long-term planning.
      

  5. Sub-optimal solutions. Some instructions could be implemented better at no extra cost. For example, the PANDN and PALIGNR instructions would be more efficient if the two operands were swapped. A public discussion would have corrected such lapses before it was too late.
      

  6. Feedback from users is always too late. When a new instruction set is published, there is often public criticism, but then it is too late to change anything. The secrecy around innovations makes it impossible to involve the larger software community in the decision making process.
      

  7. PR considerations often have more weight than technical considerations. Currently, we have far more than a thousand instructions in the x86 instruction set. This is more than any programmer can memorize. It would be better to have fewer instructions and make each instruction more flexible so that it would cover more applications. But there is an obvious PR value in announcing that the newest processor has a bazillion new instructions. The weird and sometimes deliberately misleading names of the instruction set extensions are obviously decided by PR people rather than by technicians.
      

  8. Backwards compatibility is taken too far. Today's microprocessors are still supporting even the most obscure undocumented instructions of the first 8086 processor from thirty years ago, while operating systems sometimes fail to support software that is five years old. There is no technical reason for this, only a PR reason. The cost of supporting undocumented and obsolete instructions is actually quite high because they take up space in the overcrowded opcode map. If the undocumented codes had been eliminated then all instructions in the SSSE3 and SSE4 instruction sets would have a one-byte escape code rather than a two-byte escape code.
      

  9. Inability to declare anything obsolete. There are many things in the x86 instruction set that need to be cleaned up and sanitized, which an unregulated market is unable to do. A standardization committee could declare that standards-compliant software should not use a certain feature. Support for this feature could then be removed after e.g. ten years. For example, the x87 register stack is clearly obsolete. If the standard says, don't use x87 and MMX registers, then we could replace all x87 instructions by emulation after a number of years. It is quite costly in terms of silicon space and performance to support the x87 instructions. Some processors even have an extra stage in the pipeline only for rotating the x87 register stack.
      

  10. The evolution of the x86 instruction set has many dead ends which are never eliminated. When two different companies invent two different solutions to the same problem, then the market is likely to favor one solution while the other becomes a dead end. However, the company that introduced the solution that happened to become a dead end will keep supporting it in all future for marketing reasons, regardless of the technical costs.

My conclusion is that we need an open standardization committee or a public forum to discuss proposed additions and changes to the x86 instruction set and define an open standard. This committee or forum should of course involve representatives from the hardware vendors as well as the software industry, engineering organizations, standardization organizations, university scientists and consumer organizations.

I think it is unlikely that Intel will voluntarily submit to such a standardization initiative because they have a competitive advantage in the current situation. Considerable pressure from outside is needed. This pressure could come from the software industry, from governments, political organizations, legal rulings, academic organizations, or from debates in public media. As a beginning, I hereby invite all interested persons to discuss these issues in various media and public forums.

Links

Original discussion of these issues on AMD developer forums
AMD and Intel incompatible - What to do?
  
Discussion of AVX versus SSE5 on Aceshardware forum
Intel AVX kills AMD SSE5
  
Same on Real World Technologies forum
Intel AVX kills AMD SSE5
  
My C++ manual, discussing CPU dispatching and fixing the Intel CPU dispatcher
Optimizing software in C++
   
Stop the instruction set war
Author: Agner Fog Date: 2009-12-06 04:28
Thank you to Yuhong Bao and others for sending me information about more conflicts over the instruction code map.

The company Cyrix has used many codes on the 0F xx map for their instructions, apparently without having an agreement with Intel. See sandpile.org for a list. Many of these codes are now used for other purposes. For example the Cyrix instruction SMINT originally used the code 0F 7E. Later, Intel used the same code for the instruction MOVD, so Cyrix had to change their code for SMINT to 0F 38. Today, the latter code is also used by Intel. The Cyrix processors have been continued as AMD Geode processors where the conflicting codes are still used (including 0F 38), though they can be disabled.

Vacant codes are also needed by software producers for virtual instructions that can be emulated. Microsoft is using the code C4 C4 in Windows for such a purpose. This code now conflicts with the new VEX instructions, which is the reason why Intel had to disable VEX instructions in 16-bit real and virtual mode. Only two codes are reserved for software emulation. These are called UD1 (0F B9) and UD2 (0F 0B).

The instructions POPCNT and RDTSCP were implemented first by AMD and later copied by Intel.

[Corrections made 2009-12-07 and later thanks to Yuhong Bao and others]

   
The instruction set war's effect on virtualization
Author: Yuhong Bao Date: 2009-12-28 03:34
BTW, AnandTech mentions live VM migration as another example of where x86 CPU extensions can cause a lot of hassle.
You often have to fiddle with CPU masks to migrate across CPU generations, and even then it isn't always possible to mask all features.
But even more important is the effect on cross-vendor live VM migrations.
In fact, Red Hat and AMD demoed cross-vendor live VM migration back in 2008:
linux.slashdot.org/article.pl?sid=08/11/07/1535235
It isn't mentioned very often in the discussions, but it is important.
You see, back before AMD adopted AVX, AMD was going with SSE5 (in fact SSE4a is available already on today's Family 10h AMD processors) and Intel was going with AVX.
If cross-vendor live VM migration was to work properly, the VM would have to be crippled all the way back down to SSE3.
Even now, the FMA4 vs FMA3 war means that VMs that have to migrate between Intel Ivy Bridge processors and AMD Bulldozer processors would have no access to FMA at all.
   
Stop the instruction set war
Author: Agner Fog Date: 2009-12-15 05:53
My blog post has caused a lot of discussion on various message boards. Thanks to everybody who has contributed.
   
Stop the instruction set war
Author: Norman Yarvin Date: 2010-01-09 13:01
The x87 instruction set won't truly be obsolete until SSEx has support for floating-point formats of >64 bits. As it is, using those old instructions is a good way to get some extra precision (which can be quite valuable: rather than having to analyze a program in detail to see whether it is numerically stable, one can just re-run it with higher precision and see if the results change much.)
   
Stop the instruction set war
Author: Agner Fog Date: 2010-01-10 01:41
Norman Yarvin wrote:
The x87 instruction set won't truly be obsolete until SSEx has support for floating-point formats of >64 bits.
I agree. We need XMM instructions with 80 bits, or better 128 bits, extended floating point precision before we can eliminate x87 completely. This feature should be optional because it is expensive to implement and few users would need it. There are also a few MMX conversion instructions that we need to implement as XMM instructions before we can eliminate the MMX registers.

But Microsoft has never supported the 80 bits (long double) precision in their compiler. And the first preliminary specification for x64 Windows banned x87 and MMX. For some reason, they changed their mind and allowed x87/MMX (See my manual on calling conventions).

All this just shows that we need coordination and planning rather than each company making its own decisions.

   
Stop the instruction set war
Author: bitRAKE Date: 2010-01-12 11:49
Couldn't the instruction cache store an efficient post-decode encoding for instructions? IIRC, Intel already has a patent for doing this. Another possibility would be to completely remap the instructions to favor parallel decoding. This would support backward compatibility through a processor-external translator. Getting the cooperation (from both producers and consumers) might require a demonstration of what can be gained.

Why couldn't Transmeta succeed? Does hyper-threading make the decoder a greater target? Has Intel used instruction set changes to negatively impact competitors (i.e. publishing the second best while secretly working on the target design)? Can a fair market exist with Intel's obvious advantage (both capital and market share)? Should something so one-sided be called a war?

   
Stop the instruction set war
Author: Agner Fog Date: 2010-01-13 01:37
bitRAKE wrote:
Couldn't the instruction cache store an efficient post-decode encoding for instructions?
AMD stores instruction boundaries in the code cache in order to make decoding easier. Intel did the same in the old Pentium MMX, IIRC. I don't know why they are not doing this any more.

Another possibility would be to completely remap the instructions to favor parallel decoding.
They did that in Itanium. But emulation of x86 is too slow. The CISC instruction set, while difficult to decode, has the advantage that it takes less space in the code cache.

Has Intel used instruction set changes to negatively impact competitors (i.e. publishing the second best while secretly working on the target design)?
I don't think they have ever deliberately published suboptimal instructions. They have failed to support AMD instructions, and they have changed from FMA4 to FMA3 for unknown reasons. FMA4 is obviously better than FMA3 from the programmer's point of view. There may be technical limitations that made them change to FMA3.
   
Pentium Appendix H
Author: Yuhong Bao Date: 2010-02-10 11:46
Intel once tried to hide some of the new features of the Pentium from x86 competitors by requiring an NDA to be signed in order for info to be disclosed. It was nicknamed Appendix H because it was mentioned in Appendix H of the Pentium processor family developer's manuals. AMD was able to reverse-engineer the Pentium and offer the K5 with all of them except APIC, but Cyrix cheated and only implemented the 486 instruction set in its 6x86 and disabled the CPUID instruction by default. In the 6x86L, DE and CX8 were implemented, and in the 6x86MX, they implemented the features TSC and MSR from the Pentium and CMOV and PGE from the P6, but no PSE or VME.
When Centaur released the WinChip, it again decided not to implement PSE or VME. They also did not implement CMOV or PGE, unlike the Cyrix 6x86MX. They implemented MCE, unlike Cyrix, though. WinChip 2 added 3DNow!. Eventually Centaur was sold to VIA Technologies, and it retargeted the core to Socket 370 and the P6 bus and marketed it as the VIA C3, but the core was still virtually the same as before in features, with the only differences being that Intel's MTRRs replaced Centaur's MCR and that PGE was added. Even worse, by then, Windows 2000 had been released, in which the NTVDM crashed without VME, forcing VIA to provide a patch to NTVDM. It was only with Nehemiah that VIA finally began to really improve the core, with SSE replacing 3DNow!, and PSE and CMOV being implemented. With stepping 8 Nehemiah, VIA finally added VME, SEP, and PAT, catching up with the Pentium III.
Rise mP6 was even worse, with it only implementing TSC, CX8, and MMX.
Cyrix MediaGX implemented only 486 level features like 5x86 and 6x86, and MediaGXm implemented CX8, TSC, MSR, CMOV, and MMX. Later processors in that series of course added more features.
Transmeta was better, with the Crusoe implementing Pentium MMX features (I think) plus CMOV and later SEP.
You can also see here that the 586/686 distinction can be quite blurry, with many processors implementing only some 686 features. Even Intel's own Pentium M did not support PAE at all in the original version (luckily the option of using PAE is separate from the option of using i686 instructions in most OSes). The long NOPs that were introduced with the P6 were troublesome too, with even VIA Nehemiah not implementing them.
By now, it should be clear that Appendix H did a lot more harm than good, and it is only because of the CPU feature bits that were introduced with the CPUID instruction that software can wade through the mess. Before then, software just tested for the CPU generation, an approach that has been considered dead since the introduction of CPUID. For example, the 386 and 486 were distinguished by a test for EFLAGS.AC. Unfortunately, I read that the IBM 386SLC CPU was really a relabeled 486 with all 486 instructions, but modified so that this test detects a 386, for reasons relating to Intel licensing. The NexGen Nx586 originally implemented only 386 features; a later hypercode update allowed user-mode 486 instructions to be supported if an option was enabled, but not the kernel-mode instructions used by NT 4.0 and later, which prevented those systems from running on it. In fact, Intel did not bother creating a feature bit for the long NOPs, which means that support has to be tested manually in software using the illegal opcode exception. This is even harder in kernel mode, because Connectix/Microsoft Virtual PC, when it encounters long NOPs in kernel-mode code, pops up a fatal error that forces a reset of the virtual machine!
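To illustrate why per-feature bits work better than testing for a CPU generation, here is a minimal sketch of decoding the feature flags that CPUID leaf 1 returns in EDX. The bit positions are the standard ones for the flags named in this post; the helper names (`decode_features`, `encode_features`, `missing_features`) and the example value are made up for illustration and are not a real register dump from any of the processors mentioned.

```python
# Bit positions in EDX as returned by CPUID leaf 1 (standard feature flags).
# Only the flags discussed in this post are listed.
EDX_FEATURE_BITS = {
    "VME": 1, "DE": 2, "PSE": 3, "TSC": 4, "MSR": 5, "PAE": 6, "MCE": 7,
    "CX8": 8, "APIC": 9, "SEP": 11, "MTRR": 12, "PGE": 13, "CMOV": 15,
    "PAT": 16, "MMX": 23, "SSE": 25,
}


def decode_features(edx):
    """Return the set of known feature names whose bit is set in edx."""
    return {name for name, bit in EDX_FEATURE_BITS.items() if (edx >> bit) & 1}


def encode_features(names):
    """Build an EDX-style bit mask from feature names (for illustration only)."""
    edx = 0
    for name in names:
        edx |= 1 << EDX_FEATURE_BITS[name]
    return edx


def missing_features(edx, required):
    """Return the required features that are not advertised in edx."""
    return set(required) - decode_features(edx)


# Hypothetical EDX value assembled from the flags the post lists for the
# 6x86MX (assumption: this is not an actual CPUID dump).
example_edx = encode_features(["DE", "CX8", "TSC", "MSR", "CMOV", "PGE"])

# Software that needs CMOV, PSE and VME can ask for exactly those bits
# instead of guessing from "586" or "686".
print(sorted(missing_features(example_edx, ["CMOV", "PSE", "VME"])))
```

A check like this asks about each feature separately, which is what makes a processor with "some 686 features" manageable. It does not help for the long NOPs, since there is no bit to look at.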
I wrote this from the research I did, and I got most of the CPU features mentioned above from datasheets at datasheets.chipdb.org. If there are any errors, please correct them!
   
Stop the instruction set war
Author: Agner Fog Date: 2010-09-25 10:47
Back in December 2009 I wrote:
"If it turns out that some of AMD's XOP instructions are so useful that the software industry will ask Intel to copy them, then we may fear that Intel will choose a VEX encoding for these instructions rather than making their code compatible with AMD's."
Now they are doing exactly this. When AMD announced their planned XOP instruction set they also announced the "CVT16" instructions for supporting floating point numbers with half precision, using their XOP code prefix. The names of these instructions were VCVTPH2PS and VCVTPS2PH. Now Intel have announced two almost identical instructions with the same names, but using their own VEX code prefix. Furthermore, AMD have postponed the implementation of these instructions. Whether they have done so for the sake of compatibility with Intel's instructions, we don't know.
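For readers unfamiliar with half precision, the following sketch shows the kind of data conversion these instructions perform, using Python's `struct` module and its `e` format (IEEE 754 binary16). It models only the number format with default round-to-nearest packing, not the XOP or VEX encodings or the rounding control of the real instructions; the function names are made up for illustration.

```python
import struct


def float_to_half_bits(x):
    """Round x to half precision and return the 16-bit pattern as an int."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def half_bits_to_float(bits):
    """Expand a 16-bit half precision pattern to a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


# Round trip a few values: 1.0 and 65504.0 (the largest finite half
# precision number) survive unchanged, 0.1 loses precision.
for value in (1.0, 0.1, 65504.0):
    bits = float_to_half_bits(value)
    print(f"{value!r} -> 0x{bits:04X} -> {half_bits_to_float(bits)!r}")
```

The value 0.1, for example, is stored as 0x2E66 and comes back as 0.0999755859375, which shows why the format is useful for compact storage rather than for computation.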

If Intel had allowed AMD to use part of the huge VEX opcode space then this would not have happened. We can only speculate what is going on behind the scenes...

Link: Intel Advanced Vector Extensions Programming Reference, Aug 2010.

   
Stop the instruction set war
Author: Agner Date: 2011-08-28 08:30

Here is an update of instructions that were first announced by AMD and later copied by Intel:

| Instruction name | AMD instruction set | Intel instruction set | Compatible | Remark |
|---|---|---|---|---|
| prefetch | 3DNow | SSE | no | Intel name: prefetcht0, etc. |
| 64 bit mode | AMD64 | Intel 64 | yes | |
| rdtscp | SSE4A | (AVX) | yes | Separate CPUID bit |
| lzcnt | SSE4A | future AVX2 | yes | |
| vpshld, etc. | SSE5/XOP | future AVX2 | no | Intel name: vpsllvd, etc. |
| cvtph2ps, cvtps2ph | SSE5/XOP | future | no | |
| vfmaddps, etc. | SSE5/XOP | FMA3 | no | Both AMD and Intel have changed their codes. Final version is incompatible |

While AMD keeps copying almost all Intel instructions (except virtualization instructions) for the sake of compatibility, only a few of AMD's instructions are copied by Intel. In those cases where Intel have copied an AMD instruction that uses the XOP coding scheme, they have made an incompatible code using the VEX coding scheme.