Stop the instruction set war
Author: Agner Fog | Date: 2009-12-05 10:43
There is an almost invisible war going on between Intel and AMD. It's the game of who is defining the new additions to the x86 instruction set. This war has been going on behind the scenes for years without being noticed by the majority of IT professionals. Most programmers don't care what is going on at the machine code level, so they can't see all the ridiculous consequences that this war has. Those working with virtualization may have noticed that Intel and AMD processors are incompatible when it comes to virtualization software, but this is only one of the more visible consequences of the conflict.
Some important battles
Traditionally, Intel has been the market leader, defining the instruction set for each new generation of microprocessors: 8086, 80186, 80286, 80386, etc. Each new instruction set is a superset of the previous one so that the backwards compatibility is maintained.
Intel's main competitor, AMD, has tried several times to gain the lead by defining their own extensions to the x86 instruction set. In 1998, AMD was the first to introduce Single-Instruction-Multiple-Data (SIMD) instructions in their so-called 3DNow instruction set. Intel never supported the 3DNow instructions. Instead, they introduced the SSE instruction set a few years later. SSE does essentially the same thing as 3DNow, but with a larger register size. Clearly, Intel had won and AMD had to support SSE because it was better than 3DNow.
In 2001, Intel launched their first 64-bit processor named Itanium with a new parallel instruction set. Instead of accepting the new Itanium instruction set, AMD developed their own 64-bit instruction set which - unlike the Itanium - was backwards compatible with the x86 instruction set. The market favored the backwards compatibility so AMD won this time and Intel had to support the AMD64, or x86-64, instruction set in their next processor.
The next important battle is going on right now. It's about instructions with more than two operands. The
industry has recognized a need for fused multiply-and-add instructions (e.g.:
Can our software deal with incompatible CPUs?
Software programmers may expect the compilers and software libraries to take care of all the intricacies of instruction sets for them. And the obvious way to deal with incompatible instruction sets is to make multiple branches of the code. Ideally, you would have one branch of code optimized for the latest Intel instruction set, another branch for the latest AMD instruction set, and one or more branches for older CPUs with older instruction sets. The software should detect which CPU it is running on and then choose the appropriate version of the code. This is called CPU dispatching. If the compiler can put a CPU dispatching mechanism into your code then you don't have to care about incompatible instruction sets - or do you?
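As a minimal sketch of what CPU dispatching means, the toy model below picks one code branch based on a set of feature names. It is written in Python purely for illustration. All function names, the feature names and the BRANCHES table are hypothetical, and the "optimized" branches are stand-ins that do not use any special instructions. A real dispatcher would query the CPU (for example with the CPUID instruction) instead of receiving the feature set as an argument.

```python
# Toy model of CPU dispatching. All names are hypothetical, and the
# "optimized" branches are stand-ins that compute the same result.

def sum_generic(values):
    # Baseline branch that would run on any x86 CPU.
    return sum(values)

def sum_sse2(values):
    # Stand-in for a branch that would use SSE2 instructions.
    return sum(values)

def sum_sse3(values):
    # Stand-in for a branch that would use SSE3 instructions.
    return sum(values)

# Branches ordered from most to least demanding, each with the set of
# features it requires.
BRANCHES = [
    ("sse3", frozenset({"sse2", "sse3"}), sum_sse3),
    ("sse2", frozenset({"sse2"}), sum_sse2),
    ("generic", frozenset(), sum_generic),
]

def select_branch(cpu_features):
    """Return (name, function) for the first branch whose requirements are met."""
    features = frozenset(cpu_features)
    for name, required, func in BRANCHES:
        if required <= features:
            return name, func
    # Not reachable with the table above: the generic branch requires nothing.
    raise RuntimeError("no usable branch")

name, func = select_branch({"sse2"})
print(name, func([1, 2, 3]))  # sse2 6
```

The selection is done once and the chosen function is then called directly, so the cost of the check is not paid on every call. Note that every branch in the table still has to be written, tested and maintained separately.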
The only compiler I have found that has such a feature for automatic CPU dispatching is Intel's compiler. The Intel compiler can put a CPU dispatcher into your code so that it checks which instruction set (SSE, SSE2, SSE3, etc.) is supported by the CPU and chooses a branch of code that is optimized for that instruction set - but only as long as it is running on an Intel CPU! It refuses to choose the optimal branch if the CPU doesn't have the "GenuineIntel" mark, even if the non-Intel CPU is fully compatible with the optimized code. And who would want to sell a software package that works poorly on AMD and VIA processors?
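The difference between checking the supported instruction set and checking the vendor string can be illustrated with a small model. This is not the Intel compiler's actual code. The function names are hypothetical, and the fallback to a "generic" branch for non-Intel CPUs is an assumption made for the sake of the example. The strings "GenuineIntel" and "AuthenticAMD" are the vendor identifiers that these CPUs report through CPUID.

```python
# Hypothetical model of two dispatch policies. Not taken from any
# real compiler; the "generic" fallback is an assumption.

def level_by_features(features):
    """Pick the highest instruction set level the CPU reports, ignoring the vendor."""
    for level in ("sse3", "sse2", "sse"):
        if level in features:
            return level
    return "generic"

def level_vendor_gated(vendor, features):
    """Same choice, but the reported features are only used on GenuineIntel CPUs."""
    if vendor != "GenuineIntel":
        return "generic"
    return level_by_features(features)

features = {"sse", "sse2", "sse3"}
print(level_by_features(features))                    # sse3
print(level_vendor_gated("GenuineIntel", features))   # sse3
print(level_vendor_gated("AuthenticAMD", features))   # generic
```

Both CPUs in the example report exactly the same features. Only the vendor-gated policy treats them differently.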
The situation is only slightly better when it comes to software libraries. Most compilers are equipped with libraries of standard functions, or you can use third party libraries. Some of the best optimized software libraries are published by Intel, but again they are optimized for Intel processors, and some of the functions work sub-optimally or not at all on non-Intel processors. AMD also publishes software libraries, and the AMD libraries work well on Intel processors, but of course the AMD libraries don't have a code branch that is optimized for instructions that are only available on Intel processors. There are many other libraries available, but they are typically less optimized and have little or no CPU dispatching. The GNU people are beginning to build a - long overdue - CPU dispatch mechanism into the GNU C library. The GNU library is open source, and of course it must support all x86 CPUs. But this work is done mostly by an Intel guy who has his natural focus on the latest Intel instruction sets and who has so far tested his improvements mainly on Intel processors. The best optimized code branches will work on AMD and VIA processors only with a few years delay when AMD and VIA have copied the Intel instruction sets into their processors. I am not aware of any AMD people contributing to the GNU C library.
Of course, a programmer can make his own CPU dispatching, but this is a lot of work. The programmer would have to identify the most critical part of his program and divide it into multiple branches. There is no AMD compiler for Windows, so we would have to use assembly code or intrinsic functions to take advantage of AMD-specific instructions in Windows software. Each branch has to be tested separately on different computers. And the maintenance of the code will be a nightmare. Every change in the code has to be implemented in each branch separately and tested on a separate computer.
The disadvantages of CPU dispatching are clear. It makes the code bigger, and it is so costly in terms of development time and maintenance costs that it is almost never done in a way that adequately optimizes for all brands of CPUs.
The convoluted evolution of the x86 instruction set
Historically, AMD and other companies have copied almost all instructions that Intel have invented in order to maintain compatibility, but they have always lagged a few years behind because of the long development process. On the other side, Intel have never copied the instructions of other companies, except for the x86-64 instructions. For example, AMD were the first to make a prefetch instruction. When Intel made a prefetch instruction shortly after, they used a different code for essentially the same instruction, and AMD had to support the Intel code as well. Likewise, VIA/Centaur were first to make an x86 instruction for AES encryption. Several years later, Intel made a different instruction for the same purpose.
This asymmetry, which is due to Intel's market dominance, has forced software developers to use Intel instructions rather than AMD or VIA instructions when they want compatibility.
The current x86 instruction set is the result of a long evolution which has involved many short-sighted decisions and patches. An instruction is coded as one or more bytes of eight bits each. On the original 8086 processor, all instructions had a single byte indicating the type of instruction, possibly followed by one or more bytes indicating the operands (registers, memory operands, or constants). There are 2^8 = 256 possible single-byte codes, which soon turned out to be insufficient. When all 256 byte codes were used up, Intel had to discard a never-used instruction code (0F = POP CS) and use it as an escape code for 256 new two-byte codes of 0F followed by another byte. (A byte is written as two hexadecimal digits, i.e. 00 - FF.)
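The escape mechanism can be sketched in a few lines. The function below is a simplified illustration with a hypothetical name: it looks only at the opcode bytes and ignores prefix bytes, operand bytes and the later three-byte codes. The two sample encodings, 90 (NOP) and 0F A2 (CPUID), are real x86 instruction codes.

```python
# Simplified sketch: classify the opcode at the start of an instruction.
# Prefix bytes, operands and three-byte codes are deliberately ignored.

ESCAPE = 0x0F  # formerly POP CS, reused as the escape code

def opcode_map(code):
    """Return ("one-byte", opcode) or ("two-byte", second_byte)."""
    if len(code) == 0:
        raise ValueError("empty instruction")
    if code[0] == ESCAPE:
        if len(code) < 2:
            raise ValueError("truncated two-byte opcode")
        return ("two-byte", code[1])
    return ("one-byte", code[0])

print(opcode_map(bytes([0x90])))        # ('one-byte', 144)  NOP
print(opcode_map(bytes([0x0F, 0xA2])))  # ('two-byte', 162)  CPUID
```

One escape byte buys exactly one extra page of 256 codes, which is why the two-byte space could fill up in turn.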
As you may already have predicted, this new space of 256 two-byte codes eventually became filled up too. The logical thing to do now would be to sacrifice another unused code to open up another page of 256 two-byte codes. In fact, there are three undocumented instruction codes that could have been sacrificed for this purpose, but this never happened. Instead they started to make three-byte codes. The problem with discarding the undocumented codes is that these codes actually do something. Not anything important that can't be done just as well with other codes, but at least it is possible to make a program that uses the undocumented instructions. From a technical point of view, it would have been perfectly acceptable to discard the undocumented codes. These codes are not supported by any compiler or assembler. If any programmer is stupid enough to use an undocumented code, which he has no good reason to do, then he cannot expect his program to work on future processors. But the marketing logic is different. If company X makes a CPU that doesn't support the undocumented instruction codes, then company Y could make an advertising campaign saying that Y CPUs are compatible with all legacy software, X CPUs are not. The incompatible software might be old, obscure and useless pieces of code written by reckless programmers with no respect for compatibility issues, but the marketing argument would still be theoretically true.
The problem with the overcrowded instruction code space has been dealt with from time to time by several workarounds and patches. Today, there are far more than a thousand different instruction codes, and many of them use complicated combinations of escape codes, prefix bytes, and postfix bytes to distinguish the different instructions. This makes instructions longer than necessary and, more importantly, it makes the decoding of the instructions complicated.
To understand why instruction decoding is critical, we have to look at how superscalar processors work today. A modern microprocessor can execute several instructions simultaneously if it has enough execution units and if it can find enough logically independent instructions in the instruction queue. Executing three, four or five instructions simultaneously is not unusual. The limit is not the execution units, which we have plenty of, but the instruction decoder. The length of an instruction can be anywhere from one to fifteen bytes. If we want to decode several instructions simultaneously, then we have a serious problem. We have to know the length of the first instruction before we know where the second instruction begins. So we can't decode the second instruction before we have decoded the first instruction. The decoding is a serial process by nature, and it takes a lot of hardware to be able to decode multiple instructions per clock cycle. In other words, the decoding of instructions can be a serious bottleneck, and it becomes worse the more complicated the instruction codes are. The new VEX scheme makes the process a little simpler, but we still have to maintain compatibility with the complicated legacy code schemes with all their escape sequences and prefix bytes.
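The serial nature of the problem can be shown with a toy variable-length code. The length rule below is invented for this example and is much simpler than the real x86 rules: the low two bits of the first byte give the number of extra bytes. Even so, the start of each instruction can only be found after the length of the previous one is known.

```python
# Toy variable-length encoding (hypothetical, NOT real x86):
# the low two bits of the first byte give the number of extra bytes.

def toy_length(first_byte):
    """Length in bytes (1 to 4) of a toy instruction starting with first_byte."""
    return 1 + (first_byte & 0x03)

def instruction_starts(code):
    """Find where each instruction begins. Each step depends on the previous one."""
    starts = []
    pos = 0
    while pos < len(code):
        starts.append(pos)
        pos += toy_length(code[pos])  # must be known before the next start is
    return starts

code = bytes([0x01, 0xAA, 0x00, 0x03, 0x01, 0x02, 0x03, 0x02, 0x09, 0x09])
print(instruction_starts(code))  # [0, 2, 3, 7]
```

Hardware that decodes several instructions per clock cycle has to break this dependency chain somehow, for instance by guessing a length at every byte position and discarding the wrong guesses. The more complicated the length rule, the more expensive that becomes.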
Who owns the codes that are available for future instructions?
As explained above, there is a limited number of unused code bytes available for new instructions. Intel, AMD and VIA all want to use some of these codes for their new instructions. How is this conflict handled, and how are the vacant codes divided between the competing vendors? We may assume that there are negotiations going on about this, but no public information is available. We can only look at the results and try to guess what has been going on behind the scenes. Judging from which codes are actually used by each company, it looks like Intel has the upper hand in this conflict.
The 256 possible codes of the two-byte instruction code space (0F xx) are divided as follows between the three vendors:
As you can see, only a small fraction of the code space is used for instructions introduced by AMD and VIA.
It gets worse when we look at the code space defined by the VEX coding scheme. This scheme has room for 2^16 = 65536 instructions, so there is plenty of room for future instructions without adding extra prefix or suffix bytes. Yet, AMD has not used any of this code space for their new XOP instruction set. Instead, they have made another coding scheme which is very similar to the VEX scheme, but beginning with the byte 8F, where the VEX code begins with C4 or C5. We can only speculate whether the AMD engineers have asked Intel for permission to use part of the huge VEX space and got a no, or whether they have given up beforehand. All we know is that there are disadvantages to using a different coding scheme.
The bytes that follow after C4 or C5 in the VEX scheme are coded in a special ingenious way in order to avoid clashing with existing instructions. It is not possible to use exactly the same method with the XOP scheme beginning with 8F, hence there are small differences between the XOP scheme and the VEX scheme. It would have been possible to make the two schemes identical if AMD had used the initial byte 62 instead of 8F for the XOP scheme, but perhaps Intel have reserved the 62 code for future use. Arguably, it would be possible to use the codes D4 and D5 as well, though with some extra complications.
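The sketch below illustrates why the two schemes cannot use exactly the same trick. It reflects my reading of the published encoding rules for 32-bit mode and should be treated as an assumption to check against the vendor manuals, not as a decoder. The function name is hypothetical, and 64-bit mode is not covered. The legacy instructions behind C4 and C5 (LES and LDS) always take a memory operand, so the top two bits of the following byte are never 11, and VEX uses exactly that combination. The legacy instruction behind 8F (POP) can take a register operand, so the same test does not work there. Instead, XOP relies on the three-bit register field of a POP being zero, which means the low five bits of the following byte are below 8 for a POP.

```python
# Simplified 32-bit-mode sketch (assumption: based on my reading of the
# published VEX and XOP encoding rules; verify against the manuals).

def classify_32bit(code):
    """Classify the first two bytes of an instruction as VEX, XOP or legacy."""
    if len(code) < 2:
        raise ValueError("need at least two bytes")
    first, second = code[0], code[1]
    if first in (0xC4, 0xC5):
        # LES/LDS require a memory operand: top two bits are never 11.
        return "VEX" if (second & 0xC0) == 0xC0 else "LES/LDS"
    if first == 0x8F:
        # POP has register field 0 (bits 5:3), so the low five bits are < 8.
        return "XOP" if (second & 0x1F) >= 8 else "POP"
    return "other"

print(classify_32bit(bytes([0xC5, 0xF8])))  # VEX
print(classify_32bit(bytes([0xC4, 0x06])))  # LES/LDS
print(classify_32bit(bytes([0x8F, 0xE8])))  # XOP
print(classify_32bit(bytes([0x8F, 0xC0])))  # POP
```

The two branches test different bits of the second byte. That is the kind of small difference an instruction decoder has to handle when both schemes are supported.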
The small differences between Intel's VEX scheme and AMD's XOP scheme add an extra complication to the instruction decoder in the CPU. This reduces the likelihood that Intel will copy any of the XOP instructions. If it turns out that some of AMD's XOP instructions are so useful that the software industry will ask Intel to copy them, then we may fear that Intel will choose a VEX encoding for these instructions rather than making their code compatible with AMD's.
The free competition
The x86 instruction set reflects a mechanism that is typical for technical evolution in a free market. One company makes one solution, another company makes another solution, and the market forces decide which solution will be most popular. A de facto standard evolves when one solution goes out of the market and everybody adopts the other solution.
So far, so good. But the "market" for x86 instructions differs from other technical markets by the fact that all inventions are irreversible. We have seen that the microprocessor vendors keep supporting even the oldest obsolete or undocumented instructions for marketing reasons, even when the technical advantage of backwards compatibility is negligible compared to the costs. Intel keeps supporting the old undocumented instructions of the original 8086 processor, and AMD keeps supporting the 3DNow instructions that hardly any programmer uses because the market forces have replaced them with the better SSE instructions.
The costs of supporting obsolete instructions are not negligible. You need large execution units to support a large number of instructions. This means more silicon space, longer data paths, more power consumption, and slower execution.
The total number of x86 instructions is well above one thousand. One may ask whether there is a technical need for such a large number of instructions or if some instructions have been added more for marketing reasons than for technical utility.
We need an open standardization process
The free competition on the microprocessor market has certainly been good for the price and performance of CPUs, but it has not been good for the compatibility. We are in a situation where different companies are competing to invent new instructions and keeping their ideas secret from each other and from their customers as long as possible. It is clear that the problems discussed above cannot be solved optimally without some kind of regulation and coordination. We need an open standardization committee or at least some form of public deliberation to define new instructions and decide how they are coded.
The current situation with unregulated competition and secret development fails to address the following issues:
My conclusion is that we need an open standardization committee or a public forum to discuss proposed additions and changes to the x86 instruction set and define an open standard. This committee or forum should of course involve representatives from the hardware vendors as well as the software industry, engineering organizations, standardization organizations, university scientists and consumer organizations.
I think it is unlikely that Intel will voluntarily submit to such a standardization initiative because they have a competitive advantage in the current situation. Considerable pressure from outside is needed. This pressure could come from the software industry, from governments, political organizations, legal rulings, academic organizations, or from debates in public media. As a beginning, I hereby invite all interested persons to discuss these issues in various media and public forums.