Introduction
An
instruction set is a standardized set of machine instructions that a
computer can run.
There
are many instruction sets in use. To introduce a new instruction set
is not an easy thing to do because it breaks the compatibility with
existing software and hardware. Therefore, the successful
introduction of a new instruction set is a rare occurrence in the
evolution of computer technology, while extensions to existing
instruction sets occurs frequently. Some commonly used instruction
sets are poorly designed from the beginning and amended with many
extensions and patches. One of the worst cases is the widely used x86
instruction set family. This instruction set is the result of a long
series of short-sighted extensions and patches. The result of this
development is a very complicated code system which is very difficult
and costly to decode in a microprocessor. We need to learn from past
failures in order to be prepared to make better choices from the
start, in case the opportunity to design a new instruction set should
come up. The purpose of this article is to construct an example of a
new instruction set that is better designed from the start, based on
the experience we have with existing instruction sets. The following
principles are important to have in mind:
The
instruction set should have a simple and consistent design.
The
instruction set should represent a suitable compromise between the
RISC principle that enables fast decoding, and the CISC principle
that makes more efficient use of code cache resources.
The
design should be expandable so that new instructions and extensions
can be added in a consistent and predictable way.
The
instruction set should be designed through an open process with the
participation of the international hardware and software community.
The
instruction set should be non-proprietary and allow anybody to make
compatible software, hardware and equipment for test, debugging and
emulation.
Decisions
about design and extensions should not be determined by the short
term marketing considerations of an oligopolistic microprocessor
industry but by the long term needs of the entire hardware and
software community and NGOs.
The design should allow appliction-specific extensions.
The
problems with the x86 instruction set are discussed in my blog
article Stop
the instruction set war.
See also
Krste Asanović and David Patterson: "The Case for Open
Instruction Sets. Open ISA Would Enable Free Competition in Processor
Design". Microprocessor Report, August 18, 2014.
Basic architecture
Instruction format
A
pure RISC instruction set has the advantage that all instructions
have the same length. This makes it easy to decode multiple
instructions in parallel. But it has the disadvantage of using a lot
of precious space in the code cache. A CISC instruction format can
have a variable instruction length. The well-known x86 format allows
instructions of any length from 1 to 15 bytes. This makes the code
more compact, but it is very complicated and expensive to decode. It
is difficult for the microprocessor to decode multiple instructions
in parallel because it needs to find the length of the first
instruction before it knows where the second instruction begins, and
the instruction length is determined by a complicated algorithm
involving many elements of the instruction. Instruction decoding is
therefore often a serious bottleneck.
The
proposed instruction format is a compromise between these two
principles. Instructions can have a few standardized lengths and
formats, and the determination of the length is simple. This allows
for smaller instructions to save size, and longer instructions when
there is a need for more bits for address, data, registers or extra
options. Many instructions exist in multiple versions with different
sizes. The instruction format is completely orthogonal in the sense that the
same instruction can be specified with register, memory or immediate
operands, different integer sizes, different floating point
precisions, different vector lengths, and different addressing modes.
An
instruction can use one, two or three dwords of 32 bits each - that
is 32, 64 or 96 bits. No other sizes are permitted. Instructions must
be aligned to dword addresses. The first two bits (most significant
bits) of the first dword of an in struction indicates the length:
00 = 1 dword
01 = 1 dword
10 = 2 dwords
11 = 3 dwords
In
order to further save space in the code cache, there can be certain
instructions which do multiple operations in a single short
instruction, such as:
Set
multiple registers to the same constant value (typically zero). The
first register and the number of registers is specified.
Read
multiple registers from consecutive memory addresses. Optionally
increment pointer by the size. Can use the stack pointer or any
other pointer register.
Save
multiple registers to consecutive memory addresses. Optionally
increment or decrement pointer by the size. Can use the stack
pointer or any other pointer register.
Combined
arithmetic operation and conditional jump.
A
dword of all zeroes is a nop (no operation). The processor is allowed
to skip nop's as fast as it can. These nop's can be used as fillers,
but not as timing delays.
Registers
There
are 32 universal registers, named r0 - r31. The proposed instruction
set has only one type of registers. These registers can be
used for all types of data: Boolean, 8-, 16-, 32-, 64- and
(optionally) 128-bit signed and unsigned integers, floating point
numbers with single, double and (optionally) quadruple precision,
pointers, references, flags and predication masks. This reduces the
number of different instructions because the same instruction can be
used on different types of data, and because no instructions are
needed for transferring data from one type of register to another.
For example, the same 'AND' instruction can be used for operations on
Booleans, for manipulating bits in integers, for manipulating the
sign bit of floating point numbers, and for manipulating predication
masks.
The
same registers can also be used as vectors of any of these data
types. The microprocessor must support vectors of at least 128 bits.
Support for larger sizes is optional. Vector sizes up to 8192 bits
can be specified by 3 bits in the instruction code. It is also
possible to specify the largest available vector size in an
instruction. This can be anything from 128 bits and up, with no upper
limit. Software can take advantage of future extensions by specifying
the largest available vector size. The largest size can be modified by a
control register to any power of 2 from 128 to the largest size
supported by the microprocessor.
The
unused part of a register is always set to zero whenever a register
is modified. No instruction leaves part of a register unchanged
except for instructions intended for blending or interleaving data.
This is important in order to avoid false dependencies on the
previous value of the full register, which is known to cause serious
performance problems in some existing processors (known as partial
register stall). The processor does not actually need to spend power
on setting all the superfluous bits to zero. Typically, it will
simply turn off the unused parts of execution units and data buses in
order to save power.
Stack
There
is one stack. The stack register is r31. Including the stack register
as one of the universal registers makes it possible to use it as a
base pointer in memory addressing and to modify the stack frame with
arithmetic instructions. The stack register needs only be 64 bits.
Instruction pointer
The
instruction pointer is 64 bits. It is not included in the universal
registers. The reason for this decision is to avoid the possible
modification of the instruction pointer by arithmetic instructions,
which would make branch prediction difficult.
Flags
There
is no dedicated flags register. Registers r1 - r7 can be used as
predicate registers or mask registers. Many instructions can be
predicated. A predicated non-vector instruction will use one of these
registers as predicate, and execute the instruction only when bit 0
in the predicate register is 1. The predicate register is thus also a
Boolean variable. Execution is unconditional when r0 is specified as
the predicate register.
The
predication mechanism can be vectorized. A predicate vector is also
known as a mask. A masked vector instruction works in the following
way. Each element in the vector is processed only if the
corresponding element in the mask register has 1 in its least
significant bit. The mask register is treated as a vector of
Booleans, where each element in the Boolean vector has the same
number of bits as the data vectors in the instruction, and only the
least significant bit in each Boolean vector element is used, while
the remaining bits are ignored. (Other systems use the most
significant bit, or all bits, in the mask, but it is preferred to use
the least significant bit for the sake of compatibility between
Boolean scalars and Boolean vectors). Results that are masked off are
either unchanged or set to zero, depending on the instruction. Some
instructions support both options to be selected with a feature bit.
Instructions
for extended precision arithmetic, such as add-with-carry and
subtract-with borrow work in the following way. One register is
specified in the predicate register field of the instruction code.
Bit 0 of this register is used as both carry-in and carry-out. The
traditional arithmetic flags are written to a few bits of the
predicate register:
bit 0: carry flag
bit 1: zero flag
bit 2: overflow of signed arithmetic
bit 3: sign bit
bit 4: negative = sign xor overflow
If
the predicate register for an add-with-carry instruction is specified
as r0 then the carry-in will be 0, but the arithmetic flags for the
result will be written to r0. Shift and rotate instructions can
output a carry to a predicate register, but may not have a carry-in.
There are no rotate-through-carry instructions, but an add-with-carry
of a register to itself can be used as a rotate-left-through-carry
(Rotate through carry instructions are rarely used anyway, and they
are very inefficient on many processors). Integer and floating point
compare instructions also produce these flags.
The
carry mechanism can be vectorized so that multiple add-with-carry
operations can be executed in parallel.
Branches
Branching
is done with combined arithmetic-and-branch instructions. These are
ALU instructions such as add, subtract, compare, bit test, etc.
combined with a conditional jump, for example: subtract and jump if
not zero, compare and jump if above, test a specific bit and jump if
it is zero. These instructions cannot be vectorized. The vector size
field is used as condition code. There is no need to support
predicated jump instructions because these can be replaced by a
combined bit test and conditional jump instruction. Multiway branches
can be implemented with indirect jump or indirect call.
Debug and interrupt flags
There
are various control registers which can be used for debugging
purposes, interrupt control, etc.
Addressing modes
The
address space uses 64-bit addresses only. Addresses are always
relative to the instruction pointer, stack pointer or a register
pointer. Absolute addressing does not need not be supported. The
following addressing modes are supported:
Instruction
pointer + 32 bit sign-extended offset
Instruction
pointer + index register + 32 bit sign-extended offset
Base
register + 8 or 32 bit sign-extended offset
Base
register + index register + 32 bit sign-extended offset
Base
register + scaled index register + 32 bit sign-extended offset
The
size of data operands or vector elements is always specified in the
instuction. This size is used as a scale factor which is applied to
all 8-bit offsets. For example, if the operand size is 32 bits = 4
bytes, then any 8-bit offset is multiplied by 4. 32-bit offsets are
never scaled. The index register can also be scaled by the operand
size.
Direct
conditional and unconditional jumps and calls are always relative to
the instruction pointer with 8-bit or 32-bit sign-extended offset,
scaled by 4 because all instructions are aligned to addresses
divisible by 4.
CPUID
A
CPUID instruction must have functions for telling whether optional
features are supported, e.g. 128-bit integers, quadruple precision
floating pont, and the maximum vector size for each type of operands.
There should also be features for telling how efficient certain
instructions are, to help software determine the optional coding
version.
Proposed code structure
An
instruction code contains a combination of the fields described
below, where some of the fields can be omitted. The total size of all
the fields must be 32, 64 or 96 bits.
Instruction length: 2 bits.
00 = 1 dword = 32 bits
01 = 1 dword = 32 bits
10 = 2 dwords = 64 bits
11 = 3 dwords = 96 bits
Instruction format: 2 or more bits.
Each
combination of the instruction length and instruction format bits
defines a class of instructions having a particular combination of
the remaining fields. In other words, the combination of instruction
length and instruction format bits determines which of the following
fields are present, and their sizes.
Instruction code: 6 or more bits.
These
bits are used for distinguishing the individual instructions, such as
add, move, jump, etc. The number of instruction code bits is simply
the number of bits not used for anything else. Therefore, the number
of instruction code bits can be different for different instruction
formats. The instruction bits are not necessarily contiguous
if the placement of other fields on fixed positions has higher
priority in the design.
Register: 1 - 4 fields of 5 bits each.
Used for both operand registers, base pointer and index register.
Predicate register: 3 bits.
Specifies
a register used for predicated scalar instructions, masked vector
instructions, and flags input and output. Only r1 - r7 can be used as
predicate register. r0 means no predicate.
Operand size and type: 3 bits.
Defines
the type, size and precision of operands, integer or floating point.
The size for integer operands can be 8, 16, 32, 64, and optionally
128 bits. The precision of floating point operands can be single,
double, and optionally quadruple precision. Half precision is not
supported, except in conversion instructions.
Vector length: 3 bits.
Specifies
the length of vectors in bits. Possible values are: scalar, 128, 256,
512, 1024, 2048, 4096, and max. Support for values above 128 are
optional. The size of operands is as determined by the operand
size/type when "scalar" is specified. A vector will contain
as many elements of the specified operand size as can be contained in
the vector size. For example, a 256 bit vector can contain 8 elements
of 32 bits each. The "max" specification gives the largest
vector size supported by the processor. This value depends on the
processor and must be a power of 2. The minimum allowed value is 128,
with no upper limit. The max value may be different for different
operand sizes. A piece of software can take advantage of future
extensions by specifying the max vector size. The max value can be
reduced by settings in a control register.
Addressing modes: 2 bits.
The
following addressing modes are defined for memory operands. An
instruction can have no more than one explicit memory operand with
this specifiation.
00: IP + index + 32 bits offset (specify r31 for no index)
01: base + 8 or 32 bits offset
10: base + index + 8 or 32 bits offset
11: base + index * operand size + 8 or 32 bits offset
The
base and index registers are specified in register fields. The offset
size (8 or 32 bits) depends on the instruction format. 8-bit offsets
are always multiplied by the specified operand size in bytes. For
example, an operand size of 32 bits = 4 bytes will multiply the value
in the offset field by 4. 32-bit offsets are not multiplied by this
factor. Offsets are always sign-extended. It is not required to
support 16-bit or 64-bit offsets or absolute addressing.
Jumps
and calls have an offset of 8 or 32 bits relative to the instruction
pointer. This offset is multiplied by 4 because all instructions have
sizes that are multiples of 4 bytes.
Address offset: 8 or 32 bits.
This
field is used as specified above under addressing mode.
Immediate data operand: 8, 32 or 64 bits.
An
8-bit immediate value is interpreted as an integer, sign extended to
the specified operand size. The signed value is converted to floating
point if a floating point operation is specified.
A
32-bit or 64-bit immediate is interpreted as an integer for integer
operations or a single or double precision float for floating point operations. The integer
immediate constant is sign-extended if necessary. The floating point
immediate constant is converted to the desired precision if
necessary.
16-bit
immediates are not necessarily supported.
Rounding mode: 2 bits.
Optionally
specifies the rounding mode used in floating point operations and
conversions. Possible values are: round to nearest or even, round
down, round up, truncate towards zero. The default value if there is
no rounding mode field is "round to nearest or even". This
option field is useful in float-to-integer conversion instructions,
but rarely needed in other contexts. May be included in long versions
of floating point instructions.
Exception control: 1 bit.
Enables
or suppresses interrupts in case of numerical errors. This can be
used for controlling exceptions in case of overflow and other errors
in floating point operations. Can also be used for checking for
overflow in integer arithmetic. An unsigned integer compare
instruction with exception enabled can be used for checking if an
array index is out of bounds. This feature may be included in long
versions of arithmetic instructions.
Broadcast: 1 bit.
If
1, specifies that the last source operand is a scalar to be broadcast
into all the vector elements. (Unnecessary when this is an immediate
operand).
Zero masked data: 1 bit.
Specifies
whether masked-out elements are set to zero or left unchanged. This
bit may replace the broadcast bit on instructions with only register
operands, or it may be a separate bit.
Other fields.
Other
fields may be added if specific features are needed. Alternatively,
an immediate data field may be used for specifying additional feature
options.
Formats.
Commonly
used instructions may be implemented in several different formats and
instruction lengths, preferably with the same value in the
instruction code bits. For example, an addition instruction, A = B +
C, might be implemented in the following formats:
3 registers (1 dword).
2 registers and a predicate (1 dword).
1 register and a predicate and an 8-bit immediate (1 dword).
1 register and a memory operand with base and 8 bit offset, scalar
only (1 dword).
2 registers and a predicate and a 32-bit immediate (2 dwords).
1 register and a memory operand with base and 32 bit offset (2
dwords).
2 registers and a predicate and a memory operand with base and index
and 32 bit offset (3 dwords).
2 registers and a predicate and a 64-bit immediate (3 dwords).
The
destination and the first source operand share the same register in
some of these cases. The operand size and vector length bits can be
used for specifying integer or floating point operands of different
sizes and precisions in scalars or vectors of different sizes. In
other words, the different variants of this instruction can be used
for adding a register variable, a memory variable, or a constant to
integers of any size as well as floating point variables of any
precision in scalars and vectors of any size.
Combined
ALU and conditional jump instructions may be implemented in the
following formats:
2 registers of 32-bit integers only, a condition code and an 8 bit
displacement (1 dword).
2 registers, a condition code and a 32 bit displacement (2 dwords).
1 register, an 8 bit immediate, a condition code and a 32 bit
displacement (2 dwords).
These
instructions have no vector length field but a condition code
instead. Floating point operands are not necessarily supported.
3-input
instructions, such as fused multiply-and-add may be implemented in
the following formats:
3 registers and a predicate and option bits (2 dwords).
2 registers and a predicate and a memory operand with base and index
and 32-bit offset and option bits (3 dwords).
The
destination register is the same as one of the source operand
registers. The options include 4 bits for specifying sign change for
even and odd vector elements of the addend and for even and odd
vector elements of the product, respectively. This will cover all
possible combinations of multiplication with addition, subtraction
and alternating add/subtract in a single instruction.
Less
commonly used instructions may be implemented in just one or a few
different formats.
FPGA
The
microprocessor can have an optional FPGA or similar programmable
hardware. This structure is used for making application-specific
instructions or functions, e.g. for coding, encryption, data
compression, signal processing, text processing, etc. This reduces
the need for hard-coded application-specific instructions.
If
the processor has multiple CPU cores then each core may have its own
FPGA. The hardware definition code is stored in its own cache for
each core. The operating system should prevent, as far as possible,
that the same core is used for different tasks that require different
hardware codes. If this cannot be avoided then the code, as well as
the contents of any memory cells in the FPGA, must be saved on each
task switch. This saving may be implemented as lazy, i.e. the swap of
contents is only made when the second task needs the FPGA structure
that contains code for the first task.
Extensibility
The
evolution of the x86 instruction set is full of short-sighted
decisions with no room for future extensions. All kind of weird
patches have been used to extend an instruction set that was never
designed to be extensible. We must learn from these mistakes and
consider the predictable need for future extensions when designing an
instruction set.
There
is reason to suspect that many of the instructions in the current x86
instruction set have been added with short-sighted marketing reasons
in mind. Every new generation of microprocessors must have some new
features that the competitors don't have, or new features that can be
hyped to make customers buy the new version, according to the
marketing logic. Some of these instructions are now obsolete, but
still supported by the hardware.
The
design of a stable instruction set should not be subject to
competition and marketing whims, but designed by an open process with
participation of the international hardware and software community,
similar to the standardization work in other technological areas. A
collective decision process reduces the risk of mistakes and
short-sighted decisions.
The
proposed instruction set is orthogonal, which reduces the number of
different instructions. The inclusion of an FPGA reduces the need for
application-specific instructions.
In
addition to these considerations, it is necessary to add space for
future extensions of the instruction set. Some of the instruction
formats should have a large number of unused instruction code bits
that can be used for future instructions or option bits. A few
instruction format codes should be reserved for future extensions.
All codes that begin with 111, i.e. 11 in the instruction length bits
and 1 in the first bit of the instruction format field, should be
reserved for future extensions. These bits could be used in the
future either for more 3-dword formats with many instruction bits, or
for 4-dword formats. This decision will be left to the future.
An
attempt to execute an instruction with an unknown instruction code
should cause an interrupt (exception) in most cases. This makes it
possible to emulate new instructions on old microprocessors. In some
cases, however, it is desired to make extensions that do not cause
interrupts on microprocessors that don't support them. Historically,
this has been done with extensions that affect performance, but not
functionality, such as memory prefetching and branch prediction
hints. This kind of extensions can be made by using various option
bits in contexts where they previously made no sense, for example
rounding mode bits in an integer instruction. Thus, the processor
should ignore certain unused option bits in certain instructions to
make this kind of performance extensions possible. Also, a small
range of instruction codes should be reserved for future
performance-tuning instructions, which will be ignored on processors
that don't support them. To be more specific, we will have three
categories of unused codes:
Code
reserved for future use. Generates interrupt so that it can be
emulated.
Code
reserved for future use. Generates no interrupt, but behaves as a
nop (no operation). Can be used for future purposes that allow the
code to be ignored on processors that do not support it.
Code
guaranteed to never be used. Will generate interrupt also on all
future processors. Can be used for application-specific emulation or
forced error messages.
Extending
the vector register size
Extension
of the vector register size is straightforward without the need to
define any new instructions. This makes it possible for software to
utilize a new extended vector size even without recompilation. The
software can simply specify the maximum vector size and get
information from a CPUID instruction about what this maximum vector
size is.
We
have seen in the history of x86 processors that the first processor
generation to support a new and larger vector size has typically done
so with poor efficiency. In most cases, it has used a half-size
execution unit twice to do a full-size vector operation. This was not
necessarily a bad design choice because the software that supports a
new instruction set extension typically lags several years behind the
hardware.
The
situation is different with the extension mechanism proposed here.
The software will be able to utilize a vector size extension
immediately. The microprocessor should not allow a larger vector size
than it can execute more efficiently than if software used the next
smaller vector size twice. It may still be worthwhile to allow a
vector size that is larger than the execution unit and use this unit
multiple times to process a full-size vector. This will save
bandwidth in the decoder and use fewer registers than the alternative
of repeating the instruction in software. The CPUID instruction
should provide complete information about this. This means that it
should specify both the maximum vector size that can execute at full
throughput and the maximum vector size that can execute at all. These
values may be different for different types of operands.
Portability
The
ABI, object file format, etc. should be standardized as far as
possible in order to allow the same code to be compatible with
different operating systems and platforms. This would make it
possible, for example, to use the same function libraries in
different operating systems. This can easily be achieved for
libraries that are doing some mathematical operation and not using
any system functions. A more ambitious goal is to establish
portability even when some common system functions are involved, such
as time functions or handling of multithreaded code. Most
importantly, there should be a portable way of generating error
messages from a function library. This could be obtained with an
error message instruction. This instruction should generate an
interrupt (throw an exception). A few register operands can contain
codes indicating the type of error, and a memory operand can point to
an error message text. All platforms should be able to handle this
error condition in a way that is appropriate for the type of user
interface. In console mode applications, for example, the error
message should go to the stderr output. In graphical user interface
(GUI) applications, the error message should be shown in a pop-up
window, or whatever method is appropriate for the specific GUI
framework.
Error
messages should be in the English language by default, with an
optional feature for multi-language support. We can expect the need
for multi-language support to be decreasing. The problems with
multi-language applications are discussed in this
document.
ABI
and calling conventions
This
is an example of how an efficient ABI (Application Binary Interface) can be designed.
Function
calls will use registers for parameters as far as possible. The first
24 parameters to a function are transferred in register r0 - r23. Any
additional parameters are transferred on the stack in C language
order. These parameters are removed from the stack by the caller. The
function return value is in r0. Push and pop instructions are rarely
used. The return instruction has no offset. The stack is kept aligned
by 128 before any call instruction.
Variable
argument lists are transferred on the stack with 64 bits for each
argument.
Tuples:
A structure or class or encapsulated array for which all non-static
elements have the same type is transferred in a single vector
register if the total size does not exceed 128 bits.
Parameters
that do not fit into a single register are transferred by a reference
to a memory object allocated by the caller. This applies to:
structures and classes with elements of different types, or bigger
than 128 bits, or having a non-standard copy constructor or
destructor or virtual member function. It is the responsibility of
the caller to call any copy constructor or destructor. Any parameters
beyond the first 24 parameters are transferred in the same way as if
they were in a register, using 64 or 128 bits of stack space, as
appropriate.
A
return value that does not fit into a register is transferred to a
memory location allocated by the caller. A reference to this memory
location is treated as the first parameter (before any 'this'
pointer).
There
are no registers with callee-save status in the case of static
linking. This is because the called function does not know the
maximum vector register size required by the caller. Instead, there
is a mechanism that allows the caller to know which registers are
modified by the called function. This information is stored in the
object file for static link libraries. The object file format must
support this information, which must be stored in the same file as
the library function in order to make sure it has the right version.
Compilers supporting "whole program optimization" can read
this information from a library file before allocating registers.
This
mechanism cannot be applied to dynamic linking. Instead, dynamic link
libraries are prohibited from modifying certain registers.
The
object file format should be a modified ELF format. Dynamic linking
should use the Windows DLL method rather than the UNIX shared object
method. The code uses position-independent addressing as far as
possible. Any remaining relocation is performed at load time. Symbol
imputation is not used. This eliminates the need for the inefficient
global offset table (GOT) and procedure linkage table (PLT).
Information
used for exception handling and stack unrolling should use a
standardized and portable table-based method. Debugging information
should also be standardized.
Assembly language syntax
The
syntax for x86 assembly code has never been officially standardized,
but each assembler uses its own dialect. The definition of a new
instruction set should include the definition of a standardized
assembly language syntax, preferably with the destination operand
first.
Summary
of advantages
The
instruction set proposed here is a compromise between the RISC and
CISC principles. A RISC instruction set with a fixed instruction size
makes it easy to decode multiple instructions in parallel, but it is
a vaste of precious code cache space. If the fixed instruction size
is big enough to accommodate all possible instruction types, then it
must necessarily be too big for the most common simple instructions
and therefore take up too much space in the code cache. The code
cache is a precious resource because it is impossible to make the
cache bigger without also making it slower. A CISC instruction set
with many different instruction lengths makes it difficult to decode
multiple instructions in parallel, and this can be a serious
bottleneck. The proposed instruction set has just a few standardized
instruction lengths: one, two and three dwords of 32 bits each. The
length of the instruction is determined by the first few bits. This
makes it possible to determine the lengths of multiple instructions
in a single clock cycle by a process that resembles the look-ahead
carry mechanism in binary adders.
The
instruction set is completely orthogonal. An ordinary arithmetic or
logic instruction, such as e.g. addition, can have many different
versions for different types of operands. It can handle integers of
8, 16, 32, 64, and possibly 128 bits, as well as floating point
numbers of single, double, and possibly quadruple precision. The last
source operand can be a register, a memory operand, or an immediate
constant. The same instruction can handle scalars or vectors of any
length. This makes programming simpler and reduces the number of
different instructions.
There
is only one type of register. The same registers can be used for many
different purposes, including integers and floating point numbers of
all different sizes and precisions, as well as for Booleans, flags,
pointers and references. The registers can also be used for vectors
and masks.
Many
instructions can be predicated, so that the instruction is either
executed or not, depending on a Boolean variable stored in a
predicate register. The predicate mechanism can be vectorized, so
that the operation is turned on or off for each element in a vector,
depending on a mask register containing a vector of Booleans.
Instructions
can have short versions that save space by using only two operands,
by omitting certain option bits, by using an 8-bit scaled offset in a
memory operand, or by using a signed 8-bit constant as the immediate
operand. For example, a double precision floating point addition can
have immediate operands of three different sizes: a signed 8-bit
integer which will be converted to floating point, a single precision
float, or a double precision float. This constant is available at an
early stage in the CPU pipeline so that there is enough time for
converting it to the necessary size and precision without delaying
the execution. The need to fetch numeric constants from data memory
is eliminated in most cases because numeric constants can be
contained in the instructions. This will increase the speed and
reduce the load on the data cache. In most cases, the code size will
not be increased by the inclusion of numeric constants in the
instructions because they replace the addresses (typically 32 bits)
of constants stored in data memory, and because it will fit the
constants into smaller formats whenever possible. Immediate constants
can even be used with vector instructions where the same constant
will be applied to all elements in the vector.
The
size of the registers is not fixed in
the design. It is possible to make bigger and more powerful
microprocessors simply by making the registers bigger so that they
can hold larger vectors. This mechanism is orthogonal as well. There
are three bits in the instruction code which determines the
vector length (or a scalar). This makes it possible to write software
for future microprocessors with bigger vector registers that do not
exist yet. Setting the vector length bits to 111 will give the
largest vector size that the microprocessor supports, whatever this
is. This makes it simple to support all vector sizes in the same
piece of software. This feature also makes it possible to save an
entire register even though the maximum register size is not known
when the software is compiled. This can be useful in task switches,
exception handlers, device drivers and system libraries. There is no
limit to how big the maximum vector size can be. A CPUID instruction
will tell the software what the maximum vector size is, and there
will be a feature that enables a software program to reduce the
maximum vector size if it is excessive.
The
conventions for function calling, as well as other ABI details,
should be specified together with the instruction set. This will
improve compatibility and make it possible to use the same function
libraries with different compilers, different programming languages,
and different operating systems. There are 32 registers. This makes
it possible to use registers for function parameters in almost all
cases.
Can
existing instruction sets be fixed?
The
commonly used instruction sets can be divided into two general types,
RISC and CISC. The RISC instruction sets generally have a more or
less fixed instruction size. All instructions have the same number of
bits. The advantage of a RISC design is that the fetching and
decoding of instructions is simple and fast. The disadvantege is that
commonly used simple instructions take more space than necessary
while complicated instructions do not fit into the limited
instruction size. Instructions that need many bits for addresses or
constants do not fit into the RISC design.
A
CISC instruction set has a variable instruction length. The advantage
of this is that simple, commonly used instructions can be as small as
a single byte, while more complex instructions or instructions with
large addresses or constants can have a length that fits the purpose.
This provides optimal use of the code cache. The disadvantage is that
it is complicated to decode the instructions. Modern microprocessors
can execute three or four instructions in parallel in a single clock
cycle if no data dependence prevents this. But it is difficult to
decode multiple instructions simultaneously when you have to
determine the length of the first instruction before you know where
the second instruction begins. Therefore, the bottleneck in a CISC
processor is quite often decoding rather than execution.
The
present article has proposed a compromise between RISC and CISC. The
widely used x86 instruction set is a CISC design. Mosts other
instruction sets in common use today are RISC designs.
x86
instruction set
The
x86 instruction set has a long heritage dating back to the 8086
processor in 1978 where code density was of prime importance. It has
been developed through many generations of additions and extensions
without ever loosing backwards compatibility. It is a CISC
instruction set where instructions can have any length from 1 to 15
bytes, and it is quite complicated to determine the length of each
instruction. It has many different types of registers. The latest
extension, AVX-512 has 16 general purpose registers of 64 bits each,
6 segment registers of which only 3 are used in 64-bit mode, 8
floating point registers of 80 bits each, 8 MMX registers of 64 bits
each which are overlaid on the floating point registers, 32 vector
registers of 512 bits each, 8 mask registers of 64 bits each, a flags
register and an instruction pointer. This patchwork could certainly
need a redesign. Can it be combined with the design principles that
are proposed here?
An
easy solution would allow the two kinds of code to be used
interchangeably and mixed. The new instructions would have to use
some bit patterns that are not already in use in the old system. The
x86 instruction set has 20 byte-codes that are currently used only in
16-bit and 32-bit mode, mostly for obsolete instructions. These codes
can be used for other purposes in 64-bit mode. Therefore, it is
possible to make new extensions that can be used only in 64-bit mode.
We would prefer 64-bit mode anyway, so it would be possible to make
extensions that implement some of the principles described here and
still preserve backwards compatibility, but this would still be only
patches on a fundamentally flawed, inefficient and outdated design.
The 20 unused code bytes are scattered around the code map with only
few adjacent code bytes, so it would be impossible to use more than a
few of these code bytes without making the whole system completely
messy. Most of the bits in the first byte of any new code would
therefore be fixed and unusable in such a hypothetical new code
design.
A
better solution would be to implement a separate mode for the new
instruction design and a system for switching between the new mode
and the legacy modes. The improvement in performance that can be
obtained with a new instruction design is probably not high enough to
justify the complications of a dual code system. Instead, we should
be prepared to seize the opportunity in case the need for a major
revision should arise for other reasons. It is not possible to make a
decoder that translates the old codes to the new ones at runtime,
because the new system does not support the many different types of
registers that the old system has. A translation from the new system
to the old one is also not possible. Instead, we would need two
seperate decoders that translate the old and the new codes,
respectively, to the internal micro-operation format. This
micro-operation format probably needs to be expanded to make space
for 64-bit immediate constants, but the extra bits can be disabled
when they are not needed, in order to save power.
The
existing execution units could relatively easily be modified to
support the new code design. The 32 universal registers of the new
design should obviously be identical to the 32 vector registers of
the old design. Combined ALU-and-conditional-jump instructions are
already implemented internally in both Intel and AMD processors even
though they are not available as x86 instructions.
It
is a problem that many current processors have their execution units
divided into two main clusters: One cluster is connected to the
general purpose registers and handles integer scalar operations,
pointer addressing and jumps. The other cluster is connected to the
floating point and vector registers and handles all floating point
and vector operations. All transfers of data between these two
clusters typically involve a delay of one clock cycle. This
two-cluster design would be a problem for the new instruction set
where all units need access to the same register file.
Itanium
instruction set
The
Itanium instruction set is a very ambitions RISC instruction set.
Itanium instructions are joined into bundles with a fixed size of 128
bits, containing three instruction codes of 41 bits each and a 5-bit
template. The three instructions in a bundle will execute in
parallel. This explicit parallelism puts a lot of work on the
compiler to schedule instructions that can execute in parallel
without violating the program logic. The Itanium has a rotating
register stack where each program function allocates the number of
registers it needs. It has many other advanced features, such as
explicitly speculative instructions. The itanium design has not been
very successful, due mainly to the difficulties of making a suitable
compiler. Another obstackle to the commercial success of the Itanium
was a poor support for backwards compatibility with existing
software. The Itanium system is so different from other systems that
there would be little advantage in combining it with a new
instruction set.
Other
RISC instruction sets
Most
other commonly used instruction sets today are RISC designs. These
designs are generally simple and efficient. The instruction length is
typically 32 bits. Some systems, such as ARM-Thumb-2 and AVR32 can
use a mixture of short 16-bit instructions and longer 32-bit
instructions. Most systems have several different register types.
Some RISC processors support vector instructions with 128-bit
vectors. There is a limit to the number of different instructions
that can be coded in an instruction with a fixed size. It is a
general problem with RISC instruction sets that they cannot support
complex instructions with many option bits. This makes it difficult
to add new options and features that the x86 instruction set has,
such as masked vector operations, options for controlling rounding
mode, etc. The limited instruction size of the RISC systems is also a
problem where large addresses or large numeric constants are needed.
It is not possible to define a large numerical constant or a jump to
a distant address with a single instruction in a RISC design with a
limited instruction size. Some of the RISC processors already have
support for more than one instruction set and features for switching
between these modes. An additional mode for a new instruction set
could be added to these processors without serious problems. |