Agner`s CPU blog

Proposal for an ideal extensible instruction set

Author: Agner

Date: 2016-01-12 07:18

Thank you for all the useful comments, ideas and references. This gives me reason to discuss several possible improvements:

Vector length register

The idea of using an extra register to specify the length of a vector is excellent. This makes it possible to change the length of a vector at run time and to use a length that is not a power of 2. We can still have a 3-bit vector length field in the instruction code, but the meaning will be as follows. A value of 0 means a scalar. A value of 1 - 7 means that the length is specified by one of the registers r9 - r15. It is the responsibility of the compiler (or assembly programmer) to make sure that the specified length does not exceed the maximum length supported by the actual CPU. This will make the hardware a little more complicated, but it will be a great advantage to the software.

The common system with fixed vector sizes has a big problem when vectorizing a loop through an array. Often, the programmer or compiler does not know in advance whether the array size is a multiple of the vector size or not. Therefore, it is necessary to make extra "remainder" code in the end to handle any remaining array elements that don't fit into the vector registers. Another big problem is that there must be different versions of the code for microprocessors with different maximum vector sizes if you want optimal performance.

A system that allows variable vector sizes can use the maximum vector size supported by the microprocessor for all but the last iteration of the loop and then use a smaller vector size for the last iteration if necessary. The performance of such a loop can be further improved if we make a special instruction that finds the vector size as the smallest of the two values: the remaining number of array elements and the maximum vector size. This is easier to implement in software than masking off the rest of the vector.

It will also be possible to make library functions that have vectors of variable length as parameters and result.

It will not be too complicated to implement a variable vector size in hardware. The hardware can simply mask off the unused part of the vector register. There may be a problem with power consumption. A processor can save power when handling a vector of less-than-maximum size by either clock gating the unused part or by turning off power to the unused circuits. If the vector size is specified by an extra register then the information about the actual size of the vector will be available at a rather late stage in the pipeline and the CPU will have less time to adjust its power consumption.

Another problem is the extra dependencies. An instruction can have up to three input operands, a mask register and a vector length register. This gives a total of five input dependencies. The out-of-order scheduling system becomes more complicated if it has to handle this many dependencies.

Nevertheless, I am sure that the gains in software efficiency more than outweighs the hardware costs of supporting a variable vector length.

Alignment of vectors

It is easier for the hardware to write and read vectors to/from memory if the vector size is a power of 2 and the memory operand is aligned to an address that is divisible by the size of the vector. Various systems have different requirements for the alignment of vectors in memory. The requirements for alignment makes the software more complicated. The stack is always the preferred place to store local data in a function. There are two common ways of aligning data on a stack to a value higher than the stack word size:

Keep the stack aligned by the required size and propagate this alignment through the chain of function calls. A function must insert unused space on the stack before calling another function, if needed, in order to keep the stack aligned. The problems with this method are: it wastes stack space; it requires extra instructions in functions even when the alignment is not actually used; it does not take into account future systems with bigger vectors requiring higher alignment; and it may fail if different parts of the code have been compiled for different alignments.
Adjust the stack frame to the required alignment inside any function that needs aligned storage. This requires more instructions than method 1, but only when alignment is actually needed. A further disadvantage is that is uses an extra register for addressing the aligned stack frame or for saving the stack pointer.

Both methods are in use. The current x86-64 systems use method 1 for 16 bytes alignment and method 2 for 32 bytes alignment. Method 2 is probably preferable in a system with plenty of registers.

A third possibility is to modify the hardware so that no alignment is required. Current state-of-the-art microprocessors have no performance penalty for misaligned memory operands, except when a cache line boundary is crossed. I am not in a position to weigh the hardware costs of handling unaligned memory operands against the software costs of aligning data, but this would be the optimal solution if the hardware costs are not too high.

Whatever solution we end up with, there are certain things that should preferably be coordinated: The alignment of memory operands, the stack alignment, the minimum vector size that must be supported, and the required alignment for arrays. If the hardware can support vectors aligned by 8 bytes efficiently, then we may decide the rule that vectors in memory must be aligned by 8 and that all arrays containing at least 8 bytes must be aligned by 8 so that they can be handled in vector registers.

Should vectors and scalars use the same register set?

Two commentators have argued that it is better to have separate registers for scalars and vectors, rather than just one register set as I first proposed. The scalar registers can be saved in a way that is sure to be compatible with future extensions because their size is fixed. This will make it possible to have rules for callee-save registers, i.e. registers that a function must save and restore if they are used. For example, we may decide that scalar registers r24-r31 have callee-save status. Such registers are useful for saving data across a function call. Vector registers cannot have callee-save status because it is difficult to save a vector register in a way that is compatible with future extensions of the vector size. Instead, we will use information stored in object files and static libraries in order to know which vector registers are modified by a function as explained in my first post.

While we may split the registers into vector registers and scalar registers, I don't want a further split into integer and floating point registers. In the case of vectors, there are many instructions that can be applied equally well to integer and floating point data, such as read, write, move, broadcast, blend, permute, gather and scatter. Masks generated with integer instructions can be applied to floating point data. Integer instructions can be used for manipulating floating point values, e.g. manipulating the sign bit, manipulating NAN and INF values, splitting a floating point value into exponent and mantissa, etc.

The same instructions can still be used for both scalars and vectors. A zero in the vector length field of an instruction can indicate that a scalar register is used instead of a vector register. A vector register can still be used for scalar operations if the vector length register contains the value 1. If the same instruction can be used with vector registers and scalar registers, then it follows that the scalar registers can handle both integer and floating point values, just as the vector registers. The only exception is that 128-bit integers and quadruple precision floating point numbers, if supported, can only be handled in vector registers.

The predicate/mask field can indicate either a predicate in a scalar register or a mask in a vector register, depending on whether the vector length field indicates a scalar or a vector. The vector length register is always a scalar register. Instructions that cannot be used with vectors do not need a vector length field.

Function calling conventions

The function calling convention and the order of parameters on the stack needs further discussion. First, we need to decide if the stack should grow downwards, as is common today, or upwards, which would be more logical. The tradition of making the stack grow downwards originates from non-protected operating systems where global data and heap grow up from the bottom of the RAM space while the stack grows downwards from the top until the two meet and the memory is full. This is less relevant in modern systems with virtual memory addresses and multiple threads with each their own stack. I have no compelling reason to prefer the stack to grow upwards or downwards, so let's defer this question. For now, I will assume that the stack grows downwards.

I have proposed that parameters on the stack should be stored in C order. This means that the first parameters are stored at the lowest addresses on a downwards-growing stack. If parameters are stored with push operations, then the last parameter must be pushed first when C order is used. Today, it is common to allocate stack space for the parameters first, and then store the parameters into this stack space, rather than using push instructions. The opposite order, which I will call Pascal order for lack of a better term, has the first parameter pushed first. Pascal order is more logical for assembly programmers if push instructions are used, while C order is more logical if the stack frame method is used because it has the first parameters at the lowest address.

The C parameter order was invented in order to facilitate function calls where the number of parameters was not specified, and in particular for functions with varargs, i.e. a variable number of parameters. I want to propose a completely different solution for varargs. Instead of putting varargs parameters on the stack or in registers, I propose to put them in a list which can be stored anywhere in memory, and only transfer a pointer to this list as a parameter. This has several advantages: No stack space is needed if the pointer can be transferred in a register. The length of the parameter list can be modified at run time. The parameter list can be reused for multiple function calls. And a function can easily forward its parameter list to another function.

If we allow the use of the first 24 scalar registers and the first 24 vector registers for function parameters, then we can have a function call with up to 48 parameters without saving any parameters on the stack. Varargs lists will not use the stack either if a pointer to a list is used. Now the order of parameters on the stack certainly becomes less important. Nobody will use assembly language to call a function with so many parameters, and few compilers will use push instructions, so the argument about the order of push instructions is irrelevant. However, there are still two arguments for using the C order: If the stack grows downwards then the C order will have the first parameter at the lowest address, which is more logical for both caller and callee. The second argument is that the first parameter will be closest the the address pointed to by the stack pointer. If the caller and callee disagree on the number of parameters because they are relying on different versions of a function prototype, or whatever, then the C order will make errors in the last parameters only while the Pascal order will have errors in all the stack parameters. The C order will therefore make it easier to locate such an error.

Another issue is whether the caller or the callee should clear the stack. I will argue that it is safer to let the caller clear the stack. If the callee clears the stack and the caller and callee happen to disagree on the amount of stack space used, then the stack will be corrupted and the system is likely to crash. But the system is not guaranteed to crash in this case. All kinds of unpredictable and disastrous things can happen. It is possible, for example, that a function pointer has been stored on a stack space where the caller expects to find a saved return address after the stack has been corrupted. This will cause it to continue in the pointed-to function and the result will be unpredictable.

Efficiency is not an issue here because the stack clearing rule is relevant only when calling a function with at least 25 parameters, and possibly 49. This is a very rare situation anyway. The adjustment of the stack pointer requires only a single instruction, which is likely to execute out of order. The impact on performance is zero or at least insignificant for such a large function.

The programming language is irrelevant here. These arguments apply to all programming languages. The efficiency of tail calls is not affected by who clears the stack. A tail call will save one stack clearing in both cases.

We need a rule for parameters that do not fit into any type of register. I will propose that such parameters should be stored in memory and a pointer to the parameter should be transferred as a parameter. The function is allowed to modify the data that the pointer points to, unless the parameter is declared const. It is the responsibility of the caller to call any copy constructor or destructor of the parameter. It is an open issue whether a function should be allowed to modify a parameter in a varargs list.

Link register

Some systems store the return address of a function on the stack, while other systems use a link register to hold the return address. The advantage of a link register is that a leaf function can be called without storing anything on the stack. This saves cache bandwidth in programs with many leaf function calls. The disadvantage is that every non-leaf function needs to save the link register on the stack before calling another function and restoring the leaf register before returning.

RISC-V specifies that the link register is one of the general purpose registers. I will argue that if a link register is used then it should be a special register. A link register does not need to support all the things that a general purpose register can do. If the link register is included as a general purpose register then it will be tempting for a programmer to save it to another general purpose register rather than the stack, and then end the function by jumping to that general purpose register. This will work, of course, but it will interfere with the way returns are predicted. The branch prediction mechanism in modern microprocessors use a special mechanism for predicting returns, which is different from the mechanism used for predicting other jumps and branches. This mechanism, which is called a return stack buffer, is a small rolling cache that remembers the addresses of the last calls. If a function returns by a jump to another register than the link register then it will use the wrong prediction mechanism, and this will cause severe delays due to misprediction of the subsequent series of returns.

The only instructions that are needed for the link register other than call and return, are push and pop. We can reduce the number of instructions in non-leaf functions by making a combined instruction for "push link register and then call a function" which can be used for the first function call in a non-leaf function, and another instruction for "pop link register and then return" to end a non-leaf function.

Tail calls will be equally efficient with and without a link register.

High level language support for vectors

C/C++ compilers often have support for vector registers. This includes system-specific types, e.g. __m256 for a 256-bit vector of 8 single-precision floats. If we want high-level language support for a system with variable length vector registers then we must have a system-specific addition to the programming language that defines such vectors, e.g. __vector<float,8> for a vector register of 8 floats. These types can be used as function parameters and function returns. The definition of a function with a vector parameter of a certain size implies that the CPU must support vectors of the specified size. Therefore, it is possible to transfer such a vector parameter in a single register and return such a vector in a single register regardless of its size.

It will also be useful to have a way of specifying vectors of variable size in a high-level language where the size is specified in a separate parameter.

Software pipelining

Support for software pipelining was proposed. This requires a rolling bank of registers that can be allocated to a loop. Software pipelining can improve the performance of complex loops in cases where out-of-order scheduling would otherwise be a bottleneck. Support for software pipelining would be a complicated addition to the architecture, and it would be irrelevant for the quite common cases where performance is limited by memory bandwidth, cache performance, etc. Some of the mechanisms are patented, which could be an obstacle to including it in an open standard.

I think that we should allow experimentation in this area and be prepared for a future extension of the standard with optional support for software pipelining.

Further discussion topics

RISC-V proposes a 128-bit addressing mode. I don't understand what such a huge address space can be used for. 64-bit addressing gives us more than 10¹⁹ bytes of address space. This is more than we will need in any computer in a foreseeable future. Some have argued that 128-bit addresses can be useful in CPU clusters and clouds. But I don't think that a CPU at one node in a cluster should be allowed to directly address a RAM cell at a different node. 128-bit addresses would mean that every entry on the stack would use 16 bytes, most of which would be zero. This would be a waste of precious cache space. The system would certainly be simpler if the only allowed addressing mode is 64 bits.

McCalpin proposes a reconfigurable register space. This would add a lot of complication to the instruction set as well as to the hardware. I am suspecting that the hardware necessary for supporting register reconfiguration would take up more silicon space than simply allowing all the registers to have full size.

Reply To This Message

Previous Message

Next Message