Thank you for all the useful comments, ideas and references.
This gives me reason to discuss several possible improvements:
Vector length register
The idea of using an extra register to specify the length of a vector is
excellent. This makes it possible to change the length of a vector at run
time and to use a length that is not a power of 2. We can still have a
3-bit vector length field in the instruction code, but the meaning will be
as follows. A value of 0 means a scalar. A value of 1 - 7 means that the
length is specified by one of the registers r9 - r15. It is the
responsibility of the compiler (or assembly programmer) to make sure that
the specified length does not exceed the maximum length supported by the
actual CPU. This will make the hardware a little more complicated, but it
will be a great advantage to the software.
The common system with fixed vector sizes has a big problem when
vectorizing a loop through an array. Often, the programmer or compiler
does not know in advance whether the array size is a multiple of the
vector size or not. Therefore, it is necessary to make extra "remainder"
code in the end to handle any remaining array elements that don't fit into
the vector registers. Another big problem is that there must be different
versions of the code for microprocessors with different maximum vector
sizes if you want optimal performance.
A system that allows variable vector sizes can use the maximum vector
size supported by the microprocessor for all but the last iteration of the
loop and then use a smaller vector size for the last iteration if
necessary. The performance of such a loop can be further improved if we
make a special instruction that finds the vector size as the smallest of
the two values: the remaining number of array elements and the maximum
vector size. This is easier to implement in software than masking off the
rest of the vector.
It will also be possible to make library functions that have vectors of
variable length as parameters and result.
It will not be too complicated to implement a variable vector size in
hardware. The hardware can simply mask off the unused part of the vector
register. There may be a problem with power consumption. A processor can
save power when handling a vector of less-than-maximum size by either
clock gating the unused part or by turning off power to the unused
circuits. If the vector size is specified by an extra register then the
information about the actual size of the vector will be available at a
rather late stage in the pipeline and the CPU will have less time to
adjust its power consumption.
Another problem is the extra dependencies. An instruction can have up to
three input operands, a mask register and a vector length register. This
gives a total of five input dependencies. The out-of-order scheduling
system becomes more complicated if it has to handle this many
dependencies.
Nevertheless, I am sure that the gains in software efficiency more than
outweighs the hardware costs of supporting a variable vector length.
Alignment of vectors
It is easier for the hardware to write and read vectors to/from memory if
the vector size is a power of 2 and the memory operand is aligned to an
address that is divisible by the size of the vector. Various systems have
different requirements for the alignment of vectors in memory. The
requirements for alignment makes the software more complicated. The stack
is always the preferred place to store local data in a function. There are
two common ways of aligning data on a stack to a value higher than the
stack word size:
- Keep the stack aligned by the required size and propagate this
alignment through the chain of function calls. A function must insert
unused space on the stack before calling another function, if needed, in
order to keep the stack aligned. The problems with this method are: it
wastes stack space; it requires extra instructions in functions even
when the alignment is not actually used; it does not take into account
future systems with bigger vectors requiring higher alignment; and it
may fail if different parts of the code have been compiled for different
alignments.
- Adjust the stack frame to the required alignment inside any function
that needs aligned storage. This requires more instructions than method
1, but only when alignment is actually needed. A further disadvantage is
that is uses an extra register for addressing the aligned stack frame or
for saving the stack pointer.
Both methods are in use. The current x86-64 systems use method 1 for 16
bytes alignment and method 2 for 32 bytes alignment. Method 2 is probably
preferable in a system with plenty of registers.
A third possibility is to modify the hardware so that no alignment is
required. Current state-of-the-art microprocessors have no performance
penalty for misaligned memory operands, except when a cache line boundary
is crossed. I am not in a position to weigh the hardware costs of handling
unaligned memory operands against the software costs of aligning data, but
this would be the optimal solution if the hardware costs are not too high.
Whatever solution we end up with, there are certain things that should
preferably be coordinated: The alignment of memory operands, the stack
alignment, the minimum vector size that must be supported, and the
required alignment for arrays. If the hardware can support vectors aligned
by 8 bytes efficiently, then we may decide the rule that vectors in memory
must be aligned by 8 and that all arrays containing at least 8 bytes must
be aligned by 8 so that they can be handled in vector registers.
Should vectors and scalars use the same register set?
Two commentators have argued that it is better to have separate registers
for scalars and vectors, rather than just one register set as I first
proposed. The scalar registers can be saved in a way that is sure to be
compatible with future extensions because their size is fixed. This will
make it possible to have rules for callee-save registers, i.e. registers
that a function must save and restore if they are used. For example, we
may decide that scalar registers r24-r31 have callee-save status. Such
registers are useful for saving data across a function call. Vector
registers cannot have callee-save status because it is difficult to save a
vector register in a way that is compatible with future extensions of the
vector size. Instead, we will use information stored in object files and
static libraries in order to know which vector registers are modified by a
function as explained in my first post.
While we may split the registers into vector registers and scalar
registers, I don't want a further split into integer and floating point
registers. In the case of vectors, there are many instructions that can be
applied equally well to integer and floating point data, such as read,
write, move, broadcast, blend, permute, gather and scatter. Masks
generated with integer instructions can be applied to floating point data.
Integer instructions can be used for manipulating floating point values,
e.g. manipulating the sign bit, manipulating NAN and INF values, splitting
a floating point value into exponent and mantissa, etc.
The same instructions can still be used for both scalars and vectors. A
zero in the vector length field of an instruction can indicate that a
scalar register is used instead of a vector register. A vector register
can still be used for scalar operations if the vector length register
contains the value 1. If the same instruction can be used with vector
registers and scalar registers, then it follows that the scalar registers
can handle both integer and floating point values, just as the vector
registers. The only exception is that 128-bit integers and quadruple
precision floating point numbers, if supported, can only be handled in
vector registers.
The predicate/mask field can indicate either a predicate in a scalar
register or a mask in a vector register, depending on whether the vector
length field indicates a scalar or a vector. The vector length register is
always a scalar register. Instructions that cannot be used with vectors do
not need a vector length field.
Function calling conventions
The function calling convention and the order of parameters on the stack
needs further discussion. First, we need to decide if the stack should
grow downwards, as is common today, or upwards, which would be more
logical. The tradition of making the stack grow downwards originates from
non-protected operating systems where global data and heap grow up from
the bottom of the RAM space while the stack grows downwards from the top
until the two meet and the memory is full. This is less relevant in modern
systems with virtual memory addresses and multiple threads with each their
own stack. I have no compelling reason to prefer the stack to grow upwards
or downwards, so let's defer this question. For now, I will assume that
the stack grows downwards.
I have proposed that parameters on the stack should be stored in C order.
This means that the first parameters are stored at the lowest addresses on
a downwards-growing stack. If parameters are stored with push operations,
then the last parameter must be pushed first when C order is used. Today,
it is common to allocate stack space for the parameters first, and then
store the parameters into this stack space, rather than using push
instructions. The opposite order, which I will call Pascal order for lack
of a better term, has the first parameter pushed first. Pascal order is
more logical for assembly programmers if push instructions are used, while
C order is more logical if the stack frame method is used because it has
the first parameters at the lowest address.
The C parameter order was invented in order to facilitate function calls
where the number of parameters was not specified, and in particular for
functions with varargs, i.e. a variable number of parameters. I want to
propose a completely different solution for varargs. Instead of putting
varargs parameters on the stack or in registers, I propose to put them in
a list which can be stored anywhere in memory, and only transfer a pointer
to this list as a parameter. This has several advantages: No stack space
is needed if the pointer can be transferred in a register. The length of
the parameter list can be modified at run time. The parameter list can be
reused for multiple function calls. And a function can easily forward its
parameter list to another function.
If we allow the use of the first 24 scalar registers and the first 24
vector registers for function parameters, then we can have a function call
with up to 48 parameters without saving any parameters on the stack.
Varargs lists will not use the stack either if a pointer to a list is
used. Now the order of parameters on the stack certainly becomes less
important. Nobody will use assembly language to call a function with so
many parameters, and few compilers will use push instructions, so the
argument about the order of push instructions is irrelevant. However,
there are still two arguments for using the C order: If the stack grows
downwards then the C order will have the first parameter at the lowest
address, which is more logical for both caller and callee. The second
argument is that the first parameter will be closest the the address
pointed to by the stack pointer. If the caller and callee disagree on the
number of parameters because they are relying on different versions of a
function prototype, or whatever, then the C order will make errors in the
last parameters only while the Pascal order will have errors in all the
stack parameters. The C order will therefore make it easier to locate such
an error.
Another issue is whether the caller or the callee should clear the stack.
I will argue that it is safer to let the caller clear the stack. If the
callee clears the stack and the caller and callee happen to disagree on
the amount of stack space used, then the stack will be corrupted and the
system is likely to crash. But the system is not guaranteed to crash in
this case. All kinds of unpredictable and disastrous things can happen. It
is possible, for example, that a function pointer has been stored on a
stack space where the caller expects to find a saved return address after
the stack has been corrupted. This will cause it to continue in the
pointed-to function and the result will be unpredictable.
Efficiency is not an issue here because the stack clearing rule is
relevant only when calling a function with at least 25 parameters, and
possibly 49. This is a very rare situation anyway. The adjustment of the
stack pointer requires only a single instruction, which is likely to
execute out of order. The impact on performance is zero or at least
insignificant for such a large function.
The programming language is irrelevant here. These arguments apply to all
programming languages. The efficiency of tail calls is not affected by who
clears the stack. A tail call will save one stack clearing in both cases.
We need a rule for parameters that do not fit into any type of register.
I will propose that such parameters should be stored in memory and a
pointer to the parameter should be transferred as a parameter. The
function is allowed to modify the data that the pointer points to, unless
the parameter is declared const. It is the responsibility of the caller to
call any copy constructor or destructor of the parameter. It is an open
issue whether a function should be allowed to modify a parameter in a
varargs list.
Link register
Some systems store the return address of a function on the stack, while
other systems use a link register to hold the return address. The
advantage of a link register is that a leaf function can be called without
storing anything on the stack. This saves cache bandwidth in programs with
many leaf function calls. The disadvantage is that every non-leaf function
needs to save the link register on the stack before calling another
function and restoring the leaf register before returning.
RISC-V specifies that the link register is one of the general purpose
registers. I will argue that if a link register is used then it should be
a special register. A link register does not need to support all the
things that a general purpose register can do. If the link register is
included as a general purpose register then it will be tempting for a
programmer to save it to another general purpose register rather than the
stack, and then end the function by jumping to that general purpose
register. This will work, of course, but it will interfere with the way
returns are predicted. The branch prediction mechanism in modern
microprocessors use a special mechanism for predicting returns, which is
different from the mechanism used for predicting other jumps and branches.
This mechanism, which is called a return stack buffer, is a small rolling
cache that remembers the addresses of the last calls. If a function
returns by a jump to another register than the link register then it will
use the wrong prediction mechanism, and this will cause severe delays due
to misprediction of the subsequent series of returns.
The only instructions that are needed for the link register other than
call and return, are push and pop. We can reduce the number of
instructions in non-leaf functions by making a combined instruction for
"push link register and then call a function" which can be used for the
first function call in a non-leaf function, and another instruction for
"pop link register and then return" to end a non-leaf function.
Tail calls will be equally efficient with and without a link register.
High level language support for vectors
C/C++ compilers often have support for vector registers. This includes
system-specific types, e.g. __m256 for a 256-bit vector of 8
single-precision floats. If we want high-level language support for a
system with variable length vector registers then we must have a
system-specific addition to the programming language that defines such
vectors, e.g. __vector<float,8> for a vector register of 8 floats.
These types can be used as function parameters and function returns. The
definition of a function with a vector parameter of a certain size implies
that the CPU must support vectors of the specified size. Therefore, it is
possible to transfer such a vector parameter in a single register and
return such a vector in a single register regardless of its size.
It will also be useful to have a way of specifying vectors of variable
size in a high-level language where the size is specified in a separate
parameter.
Software pipelining
Support for software pipelining was proposed. This requires a rolling
bank of registers that can be allocated to a loop. Software pipelining can
improve the performance of complex loops in cases where out-of-order
scheduling would otherwise be a bottleneck. Support for software
pipelining would be a complicated addition to the architecture, and it
would be irrelevant for the quite common cases where performance is
limited by memory bandwidth, cache performance, etc. Some of the
mechanisms are patented, which could be an obstacle to including it in an
open standard.
I think that we should allow experimentation in this area and be prepared
for a future extension of the standard with optional support for software
pipelining.
Further discussion topics
RISC-V proposes a 128-bit addressing mode. I don't understand what such a
huge address space can be used for. 64-bit addressing gives us more than 1019
bytes of address space. This is more than we will need in any computer in a
foreseeable future. Some have argued that 128-bit addresses can be useful in
CPU clusters and clouds. But I don't think that a CPU at one node in a
cluster should be allowed to directly address a RAM cell at a different
node. 128-bit addresses would mean that every entry on the stack would use
16 bytes, most of which would be zero. This would be a waste of precious
cache space. The system would certainly be simpler if the only allowed
addressing mode is 64 bits.
McCalpin proposes a reconfigurable register space. This would add a lot
of complication to the instruction set as well as to the hardware. I am
suspecting that the hardware necessary for supporting register
reconfiguration would take up more silicon space than simply allowing all
the registers to have full size. |