Agner's CPU blog

 
Proposal for an ideal extensible instruction set
Author: Agner Date: 2015-12-27 06:46

Introduction

An instruction set is a standardized set of machine instructions that a computer can run.

There are many instruction sets in use. Introducing a new instruction set is not easy because it breaks compatibility with existing software and hardware. Therefore, the successful introduction of a new instruction set is a rare occurrence in the evolution of computer technology, while extensions to existing instruction sets occur frequently. Some commonly used instruction sets are poorly designed from the beginning and have been amended with many extensions and patches. One of the worst cases is the widely used x86 instruction set family. This instruction set is the result of a long series of short-sighted extensions and patches. The result of this development is a very complicated code system which is difficult and costly to decode in a microprocessor. We need to learn from past failures in order to be prepared to make better choices from the start, in case the opportunity to design a new instruction set should come up. The purpose of this article is to construct an example of a new instruction set that is better designed from the start, based on the experience we have with existing instruction sets. The following principles are important to have in mind:

  • The instruction set should have a simple and consistent design.
  • The instruction set should represent a suitable compromise between the RISC principle that enables fast decoding, and the CISC principle that makes more efficient use of code cache resources.
  • The design should be expandable so that new instructions and extensions can be added in a consistent and predictable way.
  • The instruction set should be designed through an open process with the participation of the international hardware and software community.
  • The instruction set should be non-proprietary and allow anybody to make compatible software, hardware and equipment for test, debugging and emulation.
  • Decisions about design and extensions should not be determined by the short-term marketing considerations of an oligopolistic microprocessor industry, but by the long-term needs of the entire hardware and software community.
  • The design should allow application-specific extensions.

The problems with the x86 instruction set are discussed in my blog article Stop the instruction set war. See also Krste Asanović and David Patterson: "The Case for Open Instruction Sets. Open ISA Would Enable Free Competition in Processor Design". Microprocessor Report, August 18, 2014.

Basic architecture

Instruction format

A pure RISC instruction set has the advantage that all instructions have the same length. This makes it easy to decode multiple instructions in parallel. But it has the disadvantage of using a lot of precious space in the code cache. A CISC instruction format can have a variable instruction length. The well-known x86 format allows instructions of any length from 1 to 15 bytes. This makes the code more compact, but it is very complicated and expensive to decode. It is difficult for the microprocessor to decode multiple instructions in parallel because it needs to find the length of the first instruction before it knows where the second instruction begins, and the instruction length is determined by a complicated algorithm involving many elements of the instruction. Instruction decoding is therefore often a serious bottleneck.

The proposed instruction format is a compromise between these two principles. Instructions can have a few standardized lengths and formats, and the determination of the length is simple. This allows for smaller instructions to save size, and longer instructions when there is a need for more bits for address, data, registers or extra options. Many instructions exist in multiple versions with different sizes. The instruction format is completely orthogonal in the sense that the same instruction can be specified with register, memory or immediate operands, different integer sizes, different floating point precisions, different vector lengths, and different addressing modes.

An instruction can use one, two or three dwords of 32 bits each - that is 32, 64 or 96 bits. No other sizes are permitted. Instructions must be aligned to dword addresses. The first two bits (most significant bits) of the first dword of an instruction indicate the length, as modeled in the sketch after this list:

00 = 1 dword
01 = 1 dword
10 = 2 dwords
11 = 3 dwords
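
A minimal C sketch of this length determination (my illustration, not part of the proposal):

#include <stdint.h>

/* Return the length in dwords of an instruction, given its first dword.
   Only the two most significant bits are needed. */
static unsigned instruction_length(uint32_t first_dword) {
    unsigned top2 = first_dword >> 30;   /* 00, 01, 10 or 11 */
    if (top2 < 2) return 1;              /* 00 or 01 -> 1 dword   */
    return top2;                         /* 10 -> 2, 11 -> 3 dwords */
}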

In order to further save space in the code cache, there can be certain instructions which do multiple operations in a single short instruction, such as:

  • Set multiple registers to the same constant value (typically zero). The first register and the number of registers is specified.
  • Read multiple registers from consecutive memory addresses. Optionally increment pointer by the size. Can use the stack pointer or any other pointer register.
  • Save multiple registers to consecutive memory addresses. Optionally increment or decrement pointer by the size. Can use the stack pointer or any other pointer register.
  • Combined arithmetic operation and conditional jump.

A dword of all zeroes is a nop (no operation). The processor is allowed to skip nops as fast as it can. These nops can be used as fillers, but not as timing delays.

Registers

There are 32 universal registers, named r0 - r31. The proposed instruction set has only one type of register. These registers can be used for all types of data: Boolean; 8-, 16-, 32-, 64- and (optionally) 128-bit signed and unsigned integers; floating point numbers with single, double and (optionally) quadruple precision; pointers; references; flags; and predication masks. This reduces the number of different instructions because the same instruction can be used on different types of data, and because no instructions are needed for transferring data from one type of register to another. For example, the same 'AND' instruction can be used for operations on Booleans, for manipulating bits in integers, for manipulating the sign bit of floating point numbers, and for manipulating predication masks.

The same registers can also be used as vectors of any of these data types. The microprocessor must support vectors of at least 128 bits. Support for larger sizes is optional. Vector sizes up to 4096 bits can be specified by 3 bits in the instruction code (see the vector length field below). It is also possible to specify the largest available vector size in an instruction. This can be anything from 128 bits and up, with no upper limit. Software can take advantage of future extensions by specifying the largest available vector size. The largest size can be modified by a control register to any power of 2 from 128 to the largest size supported by the microprocessor.

The unused part of a register is always set to zero whenever a register is modified. No instruction leaves part of a register unchanged except for instructions intended for blending or interleaving data. This is important in order to avoid false dependencies on the previous value of the full register, which is known to cause serious performance problems in some existing processors (known as partial register stall). The processor does not actually need to spend power on setting all the superfluous bits to zero. Typically, it will simply turn off the unused parts of execution units and data buses in order to save power.

Stack

There is one stack. The stack register is r31. Including the stack register as one of the universal registers makes it possible to use it as a base pointer in memory addressing and to modify the stack frame with arithmetic instructions. The stack register only needs to be 64 bits.

Instruction pointer

The instruction pointer is 64 bits. It is not included in the universal registers. The reason for this decision is to avoid the possible modification of the instruction pointer by arithmetic instructions, which would make branch prediction difficult.

Flags

There is no dedicated flags register. Registers r1 - r7 can be used as predicate registers or mask registers. Many instructions can be predicated. A predicated non-vector instruction will use one of these registers as predicate, and execute the instruction only when bit 0 in the predicate register is 1. The predicate register is thus also a Boolean variable. Execution is unconditional when r0 is specified as the predicate register.

The predication mechanism can be vectorized. A predicate vector is also known as a mask. A masked vector instruction works in the following way. Each element in the vector is processed only if the corresponding element in the mask register has 1 in its least significant bit. The mask register is treated as a vector of Booleans, where each element in the Boolean vector has the same number of bits as the data vectors in the instruction, and only the least significant bit in each Boolean vector element is used, while the remaining bits are ignored. (Other systems use the most significant bit, or all bits, in the mask, but it is preferred to use the least significant bit for the sake of compatibility between Boolean scalars and Boolean vectors). Results that are masked off are either unchanged or set to zero, depending on the instruction. Some instructions support both options to be selected with a feature bit.
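
To make the masking rules concrete, here is a C model of a masked vector addition on 32-bit elements. The function name and the zero_masked flag are illustrative stand-ins for the feature bit mentioned above:

#include <stddef.h>
#include <stdint.h>

/* Model of a masked vector add: element i is computed only if bit 0 of
   mask element i is 1; masked-out elements are zeroed or left unchanged. */
static void masked_add32(uint32_t *dst, const uint32_t *a, const uint32_t *b,
                         const uint32_t *mask, size_t n, int zero_masked) {
    for (size_t i = 0; i < n; i++) {
        if (mask[i] & 1)            /* only the least significant bit counts */
            dst[i] = a[i] + b[i];
        else if (zero_masked)
            dst[i] = 0;             /* zeroing option */
        /* else: leave dst[i] unchanged */
    }
}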

Instructions for extended precision arithmetic, such as add-with-carry and subtract-with-borrow, work in the following way. One register is specified in the predicate register field of the instruction code. Bit 0 of this register is used as both carry-in and carry-out. The traditional arithmetic flags are written to a few bits of the predicate register:

bit 0: carry flag
bit 1: zero flag
bit 2: overflow of signed arithmetic
bit 3: sign bit
bit 4: negative = sign xor overflow

If the predicate register for an add-with-carry instruction is specified as r0 then the carry-in will be 0, but the arithmetic flags for the result will be written to r0. Shift and rotate instructions can output a carry to a predicate register, but may not have a carry-in. There are no rotate-through-carry instructions, but an add-with-carry of a register to itself can be used as a rotate-left-through-carry (rotate-through-carry instructions are rarely used anyway, and they are very inefficient on many processors). Integer and floating point compare instructions also produce these flags.

The carry mechanism can be vectorized so that multiple add-with-carry operations can be executed in parallel.
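
A C model of the scalar case may clarify the flag conventions. This is a sketch of my reading of the rules above, with bit 0 of the predicate register as both carry-in and carry-out:

#include <stdint.h>

/* Model of add-with-carry: *pred supplies carry-in in bit 0 and receives
   the flags listed above (carry, zero, signed overflow, sign, negative). */
static uint64_t add_with_carry(uint64_t a, uint64_t b, uint64_t *pred) {
    uint64_t cin   = *pred & 1;
    uint64_t sum   = a + b + cin;
    uint64_t carry = (sum < a) || (cin && sum == a);  /* unsigned overflow */
    uint64_t zero  = (sum == 0);
    uint64_t ovf   = (~(a ^ b) & (a ^ sum)) >> 63;    /* signed overflow   */
    uint64_t sign  = sum >> 63;
    *pred = carry | (zero << 1) | (ovf << 2) | (sign << 3)
                  | ((sign ^ ovf) << 4);              /* negative = sign xor overflow */
    return sum;
}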

Branches

Branching is done with combined arithmetic-and-branch instructions. These are ALU instructions such as add, subtract, compare, bit test, etc. combined with a conditional jump, for example: subtract and jump if not zero, compare and jump if above, test a specific bit and jump if it is zero. These instructions cannot be vectorized. The vector size field is used as a condition code instead. There is no need to support predicated jump instructions because these can be replaced by a combined bit test and conditional jump instruction. Multiway branches can be implemented with indirect jumps or indirect calls.

Debug and interrupt flags

There are various control registers which can be used for debugging purposes, interrupt control, etc.

Addressing modes

The address space uses 64-bit addresses only. Addresses are always relative to the instruction pointer, stack pointer or a register pointer. Absolute addressing need not be supported. The following addressing modes are supported:

  • Instruction pointer + 32 bit sign-extended offset
  • Instruction pointer + index register + 32 bit sign-extended offset
  • Base register + 8 or 32 bit sign-extended offset
  • Base register + index register + 32 bit sign-extended offset
  • Base register + scaled index register + 32 bit sign-extended offset

The size of data operands or vector elements is always specified in the instruction. This size is used as a scale factor which is applied to all 8-bit offsets. For example, if the operand size is 32 bits = 4 bytes, then any 8-bit offset is multiplied by 4. 32-bit offsets are never scaled. The index register can also be scaled by the operand size.

Direct conditional and unconditional jumps and calls are always relative to the instruction pointer with 8-bit or 32-bit sign-extended offset, scaled by 4 because all instructions are aligned to addresses divisible by 4.
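
The scaling rules can be summarized in a short C sketch. The function names are mine, and the address calculation shown is the "base + scaled index + offset" variant:

#include <stdint.h>

/* Model of 'base + index * operand size + 8-bit offset': the 8-bit offset
   is scaled by the operand size; a 32-bit offset would be used unscaled. */
static uint64_t effective_address(uint64_t base, uint64_t index,
                                  unsigned opsize_bytes, int8_t offset8) {
    return base + index * opsize_bytes + (int64_t)offset8 * opsize_bytes;
}

/* Jump and call targets: IP-relative offset scaled by the 4-byte alignment. */
static uint64_t jump_target(uint64_t ip, int32_t offset) {
    return ip + (int64_t)offset * 4;
}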

CPUID

A CPUID instruction must have functions for telling whether optional features are supported, e.g. 128-bit integers, quadruple precision floating point, and the maximum vector size for each type of operands. There should also be features for telling how efficient certain instructions are, to help software determine the optimal coding version.

Proposed code structure

An instruction code contains a combination of the fields described below, where some of the fields can be omitted. The total size of all the fields must be 32, 64 or 96 bits.

Instruction length: 2 bits.

00 = 1 dword = 32 bits
01 = 1 dword = 32 bits
10 = 2 dwords = 64 bits
11 = 3 dwords = 96 bits

Instruction format: 2 or more bits.

Each combination of the instruction length and instruction format bits defines a class of instructions having a particular combination of the remaining fields. In other words, the combination of instruction length and instruction format bits determines which of the following fields are present, and their sizes.

Instruction code: 6 or more bits.

These bits are used for distinguishing the individual instructions, such as add, move, jump, etc. The number of instruction code bits is simply the number of bits not used for anything else. Therefore, the number of instruction code bits can be different for different instruction formats. The instruction code bits are not necessarily contiguous if the placement of other fields at fixed positions has higher priority in the design.

Register: 1 - 4 fields of 5 bits each.

Used for operand registers, base pointers and index registers.

Predicate register: 3 bits.

Specifies a register used for predicated scalar instructions, masked vector instructions, and flags input and output. Only r1 - r7 can be used as predicate register. r0 means no predicate.

Operand size and type: 3 bits.

Defines the type, size and precision of operands, integer or floating point. The size for integer operands can be 8, 16, 32, 64, and optionally 128 bits. The precision of floating point operands can be single, double, and optionally quadruple precision. Half precision is not supported, except in conversion instructions.

Vector length: 3 bits.

Specifies the length of vectors in bits. Possible values are: scalar, 128, 256, 512, 1024, 2048, 4096, and max. Support for values above 128 is optional. When "scalar" is specified, the operand size is simply as determined by the operand size/type field. A vector will contain as many elements of the specified operand size as can be contained in the vector size. For example, a 256-bit vector can contain 8 elements of 32 bits each. The "max" specification gives the largest vector size supported by the processor. This value depends on the processor and must be a power of 2. The minimum allowed value is 128, with no upper limit. The max value may be different for different operand sizes. A piece of software can take advantage of future extensions by specifying the max vector size. The max value can be reduced by settings in a control register.
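
A sketch of how this 3-bit field could decode. The assignment of the six fixed sizes and the scalar value to code values is my assumption; only 111 = max is implied by the text:

#include <stdint.h>

/* Decode the vector length field. 0 means scalar (size taken from the
   operand size field); 7 selects the processor's maximum vector size. */
static uint32_t vector_length_bits(unsigned field, uint32_t max_supported) {
    static const uint32_t sizes[8] =
        { 0 /*scalar*/, 128, 256, 512, 1024, 2048, 4096, 0 /*max*/ };
    return (field == 7) ? max_supported : sizes[field & 7];
}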

Addressing modes: 2 bits.

The following addressing modes are defined for memory operands. An instruction can have no more than one explicit memory operand with this specification.

00: IP + index + 32-bit offset (specify r31 for no index)
01: base + 8- or 32-bit offset
10: base + index + 8- or 32-bit offset
11: base + index * operand size + 8- or 32-bit offset

The base and index registers are specified in register fields. The offset size (8 or 32 bits) depends on the instruction format. 8-bit offsets are always multiplied by the specified operand size in bytes. For example, an operand size of 32 bits = 4 bytes will multiply the value in the offset field by 4. 32-bit offsets are not multiplied by this factor. Offsets are always sign-extended. It is not required to support 16-bit or 64-bit offsets or absolute addressing.

Jumps and calls have an offset of 8 or 32 bits relative to the instruction pointer. This offset is multiplied by 4 because all instructions have sizes that are multiples of 4 bytes.

Address offset: 8 or 32 bits.

This field is used as specified above under addressing mode.

Immediate data operand: 8, 32 or 64 bits.

An 8-bit immediate value is interpreted as an integer, sign extended to the specified operand size. The signed value is converted to floating point if a floating point operation is specified.

A 32-bit or 64-bit immediate is interpreted as an integer for integer operations or a single or double precision float for floating point operations. The integer immediate constant is sign-extended if necessary. The floating point immediate constant is converted to the desired precision if necessary.

16-bit immediates are not necessarily supported.
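
In C terms, the conversions described above amount to the following (a sketch of the semantics; in the real processor this work is done in the decoder):

#include <stdint.h>

static int64_t imm8_to_int(int8_t imm)  { return imm; }          /* sign-extend    */
static double  imm8_to_fp(int8_t imm)   { return (double)imm; }  /* integer -> FP  */
static double  imm32_to_fp64(float imm) { return (double)imm; }  /* single -> double */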

Rounding mode: 2 bits.

Optionally specifies the rounding mode used in floating point operations and conversions. Possible values are: round to nearest or even, round down, round up, truncate towards zero. The default value if there is no rounding mode field is "round to nearest or even". This option field is useful in float-to-integer conversion instructions, but rarely needed in other contexts. May be included in long versions of floating point instructions.

Exception control: 1 bit.

Enables or suppresses interrupts in case of numerical errors. This can be used for controlling exceptions in case of overflow and other errors in floating point operations. Can also be used for checking for overflow in integer arithmetic. An unsigned integer compare instruction with exception enabled can be used for checking if an array index is out of bounds. This feature may be included in long versions of arithmetic instructions.

Broadcast: 1 bit.

If 1, specifies that the last source operand is a scalar to be broadcast into all the vector elements. (Unnecessary when this is an immediate operand).

Zero masked data: 1 bit.

Specifies whether masked-out elements are set to zero or left unchanged. This bit may replace the broadcast bit on instructions with only register operands, or it may be a separate bit.

Other fields.

Other fields may be added if specific features are needed. Alternatively, an immediate data field may be used for specifying additional feature options.

Formats.

Commonly used instructions may be implemented in several different formats and instruction lengths, preferably with the same value in the instruction code bits. For example, an addition instruction, A = B + C, might be implemented in the following formats:

  • 3 registers (1 dword).
  • 2 registers and a predicate (1 dword).
  • 1 register and a predicate and an 8-bit immediate (1 dword).
  • 1 register and a memory operand with base and 8-bit offset, scalar only (1 dword).
  • 2 registers and a predicate and a 32-bit immediate (2 dwords).
  • 1 register and a memory operand with base and 32-bit offset (2 dwords).
  • 2 registers and a predicate and a memory operand with base and index and 32-bit offset (3 dwords).
  • 2 registers and a predicate and a 64-bit immediate (3 dwords).

The destination and the first source operand share the same register in some of these cases. The operand size and vector length bits can be used for specifying integer or floating point operands of different sizes and precisions in scalars or vectors of different sizes. In other words, the different variants of this instruction can be used for adding a register variable, a memory variable, or a constant to integers of any size as well as floating point variables of any precision in scalars and vectors of any size.

Combined ALU and conditional jump instructions may be implemented in the following formats:

  • 2 registers of 32-bit integers only, a condition code and an 8-bit displacement (1 dword).
  • 2 registers, a condition code and a 32-bit displacement (2 dwords).
  • 1 register, an 8-bit immediate, a condition code and a 32-bit displacement (2 dwords).

These instructions have no vector length field but a condition code instead. Floating point operands are not necessarily supported.

3-input instructions, such as fused multiply-and-add may be implemented in the following formats:

  • 3 registers and a predicate and option bits (2 dwords).
  • 2 registers and a predicate and a memory operand with base and index and 32-bit offset and option bits (3 dwords).

The destination register is the same as one of the source operand registers. The options include 4 bits for specifying sign change for even and odd vector elements of the addend and for even and odd vector elements of the product, respectively. This will cover all possible combinations of multiplication with addition, subtraction and alternating add/subtract in a single instruction.
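
A C model of these options, looping over vector elements. The assignment of the four option bits is my assumption:

#include <stddef.h>

/* Model of fused multiply-add with sign options: d[i] = +/-(a[i]*b[i]) +/- c[i],
   where bits 0/1 of opts negate the product for even/odd elements and
   bits 2/3 negate the addend for even/odd elements (assumed layout). */
static void fma_signed(double *d, const double *a, const double *b,
                       const double *c, size_t n, unsigned opts) {
    for (size_t i = 0; i < n; i++) {
        unsigned odd = (unsigned)(i & 1);
        double prod = a[i] * b[i];
        double add  = c[i];
        if (opts & (1u << odd)) prod = -prod;
        if (opts & (4u << odd)) add  = -add;
        d[i] = prod + add;  /* the real instruction reuses a source register as destination */
    }
}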

Less commonly used instructions may be implemented in just one or a few different formats.

FPGA

The microprocessor can have an optional FPGA or similar programmable hardware. This structure is used for making application-specific instructions or functions, e.g. for coding, encryption, data compression, signal processing, text processing, etc. This reduces the need for hard-coded application-specific instructions.

If the processor has multiple CPU cores then each core may have its own FPGA. The hardware definition code is stored in its own cache for each core. The operating system should prevent, as far as possible, the same core from being used for different tasks that require different hardware codes. If this cannot be avoided then the code, as well as the contents of any memory cells in the FPGA, must be saved on each task switch. This saving may be done lazily, i.e. the contents are only swapped when the second task needs the FPGA structure that contains code for the first task.

Extensibility

The evolution of the x86 instruction set is full of short-sighted decisions with no room for future extensions. All kinds of weird patches have been used to extend an instruction set that was never designed to be extensible. We must learn from these mistakes and consider the predictable need for future extensions when designing an instruction set.

There is reason to suspect that many of the instructions in the current x86 instruction set have been added with short-sighted marketing reasons in mind. Every new generation of microprocessors must have some new features that the competitors don't have, or new features that can be hyped to make customers buy the new version, according to the marketing logic. Some of these instructions are now obsolete, but still supported by the hardware.

The design of a stable instruction set should not be subject to competition and marketing whims, but designed by an open process with participation of the international hardware and software community, similar to the standardization work in other technological areas. A collective decision process reduces the risk of mistakes and short-sighted decisions.

The proposed instruction set is orthogonal, which reduces the number of different instructions. The inclusion of an FPGA reduces the need for application-specific instructions.

In addition to these considerations, it is necessary to add space for future extensions of the instruction set. Some of the instruction formats should have a large number of unused instruction code bits that can be used for future instructions or option bits. A few instruction format codes should be reserved for future extensions. All codes that begin with 111, i.e. 11 in the instruction length bits and 1 in the first bit of the instruction format field, should be reserved for future extensions. These bits could be used in the future either for more 3-dword formats with many instruction bits, or for 4-dword formats. This decision will be left to the future.

An attempt to execute an instruction with an unknown instruction code should cause an interrupt (exception) in most cases. This makes it possible to emulate new instructions on old microprocessors. In some cases, however, it is desirable to make extensions that do not cause interrupts on microprocessors that don't support them. Historically, this has been done with extensions that affect performance, but not functionality, such as memory prefetching and branch prediction hints. Extensions of this kind can be made by using various option bits in contexts where they previously made no sense, for example rounding mode bits in an integer instruction. Thus, the processor should ignore certain unused option bits in certain instructions to make this kind of performance extension possible. Also, a small range of instruction codes should be reserved for future performance-tuning instructions, which will be ignored on processors that don't support them. To be more specific, we will have three categories of unused codes:

  1. Code reserved for future use. Generates interrupt so that it can be emulated.
  2. Code reserved for future use. Generates no interrupt, but behaves as a nop (no operation). Can be used for future purposes that allow the code to be ignored on processors that do not support it.
  3. Code guaranteed to never be used. Will generate interrupt also on all future processors. Can be used for application-specific emulation or forced error messages.

Extending the vector register size

Extension of the vector register size is straightforward without the need to define any new instructions. This makes it possible for software to utilize a new extended vector size even without recompilation. The software can simply specify the maximum vector size and get information from a CPUID instruction about what this maximum vector size is.
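
For example, a loop written against the maximum vector size might look like this in C. Here max_vector_bytes() is a stand-in for the CPUID query, not a real API:

#include <stddef.h>

/* Stand-in for the CPUID query; a real implementation asks the processor. */
static size_t max_vector_bytes(void) { return 16; }  /* pretend 128 bits */

/* The loop strides by the queried vector size, so the same binary uses
   wider registers automatically on a future processor. The inner loop
   models a single max-size vector instruction. */
static void add_arrays(float *d, const float *a, const float *b, size_t n) {
    size_t velems = max_vector_bytes() / sizeof(float);
    size_t i = 0;
    for (; i + velems <= n; i += velems)
        for (size_t j = 0; j < velems; j++)
            d[i + j] = a[i + j] + b[i + j];
    for (; i < n; i++)                      /* scalar remainder */
        d[i] = a[i] + b[i];
}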

We have seen in the history of x86 processors that the first processor generation to support a new and larger vector size has typically done so with poor efficiency. In most cases, it has used a half-size execution unit twice to do a full-size vector operation. This was not necessarily a bad design choice because the software that supports a new instruction set extension typically lags several years behind the hardware.

The situation is different with the extension mechanism proposed here. The software will be able to utilize a vector size extension immediately. The microprocessor should not allow a vector size so large that it executes less efficiently than software using the next smaller vector size twice. It may still be worthwhile to allow a vector size that is larger than the execution unit and use this unit multiple times to process a full-size vector. This will save bandwidth in the decoder and use fewer registers than the alternative of repeating the instruction in software. The CPUID instruction should provide complete information about this. This means that it should specify both the maximum vector size that can execute at full throughput and the maximum vector size that can execute at all. These values may be different for different types of operands.

Portability

The ABI, object file format, etc. should be standardized as far as possible in order to allow the same code to be compatible with different operating systems and platforms. This would make it possible, for example, to use the same function libraries in different operating systems. This can easily be achieved for libraries that are doing some mathematical operation and not using any system functions. A more ambitious goal is to establish portability even when some common system functions are involved, such as time functions or handling of multithreaded code. Most importantly, there should be a portable way of generating error messages from a function library. This could be obtained with an error message instruction. This instruction should generate an interrupt (throw an exception). A few register operands can contain codes indicating the type of error, and a memory operand can point to an error message text. All platforms should be able to handle this error condition in a way that is appropriate for the type of user interface. In console mode applications, for example, the error message should go to the stderr output. In graphical user interface (GUI) applications, the error message should be shown in a pop-up window, or whatever method is appropriate for the specific GUI framework.

Error messages should be in the English language by default, with an optional feature for multi-language support. We can expect the need for multi-language support to be decreasing. The problems with multi-language applications are discussed in this document.

ABI and calling conventions

This is an example of how an efficient ABI (Application Binary Interface) can be designed.

Function calls will use registers for parameters as far as possible. The first 24 parameters to a function are transferred in registers r0 - r23. Any additional parameters are transferred on the stack in C language order. These parameters are removed from the stack by the caller. The function return value is in r0. Push and pop instructions are rarely used. The return instruction has no offset. The stack is kept aligned by 128 before any call instruction.

Variable argument lists are transferred on the stack with 64 bits for each argument.

Tuples: A structure or class or encapsulated array for which all non-static elements have the same type is transferred in a single vector register if the total size does not exceed 128 bits.

Parameters that do not fit into a single register are transferred by a reference to a memory object allocated by the caller. This applies to: structures and classes with elements of different types, or bigger than 128 bits, or having a non-standard copy constructor or destructor or virtual member function. It is the responsibility of the caller to call any copy constructor or destructor. Any parameters beyond the first 24 parameters are transferred in the same way as if they were in a register, using 64 or 128 bits of stack space, as appropriate.

A return value that does not fit into a register is transferred to a memory location allocated by the caller. A reference to this memory location is treated as the first parameter (before any 'this' pointer).
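
As an illustration of these rules, consider a C prototype with the register assignments implied by the text given as comments (the types and names are mine):

#include <stdint.h>

typedef struct { float x, y, z, w; } vec4;      /* same-type elements, 128 bits */
typedef struct { int32_t i; double d; } mixed;  /* elements of different types  */

int64_t example(int64_t n,   /* r0: integer parameter                          */
                double  f,   /* r1: floating point, same universal registers   */
                vec4    v,   /* r2: tuple of <= 128 bits, one vector register  */
                mixed   m);  /* r3: passed by reference to a caller-made copy  */
/* The return value is passed back in r0. */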

There are no registers with callee-save status in the case of static linking. This is because the called function does not know the maximum vector register size required by the caller. Instead, there is a mechanism that allows the caller to know which registers are modified by the called function. This information is stored in the object file for static link libraries. The object file format must support this information, which must be stored in the same file as the library function in order to make sure it has the right version. Compilers supporting "whole program optimization" can read this information from a library file before allocating registers.

This mechanism cannot be applied to dynamic linking. Instead, dynamic link libraries are prohibited from modifying certain registers.

The object file format should be a modified ELF format. Dynamic linking should use the Windows DLL method rather than the UNIX shared object method. The code uses position-independent addressing as far as possible. Any remaining relocation is performed at load time. Symbol interposition is not used. This eliminates the need for the inefficient global offset table (GOT) and procedure linkage table (PLT).

Information used for exception handling and stack unwinding should use a standardized and portable table-based method. Debugging information should also be standardized.

Assembly language syntax

The syntax for x86 assembly code has never been officially standardized, but each assembler uses its own dialect. The definition of a new instruction set should include the definition of a standardized assembly language syntax, preferably with the destination operand first.

Summary of advantages

The instruction set proposed here is a compromise between the RISC and CISC principles. A RISC instruction set with a fixed instruction size makes it easy to decode multiple instructions in parallel, but it is a waste of precious code cache space. If the fixed instruction size is big enough to accommodate all possible instruction types, then it must necessarily be too big for the most common simple instructions and therefore take up too much space in the code cache. The code cache is a precious resource because it is impossible to make the cache bigger without also making it slower. A CISC instruction set with many different instruction lengths makes it difficult to decode multiple instructions in parallel, and this can be a serious bottleneck. The proposed instruction set has just a few standardized instruction lengths: one, two and three dwords of 32 bits each. The length of the instruction is determined by the first few bits. This makes it possible to determine the lengths of multiple instructions in a single clock cycle by a process that resembles the look-ahead carry mechanism in binary adders.
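
A serial C model of this boundary computation follows; real hardware would evaluate all positions in parallel with a prefix network, which is the carry-lookahead resemblance mentioned above:

#include <stddef.h>
#include <stdint.h>

/* Mark which dwords of a fetch block begin an instruction, using only the
   top two bits of each starting dword. */
static void find_starts(const uint32_t *dw, size_t n, uint8_t *is_start) {
    size_t i = 0;
    while (i < n) {
        is_start[i] = 1;
        unsigned top2 = dw[i] >> 30;
        unsigned len = (top2 < 2) ? 1 : top2;   /* 1, 2 or 3 dwords */
        for (unsigned j = 1; j < len && i + j < n; j++)
            is_start[i + j] = 0;
        i += len;
    }
}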

The instruction set is completely orthogonal. An ordinary arithmetic or logic instruction, such as e.g. addition, can have many different versions for different types of operands. It can handle integers of 8, 16, 32, 64, and possibly 128 bits, as well as floating point numbers of single, double, and possibly quadruple precision. The last source operand can be a register, a memory operand, or an immediate constant. The same instruction can handle scalars or vectors of any length. This makes programming simpler and reduces the number of different instructions.

There is only one type of register. The same registers can be used for many different purposes, including integers and floating point numbers of all different sizes and precisions, as well as for Booleans, flags, pointers and references. The registers can also be used for vectors and masks.

Many instructions can be predicated, so that the instruction is either executed or not, depending on a Boolean variable stored in a predicate register. The predicate mechanism can be vectorized, so that the operation is turned on or off for each element in a vector, depending on a mask register containing a vector of Booleans.

Instructions can have short versions that save space by using only two operands, by omitting certain option bits, by using an 8-bit scaled offset in a memory operand, or by using a signed 8-bit constant as the immediate operand. For example, a double precision floating point addition can have immediate operands of three different sizes: a signed 8-bit integer which will be converted to floating point, a single precision float, or a double precision float. This constant is available at an early stage in the CPU pipeline so that there is enough time for converting it to the necessary size and precision without delaying the execution. The need to fetch numeric constants from data memory is eliminated in most cases because numeric constants can be contained in the instructions. This will increase the speed and reduce the load on the data cache. In most cases, the code size will not be increased by the inclusion of numeric constants in the instructions because they replace the addresses (typically 32 bits) of constants stored in data memory, and because the constants will fit into smaller formats whenever possible. Immediate constants can even be used with vector instructions where the same constant will be applied to all elements in the vector.

The size of the registers is not fixed in the design. It is possible to make bigger and more powerful microprocessors simply by making the registers bigger so that they can hold larger vectors. This mechanism is orthogonal as well. There are three bits in the instruction code which determine the vector length (or a scalar). This makes it possible to write software for future microprocessors with bigger vector registers that do not exist yet. Setting the vector length bits to 111 will give the largest vector size that the microprocessor supports, whatever this is. This makes it simple to support all vector sizes in the same piece of software. This feature also makes it possible to save an entire register even though the maximum register size is not known when the software is compiled. This can be useful in task switches, exception handlers, device drivers and system libraries. There is no limit to how big the maximum vector size can be. A CPUID instruction will tell the software what the maximum vector size is, and there will be a feature that enables a software program to reduce the maximum vector size if it is excessive.

The conventions for function calling, as well as other ABI details, should be specified together with the instruction set. This will improve compatibility and make it possible to use the same function libraries with different compilers, different programming languages, and different operating systems. There are 32 registers. This makes it possible to use registers for function parameters in almost all cases.

Can existing instruction sets be fixed?

The commonly used instruction sets can be divided into two general types, RISC and CISC. The RISC instruction sets generally have a more or less fixed instruction size. All instructions have the same number of bits. The advantage of a RISC design is that the fetching and decoding of instructions is simple and fast. The disadvantage is that commonly used simple instructions take more space than necessary while complicated instructions do not fit into the limited instruction size. Instructions that need many bits for addresses or constants do not fit into the RISC design.

A CISC instruction set has a variable instruction length. The advantage of this is that simple, commonly used instructions can be as small as a single byte, while more complex instructions or instructions with large addresses or constants can have a length that fits the purpose. This provides optimal use of the code cache. The disadvantage is that it is complicated to decode the instructions. Modern microprocessors can execute three or four instructions in parallel in a single clock cycle if no data dependence prevents this. But it is difficult to decode multiple instructions simultaneously when you have to determine the length of the first instruction before you know where the second instruction begins. Therefore, the bottleneck in a CISC processor is quite often decoding rather than execution.

The present article has proposed a compromise between RISC and CISC. The widely used x86 instruction set is a CISC design. Most other instruction sets in common use today are RISC designs.

x86 instruction set

The x86 instruction set has a long heritage dating back to the 8086 processor in 1978, where code density was of prime importance. It has been developed through many generations of additions and extensions without ever losing backwards compatibility. It is a CISC instruction set where instructions can have any length from 1 to 15 bytes, and it is quite complicated to determine the length of each instruction. It has many different types of registers. With the latest extension, AVX-512, there are 16 general purpose registers of 64 bits each, 6 segment registers of which only 3 are used in 64-bit mode, 8 floating point registers of 80 bits each, 8 MMX registers of 64 bits each which are overlaid on the floating point registers, 32 vector registers of 512 bits each, 8 mask registers of 64 bits each, a flags register and an instruction pointer. This patchwork could certainly need a redesign. Can it be combined with the design principles that are proposed here?

An easy solution would allow the two kinds of code to be used interchangeably and mixed. The new instructions would have to use some bit patterns that are not already in use in the old system. The x86 instruction set has 20 byte-codes that are currently used only in 16-bit and 32-bit mode, mostly for obsolete instructions. These codes can be used for other purposes in 64-bit mode. Therefore, it is possible to make new extensions that can be used only in 64-bit mode. We would prefer 64-bit mode anyway, so it would be possible to make extensions that implement some of the principles described here and still preserve backwards compatibility, but this would still be only patches on a fundamentally flawed, inefficient and outdated design. The 20 unused code bytes are scattered around the code map with only a few adjacent code bytes, so it would be impossible to use more than a few of these code bytes without making the whole system completely messy. Most of the bits in the first byte of any new code would therefore be fixed and unusable in such a hypothetical new code design.

A better solution would be to implement a separate mode for the new instruction design and a system for switching between the new mode and the legacy modes. The improvement in performance that can be obtained with a new instruction design is probably not high enough to justify the complications of a dual code system. Instead, we should be prepared to seize the opportunity in case the need for a major revision should arise for other reasons. It is not possible to make a decoder that translates the old codes to the new ones at runtime, because the new system does not support the many different types of registers that the old system has. A translation from the new system to the old one is also not possible. Instead, we would need two separate decoders that translate the old and the new codes, respectively, to the internal micro-operation format. This micro-operation format probably needs to be expanded to make space for 64-bit immediate constants, but the extra bits can be disabled when they are not needed, in order to save power.

The existing execution units could relatively easily be modified to support the new code design. The 32 universal registers of the new design should obviously be identical to the 32 vector registers of the old design. Combined ALU-and-conditional-jump instructions are already implemented internally in both Intel and AMD processors even though they are not available as x86 instructions.

It is a problem that many current processors have their execution units divided into two main clusters: One cluster is connected to the general purpose registers and handles integer scalar operations, pointer addressing and jumps. The other cluster is connected to the floating point and vector registers and handles all floating point and vector operations. All transfers of data between these two clusters typically involve a delay of one clock cycle. This two-cluster design would be a problem for the new instruction set where all units need access to the same register file.

Itanium instruction set

The Itanium instruction set is a very ambitious RISC instruction set. Itanium instructions are joined into bundles with a fixed size of 128 bits, containing three instruction codes of 41 bits each and a 5-bit template. The three instructions in a bundle will execute in parallel. This explicit parallelism puts a lot of work on the compiler to schedule instructions that can execute in parallel without violating the program logic. The Itanium has a rotating register stack where each program function allocates the number of registers it needs. It has many other advanced features, such as explicitly speculative instructions. The Itanium design has not been very successful, due mainly to the difficulties of making a suitable compiler. Another obstacle to the commercial success of the Itanium was poor support for backwards compatibility with existing software. The Itanium system is so different from other systems that there would be little advantage in combining it with a new instruction set.

Other RISC instruction sets

Most other commonly used instruction sets today are RISC designs. These designs are generally simple and efficient. The instruction length is typically 32 bits. Some systems, such as ARM Thumb-2 and AVR32, can use a mixture of short 16-bit instructions and longer 32-bit instructions. Most systems have several different register types. Some RISC processors support vector instructions with 128-bit vectors. There is a limit to the number of different instructions that can be coded in an instruction with a fixed size. It is a general problem with RISC instruction sets that they cannot support complex instructions with many option bits. This makes it difficult to add new options and features that the x86 instruction set has, such as masked vector operations, options for controlling rounding mode, etc. The limited instruction size of the RISC systems is also a problem where large addresses or large numeric constants are needed. It is not possible to define a large numerical constant or a jump to a distant address with a single instruction in a RISC design with a limited instruction size. Some of the RISC processors already have support for more than one instruction set and features for switching between these modes. An additional mode for a new instruction set could be added to these processors without serious problems.

   
Itanium
Author: Ethan Date: 2015-12-28 01:04
Agner, what's your opinion on the Itanium instruction set in isolation, assuming a compiler is written and backwards compatibility does not matter?
   
Itanium
Author: Agner Date: 2015-12-28 01:46
Ethan wrote:
Agner, what's your opinion on the Itanium instruction set in isolation, assuming a compiler is written and backwards compatibility does not matter?
The advantage of the Itanium instruction set was of course that decoding was easy. The biggest problem with the Itanium instruction set was indeed that it was almost impossible to write a good compiler for it. It is quite inflexible because the compiler always has to schedule instructions 3 at a time, whether this fits the actual amount of parallelism in the code or not. Branching is messy when all instructions are organized into triplets. The instruction size is fixed at 41 bits, and 5 bits are wasted on a template. If you need more bits and make an 82-bit instruction, then it has to be paired with a 41-bit instruction.
   
Proposal for an ideal extensible instruction set
Author: hagbardCeline Date: 2015-12-28 04:08
You should take a look at RISC-V [1], which satisfies all the requirements you set, has broad involvement of academia and strong interest by the industry.

[1] riscv.org

   
Proposal for an ideal extensible instruction set
Author: Agner Date: 2015-12-28 07:28
hagbardCeline wrote:
You should take a look at RISC-V [1], which satisfies all the requirements you set, has broad involvement of academia and strong interest by the industry.

[1] riscv.org

Thank you for the reference to RISC-V. I remember reading about it years ago, but couldn't remember the name. I tried in vain to find it with Google.

RISC-V does indeed cover many of the same principles that I talk about. However, it seems to be more inspired by small systems of the past than by the bigger and more powerful high-end processors available today. A new ISA has to be future-oriented and performance oriented. Some of the things that I miss in RISC-V are:

  • It is not completely orthogonal
  • Arithmetic instructions cannot have memory source operands
  • Immediate constants have odd sizes. It is not possible to include floating point immediates, which I argue would be more efficient than loading floating point constants from data memory
  • There are no predicated or masked instructions
  • 128-bit integers are not supported, except as pointers in 128-bit address mode
  • Support for vectors is not well developed. Vector size is limited to 1024 bits
  • There is no way to save and restore a vector register that is guaranteed to be compatible with future extensions
  • Software has to be recompiled each time a different processor with different maximum vector size becomes available
  • There is no support for integer vectors, Boolean vectors, masked vector operations, broadcast, etc.
  • long int can be 32 or 64 bits. There is no standardized way of specifying 64-bit and 128-bit integers. This inconsistency is causing annoying compatibility problems today which need to be fixed in any new ABI.
But again, I like the idea behind RISC-V.
   
Proposal for an ideal extensible instruction set
Author: Adrian Bocaniciu Date: 2016-01-04 05:57
While I fully agree with the stated purpose of RISC-V, I strongly disagree with some of their choices for instruction encoding, especially with their addressing modes.

The features proposed by Agner are much closer to what I would consider a good ISA, and I have a lot of experience in programming in assembly language for a huge number of different ISAs, from antiquities like the IBM System/360, PDP-11, Intel 8080 or Motorola 6800 up to current ISAs like Intel/AMD, IBM POWER or ARM.

I only want to comment on the addressing modes, because many other ISA proposals, including RISC-V, do not seem to have any clue about how they are used in real programs.

There are only 2 possible choices for a set of addressing modes that would allow writing a loop with a minimum number of instructions.

The first possible choice coincides with the subset of the VAX 11 addressing modes implemented in Intel 80386, which, like in the Agner proposal, allows addresses with 3 components, a base register, a scaled index register and an offset included in the instruction.

This choice of addressing modes was probably the only feature of the Intel 80386 ISA that was better than in the earlier Motorola MC68020. Motorola chose to implement almost all the addressing modes of the VAX 11, not only the subset chosen by Intel, but the addressing modes omitted by Intel were not really useful, so eventually even Motorola abandoned them in the ColdFire processors.

The second possible choice for the set of addressing modes appeared initially (around 1980) in one of the IBM RISC processors that were later developed into IBM POWER. This choice was also adopted by ARM, after it was published by IBM at one of the early RISC conferences.

This second choice is to allow addressing modes with only 2 components, a base register and an offset either in a register or in the instruction, but to allow updating the base register with the computed address.

I believe that the IBM choice is somewhat better, but both choices are acceptable. Any other set of addressing modes, e.g. that of RISC-V, is wrong, because it requires the insertion of extra instructions for updating the addressing registers in almost all loops.

Even if the hardware could execute the extra instructions in parallel, there would be a waste of resources anyway, because the extra instructions would occupy decoder slots and space in the instruction caches.

The IBM choice has the advantage that it does not require a second address adder and a shifter for scaling, but the disadvantage that it requires an extra write port into the register file.

From a software point of view, the IBM choice has the advantage that it is universal, i.e. it can be applied to any loop, while the Intel 80386 choice can be applied only to loops where the data structures have been chosen carefully. The reason is that, in order to avoid extra address-updating instructions, the scaled index register must be the loop counter. Together with the limited set of scale values, this forces the ratios between the sizes of the elements of the arrays accessed in the loop to belong to the set of scale values (1, 2, 4, or 8 for Intel/AMD).
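To make the constraint concrete, here is a small C++ sketch of a loop that satisfies both models; the assembly in the comments uses hypothetical mnemonics and only shows where the address arithmetic goes.

    // A loop where both addressing-mode choices avoid extra update instructions.
    // The element sizes are 4 and 8 bytes, so their ratio is in the scale set {1, 2, 4, 8}.
    void scale_copy(const float *a, double *b, long n) {
        for (long i = 0; i < n; i++)
            b[i] = a[i];
    }
    // 80386-style, with i as the loop counter (hypothetical mnemonics):
    //     load  f0, [a + i*4]      ; base + scaled index, scale 4 for float
    //     store f0, [b + i*8]      ; same counter, scale 8 for double
    // IBM/POWER-style, with base-register update:
    //     load  f0, [a], +4        ; a += 4 after the access
    //     store f0, [b], +8        ; b += 8 after the access
    // Neither loop body needs a separate address-update instruction.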

Nevertheless, these constraints for data layout are acceptable in most cases.

In order to evaluate which choice is cheaper from a hardware point of view, it is necessary to know exactly the technology used for implementation. If a second write port would be needed anyway for the register file due to other reasons, then the IBM choice would be certainly cheaper.

So, in conclusion, the set of addressing modes proposed by Agner is certainly much better than that of RISC-V.

I also completely agree with the use of a set of general registers instead of a dedicated flag register.

There should also be a complete set of instructions that would allow the writing of efficient programs for multiple precision computation, e.g. the GMP library.

Despite the ugliness of most of the legacy part of the Intel ISA, during the last 10 years Intel has continuously improved its support for multiple precision computation, leaving all the competition far behind.

All the RISC ISAs have traditionally had bad support for multiple precision computation, even though that has nothing to do with the RISC principles. Even in the old days, when the need for encryption was not yet widely recognized, there were some users, like myself, who frequently executed that kind of instruction in scientific and technical applications.

   
Proposal for an ideal extensible instruction set
Author: Adrian Bocaniciu Date: 2016-01-04 06:41
While I agree that having universal registers would be much better than having separate integer & FP registers, I doubt that it is good to have overlapped scalar & vector registers.

I think that it would be better to have 32 scalar registers and 32 vector registers. I do not think that this implies any significant changes in the instruction encoding that you have in mind.

This would certainly simplify the task of the operating system and of interrupt routines in deciding which registers should be saved.

This would certainly also make any extension to much longer vectors easier. I have seen several opinions on the Internet, and I agree with them, that many features of the ISAs used by Cray and by a few other vector processors were actually much more convenient to exploit in software than the current MMX/SSE/AVX style of vector instructions.

In conclusion, I believe that separate scalar & vector registers would be simpler to use, because the scalar registers, having a known length, can be saved in a predictable way, without examining any state registers. You could make a sophisticated save unit that would save only the non-null parts of the vector registers, but it would insert an unpredictable delay that would not be acceptable for interrupt routines.

Moreover, any program already has distinct scalar & array variables, so mapping them to scalar & vector registers is trivial.

Like I said, I also think that a reading of the old Cray manuals that are freely available, e.g. at

bitsavers.informatik.uni-stuttgart.de/pdf/cray/

and comparing their vector instructions with AVX-512, to see which is best and how they can be extended to greater vector lengths, could be useful for improving your ISA proposal.

   
Proposal for an ideal extensible instruction set
Author: Adrian Bocaniciu Date: 2016-01-04 09:20
I agree with most of the features of the proposed ABI.

Nevertheless, I believe that a modern ABI should take into account that C is no longer the only language for which the ABI should be adequate.

Some requirements of other languages can be easily accommodated, e.g. for languages that allow returning multiple values, they can be placed in multiple registers starting with r0, exactly like the input arguments, not only in r0, like the single return value of C.

A much more important requirement of other languages is to allow the efficient implementation of procedures with tail calls, e.g. with tail recursion.

For this, the stack must be deallocated in the called procedure, not in the caller.

The one and only reason for the existence of the so-called C calling convention, where the caller deallocates the stack, is that it was a lazy solution to the (former) existence of lazy C programmers who called vararg functions, e.g. printf, without including the appropriate header where the prototype was declared (or who included pre-standard headers, where vararg functions were not marked). The compiler could therefore never know whether an external function was vararg or not, so it had to assume that all of them were.

If such practices are prohibited, as they should be, then the right implementation of vararg functions is that the compiler must add an extra hidden parameter, e.g. the old value of the stack pointer, that would allow the called procedure to correctly deallocate the stack.

In that case the ABI should specify that the callee must deallocate the stack. There is absolutely no advantage in deferring the deallocation until after the return.

Besides allowing efficient tail calls, this ABI rule would also reduce the size of the code, because it replaces multiple deallocation instructions from the callers with a single instruction in the callee.
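A minimal C++ illustration of the point (the function is mine, purely illustrative): with caller cleanup, the recursive call below cannot simply become a jump, because the caller would later deallocate the wrong argument area; with callee cleanup it can, and the recursion runs in constant stack space.

    // Tail recursion that callee-deallocation allows to compile to a plain jump.
    long sum_list(const long *p, long n, long acc) {
        if (n == 0) return acc;
        return sum_list(p + 1, n - 1, acc + *p);   // tail call
    }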


I am aware that the defenders of the C calling convention typically claim that its code size disadvantage is not so great, because the compiler may coalesce several deallocation instructions into only one inside the caller (if it includes multiple procedure calls).

I do not agree with this claim. The main environment where C is still dominant, and where code size is also essential, is in programs for embedded computers. However, that is also the environment where the stack size is severely constrained and deferring stack deallocation, to reduce the code size, greatly increases the risk of stack overflow, so that is not an acceptable solution.

   
Proposal for an ideal extensible instruction set
Author: Adrian Bocaniciu Date: 2016-01-04 10:11
I want to add something to my former post where I mentioned that the classic vector computers, e.g. Cray, could provide useful inspiration for a new ISA, besides the modern SIMD instructions, e.g. AVX-512.

I do not remember now who made this observation first, because I have read it somewhere, but I agree with it.

The main advantage of the classic vector computers was that they had a vector length register, which determined the size of the performed operation, and that length could have any value between 0 and the maximum length of the vector registers, i.e. it was not restricted to powers of two, like in the instruction field from your proposal or in AVX, Neon, etc.

This has the advantage that it simplifies considerably the code that must deal with the initial or final parts of the arrays whose sizes are not a multiple of the size of a vector register or which do not have a correct alignment.

For AVX-512 and longer vector registers it is likely that the code dealing with all the possible cases of lengths and alignments will be much larger than the code for the main loop, and it will increase in geometric progression for each new maximum vector length.

With a vector length register, this extra code will be much smaller and its size will remain unchanged when the size of the vector registers will be increased.

Unlike the separate scalar & vector registers, which I believe to be mandatory from the perspective of the operating system & ISRs, and which are a feature with minimal influence upon your proposed instruction encoding, the use of one or more vector length registers (maybe the registers r8 to r16, like the use of r0 to r7 for predicate registers) might require more significant changes in your proposal for instruction encoding, e.g. the use of those 3 bits to specify a vector length register instead of a length. You must therefore assess whether you believe it is worth it.

Like I said this is not my proposal and I do not remember right now where I have read it, but I found its argumentation compelling.

   
Proposal for an ideal extensible instruction set
Author: Adrian Bocaniciu Date: 2016-01-05 06:47
Just an additional explanation for my previous post.

I was not clear enough, but one possible modification of your encoding scheme for vector operations would be to keep 0 = scalar registers & 7 = vector registers of maximum length, but to have 1 to 6 specify a vector length register, e.g. r9 ... r14.

This would partially lose the advantage of your encoding scheme of allowing old programs to run unchanged on new processors with longer vector registers.

Nevertheless, that could still be done, by querying the maximum length of the vector registers with CPUID and acting accordingly. The ability to perform a vector operation with a specified vector length would make it easy to write generic programs that correctly process the final or initial part of the data arrays, regardless of the length of the implemented vector registers.

   
Proposal for an ideal extensible instruction set
Author: John D. McCalpin Date: 2016-01-05 10:47
Re: vector length issues:

Various folks at Berkeley have been looking at flexible vector architectures for microprocessors. A search for Krste Asanovic and vectors should point you at much of the recent work.
Their proposals have long included reconfigurable vector registers -- allowing a block of SRAM to be divided into different configurations of vector length vs number of vectors.
More recent proposals would allow the elements of different vectors to be of different sizes.

Asanovic has argued for an ISA that would allow decoupling the vector width from the parallelism of the underlying hardware, so a single binary could have its vector instructions pipelined through whatever number of "vector pipelines" an implementation happened to provide. The presentation slides are at riscv.org/workshop-jun2015/riscv-vector-workshop-june2015.pdf

I like the ideas, but have not looked at them in enough detail to form an opinion about their practicality.

I can say that I am getting very tired of trying to work around the limitations of existing SIMD vector ISAs. They are great when everything is lined up, but in that case you are almost always bandwidth-limited so the extra functional units don't help. They are a real pain when the data in the registers needs to be rearranged, which is the most common way that physics-based codes generate computational intensity.

   
Proposal for an ideal extensible instruction set
Author: Adrian Bocaniciu Date: 2016-01-06 07:12
Yes, you are right, this presentation from Berkeley by Krste Asanovic about the need for the resurrection of the vector computers was what I had in mind.

While most of the features of the ISA proposed by Agner would be significant improvements over RISC-V, the inclusion of conventional packed SIMD instructions would be much less desirable than the kind of vector ISA extension proposed for RISC-V.

I am currently using the hardware that during 2015 had the best performance-per-watt and the best performance-per-dollar for double-precision computations (Xeon D-1540 + FirePro W8100), so I have no complaints about the peak performance achievable.

Nevertheless, approaching that peak performance requires a lot of annoying optimizations and special case processing imposed by the programming for packed SIMD + GPU, so a decent vector ISA would be a clear improvement. Thus I completely agree with Krste Asanovic.

   
Proposal for an ideal extensible instruction set
Author: Ook Date: 2016-01-05 17:00
Adrian Bocaniciu wrote:
This has the advantage that it simplifies considerably the code that must deal with the initial or final parts of the arrays whose sizes are not a multiple of the size of a vector register or which do not have a correct alignment.

For AVX-512 and longer vector registers it is likely that the code dealing with all the possible cases of lengths and alignments will be much larger than the code for the main loop, and it will increase in geometric progression for each new maximum vector length.

With a vector length register, this extra code will be much smaller and its size will remain unchanged when the size of the vector registers will be increased.

AVX-512 also adds masking, which is supposed to keep the prolog/epilog overhead small. I have no hard data on whether this will work out.
Another approach to get rid of this overhead is the software pipelining variant described by Ivan Godard in one of his Mill talks.
   
Proposal for an ideal extensible instruction set
Author: acppcoder Date: 2016-03-27 00:39
The rationale behind the RISC-V instruction set:

>>"There are no predicated or masked instructions"

This is to simplify out-of-order & superscalar designs. This is also why there are not even condition codes; branches make their own comparison (BLT src1, src2, dst, etc.).

The dependencies in the instruction are all expressed directly in fixed locations (registers), allowing these to be reasoned about very early in the pipeline, and without having to track additional state in the reorder buffer. Predication/CMOV etc. introduce additional dependencies.

(I was disappointed there's no 'Select', but given everything else, it's OK; the proposed vector extension handles the case where you'd want it.)

>>"Arithmetic instructions cannot have memory source operands?"
>>"Immediate constants have odd sizes. It is not possible to include floating point immediates, which I argue would be more efficient than loading floating point constants from data memory"
>>"128-bit integers are not supported, except as pointers in 128-bit address mode"

This is all a consequence of the simple RISC predominantly load-store, fixed-length 32-bit instruction idea, which simplifies pipelining: the instruction decoder is trivial.
(There is a compressed 16-bit instruction capability, but it's optional.)

The classic RISC design makes it possible to produce a very simple implementation. RISC-V is designed to scale from (i) very small embedded cores to (ii) large high-performance cores, or (iii) high-throughput (manycore) accelerators (the latter two being different cases). The RISC instruction set principles have proven to work in all contexts.

The choice of a pure RISC is justified as follows: in the past few decades, CISC has only ever been pursued for backwards compatibility with x86. Most attempts to produce something else from a clean slate adopted RISC ideals, and the only other processors that became popular were RISCs.
Going all the way back to the simple case, I recall that ARM's first implementation was a 28,000-transistor pipelined 32-bit design, whilst the contemporary 68000 was literally 68,000 transistors for a non-pipelined, 16-bit version; it's proven that RISC can scale all the way down. Even today, with huge transistor counts, there is the possibility of building a manycore dataflow processor; Adapteva are probably pursuing this, given their previous product and their stated intention to use RISC-V next. The idea is to cram as many cores as possible, with local memories, onto a die. Another important use case is as the basis for accelerators: you just need *something* simple & complete to handle basic computation as a basis for some custom unit revolving around some extension.

>>"Support for vectors is not well developed. Vector size is limited to 1024 bits"
>>"Software has to be recompiled each time a different processor with different maximum vector size becomes available
"
>>"There is no support for integer vectors, Boolean vectors, masked vector operations, broadcast, etc."

Take a look at the Hwacha vector unit: it definitely decouples vector size from the ISA, and does have predication and so on. But it's deliberately not part of the basic standard; I think they are still finalising it? A high-throughput accelerator can always be built by going manycore with attached vector units.

I don't think broadcast suits them because it's designed for the vector 'lanes' to be completely independent for hiding latencies: it's more comparable to GPGPU (and the old Cray vector machines) than to x86-style SIMD.

Still, after all that, maybe someone else will eventually make a contrasting 'classic' 4-element SIMD unit extension that might suit 3D maths better (with permutes for accessing x/y/z/w, a dot product across the lanes, etc.). But from what I can gather, the Hwacha unit is easier to compile general-purpose code for.

>>"long int can be 32 or 64 bits. There is no standardized way of specifying 64-bit and 128-bit integers. This inconsistency is causing annoying compatibility problems today which need to be fixed in any new ABI.
"
I think they have an unusual intention to scale to 128-bit for future data centres, although I don't know what they'd do about that.

All in all I'm a fan of the RISC-V decisions. The whole thing reminds me of MIPS, which was a great ISA.

I do like the idea of a unified register file, but at some point, when creating a standard, you have to draw a line. The advantages of separate files are in instruction length (each register field only needs 5 bits rather than 6), and of course implementations can physically move each register file closer to its respective execution units.

   
Proposal for an ideal extensible instruction set
Author: Jake Stine Date: 2016-01-11 17:48
On Scalar vs. SIMD register banks:

I think that it is actually better to have separate register banks for scalar vs. SIMD registers. There is strong evidence that the underlying hardware is better off when it manages scalar/SIMD register banks independently, and most newer ISAs have been designed with this goal in mind -- if nothing else, it greatly minimizes the effort needed to avoid false dependencies during scalar operations. Unless a unified register bank is seen to make things easier on the programmer, there seems little motive to prefer one. From a compiler author's standpoint there is no advantage to a unified register bank compared to separate scalar/vector register banks. In some ways the separate register banks make things easier (less dependency tracking and less clever register resource guessing required), though a unified bank can make an ABI simpler.

Secondly, separate register banks can allow for more total registers and/or shorter instructions. 32 scalar + 32 SIMD registers = 64 total, without needing a sixth bit in each register field.

In practice it is the exception when SIMD algorithms need to load values to/from scalar registers. The most common scenario is a 32-bit integer broadcast, and typically these can be setup outside of a loop. The most common reason to load values from SIMD to scalar is to obtain a mask result, which is then compared to some bitmask on the ALU. Having a "mask-move + imm32 test" integrated directly into the SIMD unit would avoid that inter-bank move.
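For reference, this is what that SIMD-to-scalar mask move looks like with today's x86 intrinsics (a small C++ sketch); the fused "mask-move + imm32 test" Jake suggests would remove the inter-bank transfer.

    #include <xmmintrin.h>

    // Returns nonzero if any lane of v is negative. The comparison happens in
    // the SIMD unit, but the mask must move to a scalar register for the test.
    int any_negative(__m128 v) {
        __m128 mask = _mm_cmplt_ps(v, _mm_setzero_ps());  // per-lane compare
        return _mm_movemask_ps(mask) != 0;                // SIMD -> scalar move + ALU test
    }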


Regarding 512+ Bit SIMD:

I see no value in trying to design a CPU ISA that can support 512 bits. I feel as though it is a waste of CPU ISA design resources. Any algorithm that can go that wide is almost certainly better suited to a GPU ISA, where workgroups of 32 or 64 SIMD threads are fired off at once (with each thread being 128- or 256-bit SIMD). There is no foreseeable future where a GPU ISA is not available on a device that has a high-end SIMD 256+ capable CPU. The vast majority of data being processed is sets of 2 or 4 single- or double-precision values. In all but the most obscure situations, it makes more sense to think in terms of many threads of 128-bit or 256-bit data than to try to scale a single thread up to 512 bits. This is easier for the compiler, hardware, and programmer, and it's a no-contest in terms of performance. Does anyone really think that 512-bit AVX will come anywhere close to the SIMD throughput achieved by even modest integrated GPUs? I recommend keeping the CPU ISA simpler and instead focusing some effort on tearing down the remaining barriers to utilizing GPUs for very wide SIMD. Or using FPGAs.

... of course Intel would never agree to such a design, since "more bits AVX++!" is their best bet to market new CPUs still. -_-

-Jake

   
Proposal for an ideal extensible instruction set
Author: Agner Date: 2016-01-12 07:18
Thank you for all the useful comments, ideas and references. This gives me reason to discuss several possible improvements:

Vector length register

The idea of using an extra register to specify the length of a vector is excellent. This makes it possible to change the length of a vector at run time and to use a length that is not a power of 2. We can still have a 3-bit vector length field in the instruction code, but the meaning will be as follows. A value of 0 means a scalar. A value of 1 - 7 means that the length is specified by one of the registers r9 - r15. It is the responsibility of the compiler (or assembly programmer) to make sure that the specified length does not exceed the maximum length supported by the actual CPU. This will make the hardware a little more complicated, but it will be a great advantage to the software.

The common system with fixed vector sizes has a big problem when vectorizing a loop through an array. Often, the programmer or compiler does not know in advance whether the array size is a multiple of the vector size or not. Therefore, it is necessary to make extra "remainder" code in the end to handle any remaining array elements that don't fit into the vector registers. Another big problem is that there must be different versions of the code for microprocessors with different maximum vector sizes if you want optimal performance.

A system that allows variable vector sizes can use the maximum vector size supported by the microprocessor for all but the last iteration of the loop and then use a smaller vector size for the last iteration if necessary. The performance of such a loop can be further improved if we make a special instruction that finds the vector size as the smallest of the two values: the remaining number of array elements and the maximum vector size. This is easier to implement in software than masking off the rest of the vector.
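As a sketch of such a loop in C++, with setvl standing in for the proposed min-of-remainder-and-maximum instruction and MAXVL as an arbitrary stand-in for the hardware limit:

    const long MAXVL = 8;                        // assumed hardware maximum, illustration only
    long setvl(long n) { return n < MAXVL ? n : MAXVL; }   // models the proposed instruction

    void vadd(double *c, const double *a, const double *b, long n) {
        while (n > 0) {
            long vl = setvl(n);                  // any length 1..MAXVL, not just powers of 2
            for (long i = 0; i < vl; i++)        // stands in for one vector instruction of length vl
                c[i] = a[i] + b[i];
            a += vl; b += vl; c += vl; n -= vl;  // no separate remainder loop needed
        }
    }

The same binary would run unchanged on a processor with a larger maximum vector size; only the number of loop iterations changes.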

It will also be possible to make library functions that have vectors of variable length as parameters and result.

It will not be too complicated to implement a variable vector size in hardware. The hardware can simply mask off the unused part of the vector register. There may be a problem with power consumption. A processor can save power when handling a vector of less-than-maximum size by either clock gating the unused part or by turning off power to the unused circuits. If the vector size is specified by an extra register then the information about the actual size of the vector will be available at a rather late stage in the pipeline and the CPU will have less time to adjust its power consumption.

Another problem is the extra dependencies. An instruction can have up to three input operands, a mask register and a vector length register. This gives a total of five input dependencies. The out-of-order scheduling system becomes more complicated if it has to handle this many dependencies.

Nevertheless, I am sure that the gains in software efficiency more than outweigh the hardware costs of supporting a variable vector length.

Alignment of vectors

It is easier for the hardware to write and read vectors to/from memory if the vector size is a power of 2 and the memory operand is aligned to an address divisible by the size of the vector. Various systems have different requirements for the alignment of vectors in memory. The requirements for alignment make the software more complicated. The stack is always the preferred place to store local data in a function. There are two common ways of aligning data on a stack to a value higher than the stack word size:

  1. Keep the stack aligned by the required size and propagate this alignment through the chain of function calls. A function must insert unused space on the stack before calling another function, if needed, in order to keep the stack aligned. The problems with this method are: it wastes stack space; it requires extra instructions in functions even when the alignment is not actually used; it does not take into account future systems with bigger vectors requiring higher alignment; and it may fail if different parts of the code have been compiled for different alignments.
  2. Adjust the stack frame to the required alignment inside any function that needs aligned storage. This requires more instructions than method 1, but only when alignment is actually needed. A further disadvantage is that it uses an extra register for addressing the aligned stack frame or for saving the stack pointer.

Both methods are in use. The current x86-64 systems use method 1 for 16-byte alignment and method 2 for 32-byte alignment. Method 2 is probably preferable in a system with plenty of registers.
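Method 2 is essentially what today's compilers emit for an over-aligned local variable; a minimal C++ example (the buffer and the 32-byte figure are arbitrary):

    void f() {
        alignas(32) double buf[4];   // the compiler rounds the stack pointer down to a
                                     // multiple of 32 and keeps the old value in a spare
                                     // register (typically the frame pointer)
        buf[0] = 1.0;                // addressed relative to the aligned frame
    }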

A third possibility is to modify the hardware so that no alignment is required. Current state-of-the-art microprocessors have no performance penalty for misaligned memory operands, except when a cache line boundary is crossed. I am not in a position to weigh the hardware costs of handling unaligned memory operands against the software costs of aligning data, but this would be the optimal solution if the hardware costs are not too high.

Whatever solution we end up with, there are certain things that should preferably be coordinated: the alignment of memory operands, the stack alignment, the minimum vector size that must be supported, and the required alignment for arrays. If the hardware can support vectors aligned by 8 bytes efficiently, then we may adopt the rule that vectors in memory must be aligned by 8, and that all arrays containing at least 8 bytes must be aligned by 8 so that they can be handled in vector registers.

Should vectors and scalars use the same register set?

Two commentators have argued that it is better to have separate registers for scalars and vectors, rather than just one register set as I first proposed. The scalar registers can be saved in a way that is sure to be compatible with future extensions because their size is fixed. This will make it possible to have rules for callee-save registers, i.e. registers that a function must save and restore if they are used. For example, we may decide that scalar registers r24-r31 have callee-save status. Such registers are useful for saving data across a function call. Vector registers cannot have callee-save status because it is difficult to save a vector register in a way that is compatible with future extensions of the vector size. Instead, we will use information stored in object files and static libraries in order to know which vector registers are modified by a function as explained in my first post.

While we may split the registers into vector registers and scalar registers, I don't want a further split into integer and floating point registers. In the case of vectors, there are many instructions that can be applied equally well to integer and floating point data, such as read, write, move, broadcast, blend, permute, gather and scatter. Masks generated with integer instructions can be applied to floating point data. Integer instructions can be used for manipulating floating point values, e.g. manipulating the sign bit, manipulating NAN and INF values, splitting a floating point value into exponent and mantissa, etc.
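Two of these integer manipulations of floating point values, written out in C++ for concreteness (standard bit-casting via memcpy):

    #include <cstdint>
    #include <cstring>

    double abs_by_bits(double x) {          // clear the sign bit with an integer AND
        uint64_t u; std::memcpy(&u, &x, 8);
        u &= ~(1ULL << 63);
        std::memcpy(&x, &u, 8);
        return x;
    }

    int exponent_field(double x) {          // split out the biased exponent
        uint64_t u; std::memcpy(&u, &x, 8);
        return int((u >> 52) & 0x7FF);      // 0x7FF also identifies NAN and INF
    }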

The same instructions can still be used for both scalars and vectors. A zero in the vector length field of an instruction can indicate that a scalar register is used instead of a vector register. A vector register can still be used for scalar operations if the vector length register contains the value 1. If the same instruction can be used with vector registers and scalar registers, then it follows that the scalar registers can handle both integer and floating point values, just as the vector registers. The only exception is that 128-bit integers and quadruple precision floating point numbers, if supported, can only be handled in vector registers.

The predicate/mask field can indicate either a predicate in a scalar register or a mask in a vector register, depending on whether the vector length field indicates a scalar or a vector. The vector length register is always a scalar register. Instructions that cannot be used with vectors do not need a vector length field.

Function calling conventions

The function calling convention and the order of parameters on the stack need further discussion. First, we need to decide if the stack should grow downwards, as is common today, or upwards, which would be more logical. The tradition of making the stack grow downwards originates from non-protected operating systems where global data and heap grow up from the bottom of the RAM space while the stack grows downwards from the top, until the two meet and the memory is full. This is less relevant in modern systems with virtual memory addresses and multiple threads, each with its own stack. I have no compelling reason to prefer the stack to grow upwards or downwards, so let's defer this question. For now, I will assume that the stack grows downwards.

I have proposed that parameters on the stack should be stored in C order. This means that the first parameters are stored at the lowest addresses on a downwards-growing stack. If parameters are stored with push operations, then the last parameter must be pushed first when C order is used. Today, it is common to allocate stack space for the parameters first, and then store the parameters into this stack space, rather than using push instructions. The opposite order, which I will call Pascal order for lack of a better term, has the first parameter pushed first. Pascal order is more logical for assembly programmers if push instructions are used, while C order is more logical if the stack frame method is used because it has the first parameters at the lowest address.

The C parameter order was invented in order to facilitate function calls where the number of parameters was not specified, and in particular for functions with varargs, i.e. a variable number of parameters. I want to propose a completely different solution for varargs. Instead of putting varargs parameters on the stack or in registers, I propose to put them in a list which can be stored anywhere in memory, and only transfer a pointer to this list as a parameter. This has several advantages: No stack space is needed if the pointer can be transferred in a register. The length of the parameter list can be modified at run time. The parameter list can be reused for multiple function calls. And a function can easily forward its parameter list to another function.
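Modelled in portable C++ (the names are mine, purely illustrative), the proposal amounts to passing a pointer to a structure like this instead of pushing variadic arguments:

    #include <cstddef>

    struct ArgList {                     // can live anywhere in memory
        std::size_t count;               // length is modifiable at run time
        const double *args;
    };

    double sum(const ArgList *a) {
        double s = 0;
        for (std::size_t i = 0; i < a->count; i++) s += a->args[i];
        return s;
    }

    double average(const ArgList *a) {   // forwarding the whole list costs one register
        return a->count ? sum(a) / a->count : 0;
    }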

If we allow the use of the first 24 scalar registers and the first 24 vector registers for function parameters, then we can have a function call with up to 48 parameters without saving any parameters on the stack. Varargs lists will not use the stack either if a pointer to a list is used. Now the order of parameters on the stack certainly becomes less important. Nobody will use assembly language to call a function with so many parameters, and few compilers will use push instructions, so the argument about the order of push instructions is irrelevant. However, there are still two arguments for using the C order: If the stack grows downwards then the C order will have the first parameter at the lowest address, which is more logical for both caller and callee. The second argument is that the first parameter will be closest to the address pointed to by the stack pointer. If the caller and callee disagree on the number of parameters because they are relying on different versions of a function prototype, or whatever, then the C order will make errors in the last parameters only, while the Pascal order will have errors in all the stack parameters. The C order will therefore make it easier to locate such an error.

Another issue is whether the caller or the callee should clear the stack. I will argue that it is safer to let the caller clear the stack. If the callee clears the stack and the caller and callee happen to disagree on the amount of stack space used, then the stack will be corrupted and the system is likely to crash. But the system is not guaranteed to crash in this case. All kinds of unpredictable and disastrous things can happen. It is possible, for example, that a function pointer has been stored on a stack space where the caller expects to find a saved return address after the stack has been corrupted. This will cause it to continue in the pointed-to function and the result will be unpredictable.

Efficiency is not an issue here because the stack clearing rule is relevant only when calling a function with at least 25 parameters, and possibly 49. This is a very rare situation anyway. The adjustment of the stack pointer requires only a single instruction, which is likely to execute out of order. The impact on performance is zero or at least insignificant for such a large function.

The programming language is irrelevant here. These arguments apply to all programming languages. The efficiency of tail calls is not affected by who clears the stack. A tail call will save one stack clearing in both cases.

We need a rule for parameters that do not fit into any type of register. I will propose that such parameters should be stored in memory and a pointer to the parameter should be transferred as a parameter. The function is allowed to modify the data that the pointer points to, unless the parameter is declared const. It is the responsibility of the caller to call any copy constructor or destructor of the parameter. It is an open issue whether a function should be allowed to modify a parameter in a varargs list.

Link register

Some systems store the return address of a function on the stack, while other systems use a link register to hold the return address. The advantage of a link register is that a leaf function can be called without storing anything on the stack. This saves cache bandwidth in programs with many leaf function calls. The disadvantage is that every non-leaf function needs to save the link register on the stack before calling another function and restore the link register before returning.

RISC-V specifies that the link register is one of the general purpose registers. I will argue that if a link register is used, then it should be a special register. A link register does not need to support all the things that a general purpose register can do. If the link register is included as a general purpose register, then it will be tempting for a programmer to save it to another general purpose register rather than the stack, and then end the function by jumping to that general purpose register. This will work, of course, but it will interfere with the way returns are predicted. Modern microprocessors use a special mechanism for predicting returns, which is different from the mechanism used for predicting other jumps and branches. This mechanism, called a return stack buffer, is a small rolling cache that remembers the addresses of the last calls. If a function returns by a jump to a register other than the link register, then it will use the wrong prediction mechanism, and this will cause severe delays due to misprediction of the subsequent series of returns.

The only instructions that are needed for the link register other than call and return, are push and pop. We can reduce the number of instructions in non-leaf functions by making a combined instruction for "push link register and then call a function" which can be used for the first function call in a non-leaf function, and another instruction for "pop link register and then return" to end a non-leaf function.
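A sketch of how the two combined instructions would shape the code (C++ on the left, hypothetical mnemonics in the comments):

    int leaf(int x) {                //  leaf:  add r0, r0, 1
        return x + 1;                //         return             ; address stays in the link register
    }
    int outer(int x) {               //  outer: push_call leaf     ; push link register, then call
        return leaf(x) + 2;          //         add r0, r0, 2
    }                                //         pop_return         ; pop link register, then return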

Tail calls will be equally efficient with and without a link register.

High level language support for vectors

C/C++ compilers often have support for vector registers. This includes system-specific types, e.g. __m256 for a 256-bit vector of 8 single-precision floats. If we want high-level language support for a system with variable length vector registers then we must have a system-specific addition to the programming language that defines such vectors, e.g. __vector<float,8> for a vector register of 8 floats. These types can be used as function parameters and function returns. The definition of a function with a vector parameter of a certain size implies that the CPU must support vectors of the specified size. Therefore, it is possible to transfer such a vector parameter in a single register and return such a vector in a single register regardless of its size.
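A minimal C++ model of such a type (the __vector name follows the suggestion above; a real implementation would be built into the compiler and map to a single register rather than a struct):

    template<typename T, int N>
    struct __vector { T elem[N]; };        // placeholder for the built-in type

    __vector<float, 8> scale(__vector<float, 8> v, float s) {
        for (int i = 0; i < 8; i++)
            v.elem[i] *= s;                // stands in for one vector multiply
        return v;                          // passed and returned in a single register
    }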

It will also be useful to have a way of specifying vectors of variable size in a high-level language where the size is specified in a separate parameter.

Software pipelining

Support for software pipelining was proposed. This requires a rolling bank of registers that can be allocated to a loop. Software pipelining can improve the performance of complex loops in cases where out-of-order scheduling would otherwise be a bottleneck. Support for software pipelining would be a complicated addition to the architecture, and it would be irrelevant for the quite common cases where performance is limited by memory bandwidth, cache performance, etc. Some of the mechanisms are patented, which could be an obstacle to including it in an open standard.

I think that we should allow experimentation in this area and be prepared for a future extension of the standard with optional support for software pipelining.

Further discussion topics

RISC-V proposes a 128-bit addressing mode. I don't understand what such a huge address space can be used for. 64-bit addressing gives us more than 10^19 bytes of address space. This is more than we will need in any computer in the foreseeable future. Some have argued that 128-bit addresses can be useful in CPU clusters and clouds. But I don't think that a CPU at one node in a cluster should be allowed to directly address a RAM cell at a different node. 128-bit addresses would mean that every entry on the stack would use 16 bytes, most of which would be zero. This would be a waste of precious cache space. The system would certainly be simpler if the only allowed addressing mode is 64 bits.

McCalpin proposes a reconfigurable register space. This would add a lot of complication to the instruction set as well as to the hardware. I suspect that the hardware necessary for supporting register reconfiguration would take up more silicon space than simply allowing all the registers to have full size.

   
Proposal for an ideal extensible instruction set
Author: Jonathan Morton Date: 2016-02-02 17:19
At the moment, the biggest red flag I can see in this proposal is the use of "doubleword" to refer to 32 bits. Everywhere outside the Wintel enclave - including the ubiquitous IEEE-754 standard - "doubleword" means 64 bits, and 16 bits is referred to as "halfword". It's a Freudian slip which betrays a certain cognitive bias towards the same obsolete 16-bit architecture you're proposing to replace. Not what one wants to see when trying to look forward.

I'm also somewhat perplexed by your simultaneous assertion that a fixed-size RISC instruction word causes low code density, and the specification of a *minimum* instruction word size that is the same as a typical RISC instruction word size. The only efficiency you gain is the ability to use large immediate operands in a single instruction, but modern RISC ISAs (Alpha, AArch64, PowerPC) can already build an arbitrary 32-bit immediate in 2 instructions (64 bits), limiting your code density advantage to immediate operands larger than 32 bits. These are not so common.

I'll admit that combined load-arithmetic instructions can improve code density, but that comes at the expense of a more complicated front-end in hardware (or, for traditional in-order CISC, a more complicated pipeline) and has absolutely nothing to do with instruction word size. You do also gain a certain future-proofing flexibility by allowing longer instruction formats, but that has absolutely nothing to do with code density.

With that said, I have a different proposal for handling vectors, which I think is closer to the original Cray model. In this model, there are no architectural "vector registers", only scalars and "pipeline slots". Conceptually, the machine appears to repeat instructions a given number of times on successive data elements, but without an explicit branch instruction, similarly to the x86 "string" instructions.

The cleanest way to specify this I can think of is, oddly, similar to the x87 stack model. Now, x87 was a horrible model for high-performance arithmetic, because each instruction could do only one operation, it was impossible to specify software pipelining (executing more than one complex expression in parallel), and it was therefore hard to extract ILP at runtime. But it did allow specifying a single expression compactly and without explicit reference to register names. Substitute Forth as a mental model if you prefer.

Vector instructions would thus act as if on scalar values, with explicit load, store and pointer-update operations, referring to this stack for their virtual input and output operands. Instead of executing immediately, they would be stored in a buffer, and decoded into a pipeline of operations, with the expectation of operating this entire pipeline in multiple-parallel at maximum performance. The pipeline would implicitly be complete when the operand stack became empty; attempting to execute a pipeline with an imbalanced stack would be a trap error.

The complete pipeline would then be executed by loading the initial values into the relevant scalar registers (which were specified using "input" instructions), followed by a count value into a special-purpose register. The normal instruction flow would also continue, but a pipeline-wait instruction would prove useful.

Interrupts, including page faults, would not inherently disrupt this pipeline building or execution process, and would be able to use the scalar registers independently. It would be necessary to halt, save, restore and resume the pipeline state (whether empty, in the process of being built, complete, or executing) for context switching and page-fault handling.

The great advantage of this system is that the program need be aware of neither the number of operations the CPU can perform in parallel (which was an inherent flaw in Itanium, as well as with block-SIMD), nor the alignment requirements of the memory system (bar those of the individual data elements), without even needing to query them at runtime. An austere implementation could be entirely serial, operating like a standard for-loop over an x87, within a physical register set barely larger than that required to support the architectural scalar registers. A high-performance implementation might, in extreme cases, farm the pipeline out to something like a GPU - an idea which would certainly prick AMD's ears up.

   
Proposal for an ideal extensible instruction set
Author: Agner Date: 2016-02-03 01:36
Thanks for your input.

I have no problem with using "word" to mean 32 bits and "halfword" for 16 bits.

Regarding code density. The problems with 16-bit instructions are many. You can't have 3-register instructions. You don't have space for specifying operand size and type, vector size, predicate or mask, rounding mode, exception handling, and all the other features that may be needed in the future. And quite importantly: memory address offsets and immediate constants have odd sizes in a 16-bit coding scheme. This causes problems in linkers and loaders when the offsets overflow, and it causes problems for the high-level language programmer who may not know whether a constant will fit into an instruction.

With a 16-bit minimum instruction word size, you will waste more bits on specifying the instruction size, and instruction decoding will be a bottleneck like it is in x86. The nice thing about my proposal of 32-bit instruction words is that it allows a completely orthogonal instruction set. Any instruction can be specified with a register operand, a memory operand or an immediate operand - all with the same size, so that you can be certain that any value will fit into any of these, while small values can still fit into smaller instructions to save code cache space.

If code density is important, then I can suggest a compromise. Allow two tiny instructions to fit into a 32-bit code word. The first 4 bits of the 32-bit word indicate that this is a double instruction, followed by two tiny instructions of 14 bits each. These tiny instructions obviously don't need any bits for specifying instruction size. They can be used for the most common simple instructions with one or two registers. A disadvantage is that you cannot jump to the second instruction of such a pair of tiny instructions. All jump offsets are still scaled by the standard instruction word size of 4 bytes.
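A C++ sketch of the proposed code word (field widths from the paragraph above; the exact bit positions are my assumption):

    #include <cstdint>

    struct TinyPair {                  // one 32-bit code word holding two tiny instructions
        uint32_t marker : 4;           // identifies the word as a double instruction
        uint32_t insn0  : 14;          // first tiny instruction
        uint32_t insn1  : 14;          // second tiny instruction
    };
    static_assert(sizeof(TinyPair) == 4, "two tiny instructions fit one code word");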

Regarding your proposal of pipelined "string" instructions. They will have to either include memory addresses and work on the level-1 cache or use a register stack of fixed size. If your code has multiple accumulators or vectors then you need multiple register stacks. This sounds quite complicated to me. I am not sure I understand your idea.

   
Proposal for an ideal extensible instruction set
Author: Jonathan Morton Date: 2016-02-12 06:25
Orthogonality is a good argument - I just thought it odd to discuss code density so much, and then do very little to improve it over, say, PowerPC. Whatever faults ARM may have, it does have far better code density than PowerPC - *even if* you ignore the 16-bit compressed format it supports. Some of this has to come from its support of, for example, a conditional-shifted-add in one instruction, where PowerPC must use three: either conditional-branch, shift, add; or shift, add, select.

The pipelined-string idea is a bit new and strange, that's true. I'll try to explain it a bit more.

There is only one stack, and it is only used at pipeline-setup time. It's really a stack of pointers into the vector register bank, which exists only as rename registers and is not directly accessible (which is why user programs won't need to know how big it is). The pipeline-setup engine essentially converts the pipeline instructions into an SSA basic block - in hardware - and stores the resulting uops (or whatever) in an internal buffer. Because this is only done at setup time, it can be relatively slow - one to three cycles per instruction, say - with the goal of maximum execution speed subsequently. The stack can thus be as large as can be justified, and will certainly be smaller than the actual register bank.

Memory addresses are collected from the scalar register bank at pipeline-execution time, and auto-increment versions of memory access instructions would be provided in the pipeline, inherently providing unambiguous prefetch hints. The same can be done with scalar arguments to the vector operation. If the updated version of auto-incremented scalars/addresses is defined to be written back to the scalar register bank after pipeline execution is complete, this might simplify restarting the pipeline after an interruption. If intermediate values in the pipeline can *also* be used as memory addresses, this would facilitate fast scatter/gather operations, which is a major weak spot of present SIMD architectures.

Complex pipelines will often need to use common-subexpressions and so on. The standard way to deal with this in a stack architecture is "duplicate" and "exchange" instructions, which should be familiar from x87. Because these would be handled entirely by renaming registers, which is done at setup time only, this would not affect throughput as it does with x87.

Obviously there are many details glossed over, but hopefully the gist of it is now clear.

   
Proposal for an ideal extensible instruction set
Author: Hubert Lamontagne Date: 2016-02-18 19:32
Hmm, I think this is definitely a CISC, and not a RISC-CISC compromise. It does have the one good-but-kinda-expensive feature of CISC: Load-ALU operations.

Some criticism of the proposal:

Instruction format:
For the instruction size, I agree with Jonathan Morton: the second most useful instruction size after 32bits is probably 16bits, because most code uses mostly instructions that would fit in 16bits, resulting in a size gain of theoretically up to 50%. This is why so much 32bit ARM code is compiled in THUMB mode: the same code typically runs 0%-15% faster because the smaller instruction stream compensates for the extra operations needed to cope with the small opcodes (nb: this has probably changed on newer cores!). Instruction sizes larger than 32bits are not that useful because immediates larger than 16bits are rare, and if you have enough registers, you can load oversized immediates beforehand, and remaining operations with large immediates can be decomposed into 2 or 3 operations (such as add r0, 0x3423; add r0, 0x6543 * 65536). Also, large immediates tend to be multiples of 2/4/8/16/32/etc.., which is why ARM's scheme of taking smaller immediates and bitshifting them works. The other alternative to 32/16bit mixed instruction size for reducing code cache size is cramming 2 or more operations per 32bit opcode (which might actually be a good idea!).

I'm also not sold on load-multiple operations. The reason for this is that operations that write multiple results are generally bad. From the point of view of the register renamer and out-of-order execution engine, that's 2..N renames to keep track of, and 2..N writebacks to the register file. This means that instruction issue has to stall because subsequent instructions have to wait for all these registers to get renamed. You lose the simplicity of having each instruction be single-result only. That being said, ARM64's compromise of allowing a 2-target load-pair but not more sounds acceptable to me (since ARM64 has to deal with other multi-result instructions anyways).


Registers:
As far as I can tell, separate register files are GOOD. The reason for this is that as you add more read and write ports to a register file, its size grows quadratically (or worse). This makes CPU components larger, which increases propagation time and fanout, and multiplies the complexity of the register renamer. If you give floating point operands their own register file, then aside from load/store, compare and conversion operations, the FPU never has to interact with the rest of the core. So for the same amount of IPC, say 2 integer + 2 float per cycle, separating float operations means you go from a monstrous 8-read 4-write register file and renaming mechanism, where both integer ALUs and FP ALUs have to be wired everywhere, to a 2-issue integer unit and a 2-issue FPU. The FPU can have its own register renaming unit, its own scheduler, its own register file, its own writeback unit, its own calculation latencies, and FPU ALUs can be directly wired to the registers, and the whole FPU can live on a different section of the chip. The front end can simply recognize which ops are FPU ops and queue them there. The same applies to SIMD.

The reason why the integer register file of the CPU isn't also split is that integer operations have a lot of interactions with each other and with loads/stores and jumps, and getting C++ compilers to recognize which partition to put every op/result in quickly turns into an NP-complete problem. The exception to this is Ivan Godard's Mill's belt, which in a 4 ALU design forces each ALU to only write to a different 1/4th of the registers. It might be possible to make a good case for the 68000's idea of separating pointers into a different register file - after all, the C++ compiler knows which operands are pointers. Yes, this increases the number of opcodes (bad), but it decreases the amount of operations competing for the same register ports and ALUs for some given workload (good).

For the whole operand size thing, aside from SIMD and loads/stores there's no reason to have 8- or 16-bit operations. Conversely, 64-bit operations where one of the operands is 32 bits and gets sign-extended or zero-extended to 64 bits are justifiable (ARM64 has them), including in address computations (there's tons of C++ code that does something like array[int index]). Some 32-bit operations can be done in 64 bits while ignoring the top bits (add, sub, mul, and, or, xor, shl) but not all (shr, asr, comparisons).


Flags:
I think add-with-carry and subtract-with-borrow are unnecessary because they can be faked with 3 simple operations: add X with Y, compare the sum with X and output 1 if lower and 0 otherwise (SLTU on MIPS), and add the carry from the previous limb to the sum. ADC and SBC operations are problematic because they're really 3-input 2-output operations (bad), which means they'll probably have to be broken down into 2 micro-ops, so you'll probably see little gain over the 3-instruction sequence.
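
Roughly, in C++ terms, the sequence looks like this (a sketch with illustrative names; note that a carry out of the final add is not captured here):

#include <cstdint>

// One limb of a bignum addition without a flags register:
// add, SLTU, then add the previous limb's carry.
uint64_t add_limbs(uint64_t x, uint64_t y, uint64_t carry_in, uint64_t& carry_out)
{
    uint64_t sum = x + y;            // add X with Y (wraps modulo 2^64)
    carry_out = (sum < x) ? 1 : 0;   // SLTU: the add wrapped iff sum < x
    return sum + carry_in;           // add the carry from the previous limb
}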


Predicate registers:
I'm definitely not sold on the whole predicate thing. As far as I can tell, compilers really don't like issuing conditionals as anything other than conditional branches. Also, if the conditionals can be accurately predicted, then conditional branches are faster because you only execute one side of the branch, and operations downstream can get their inputs earlier (through register renaming, instead of waiting for the predicated instruction results). For the remaining cases, a separate CMOV instruction sounds a lot more justifiable to me than spending 3 bits in every single opcode. Also, remember that predicated operations (3-input) are fundamentally different operations from non-predicated operations (2-input), since the old value can be propagated instead of simply being tossed, so it needs to be present as an ALU input.


Rounding mode:
I agree that float-to-integer conversions must support at least truncation for the C++ (int) cast, plus floor() and ceil() and round(). Actually it would probably be useful to have an opcode that does floor() or ceil() without the integer conversion as well (for linear interpolation).
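
For example, a floor() that stays in floating point makes table-based linear interpolation cheap; a sketch in C++ (illustrative function, not from the proposal):

#include <cmath>

float lerp_lookup(const float* table, float x)
{
    float xf = std::floor(x);    // floor with a float result, no int round trip
    float frac = x - xf;         // fractional position between two entries
    int i = (int)xf;             // the only conversion needed is for indexing
    return table[i] + frac * (table[i + 1] - table[i]);
}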


Exception control:
My point of view on exceptions is that they're generally bad, since they can basically turn any ALU operation into a potential conditional jump. This forces you to keep the CPU state at the moment of that operation until you're sure that the ALU operation went the right way. Also, they are useless for running C++ code. (Ok, to be fair, you already need to deal with potential page faults on every single load/store, so it's not really that much more work, but this is definitely not the kind of thing I'd want to encourage.)


Zero masked data:
The reason I can see for not putting this one in is that non-zeroing (merge-masked) ALU operations are actually very different operations and rather complex, since they prevent register renaming and force you to implement 3-input result-merging versions of basically every ALU operation (similar to the predicated operations above). These result-merging operations will probably see little use aside from manipulation tricks in hand-written assembly.


FPGA:
This is probably an okay idea for platforms like game consoles that essentially run a single program. It reminds me of the DSP on systems like the N64, whose microcode you could rewrite (which only Factor 5 ever did, if I'm not mistaken). But otherwise, I think this is essentially impossible to task-switch: you'd need the whole FPGA state to be loadable/storable, which would probably make performance pretty bad.


Ok, my turn with suggestions!

2 x ALU Instructions (sequential!):
Your 3rd operand can either be an immediate, a register value or a memory-loaded value. I suggest adding a 4th option: letting this 3rd operand be the result of a simple 2-operand math operation (i.e. something like add, sub, and, or, xor, bitshifts, maybe mul...), potentially with a small immediate as its 2nd operand. ARM already has something similar to this (except the only operation you can do is a bitshift), multiply-accumulate is also similar, and load-ALU operations can be seen as a version of this. This is a very common sequence: a LOT of code is made of 2-7 successive ALU operations on the same operand. The cost is that this is a 2-cycle latency operation reading 3 register ports. The benefit is that you're getting 2 instructions for the price of 1: you can squeeze it into a 32-bit opcode (= increased code density without resorting to 16bit opcodes), it's only 1 instruction for the front-end, and it saves 1 register read over the equivalent 2-instruction RISC sequence (3 reads instead of 4). But most importantly, this saves 1 register WRITE, which lets you reduce the number of register renames for a given block of operations and reduce the number of write ports on your register file.
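
As an illustration of how common such pairs are, here is a MurmurHash3-style mixing round in C++, where nearly every line is a shift or multiply feeding straight into another simple ALU op - each line is a candidate for one fused 2-in-1 instruction:

#include <cstdint>

uint32_t mix(uint32_t h, uint32_t k)
{
    k *= 0xcc9e2d51u;
    k = (k << 15) | (k >> 17);   // rotate: two simple ALU ops, one result
    k *= 0x1b873593u;
    h ^= k;
    h = (h << 13) | (h >> 19);
    h = h * 5u + 0xe6546b64u;    // multiply feeding an add
    return h;
}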

Software pipelining:
For software pipelining, I actually have an interesting design to propose, which is similar to Jonathan Morton's proposal but doesn't use a stack. I like to call it "SIME" (single instruction multiple execution). For an 8*32bit SIME unit, you'd build it as 8 simple MIPS-like cores, each in-order at 1 instruction per cycle. But only the first core has a front-end: once the first core executes an instruction, it queues the same instruction to the second core (which will in turn queue it to the 3rd core, etc.). For values involving feedback (for instance an accumulator), you also have a data queue going from the 1st core to the 2nd, from the 2nd core to the 3rd, and so forth, with a queue from the 8th core back to the 1st to provide looping. ALU operands can come either from registers or from the queue from the previous core, and likewise results can be queued to the next core in addition to being written to the register file. All load/store operations are inherently gather/scatter operations since they execute sequentially on each successive core. For conditionals, cores 2-8 check that the conditional being evaluated produces the same result as on core 1, and if it doesn't, some fallback mechanism is activated (this is generally used to do a number of iterations that isn't a multiple of 8). When an interrupt/task switch/page fault happens, the OS needs special opcodes to load/store values from the instruction queues and data queues to save/restore the state. This system could be extended for larger CPUs by making it superscalar (2-issue in-order or even out-of-order), adding SIMD on top of SIME, or adding more cores (it's probably not too hard to design it so that the number of cores can be changed without changing the instruction stream, aside from OS state loading/saving). Unfortunately I don't think C++ compilers can automatically produce the kind of loop that would run on this, because memory loads/stores are reordered (unless either pointer aliasing detection gets much better, or a Transmeta Crusoe-style load/store aliasing resolution mechanism is provided).

   
Proposal for an ideal extensible instruction set
Author: Agner Date: 2016-02-21 02:48
Thank you Hubert for your detailed feedback.

Regarding 16-bit instruction size: I don't think 16-bit is the optimal instruction word size for larger systems, and Moore's law is still making systems larger (it may slow down a little, but it hasn't stopped yet). A 16-bit instruction size means small immediate constants and small address offsets with odd sizes. Most memory addressing should be relative to the instruction pointer or the stack pointer. Using a double-size instruction (2*16 bits), you have 16 bits for address offset. This will give overflow during relocation in the linker or loader if the combined size of code + static data exceeds 32 kbytes (the offset is signed). Relocation overflows happened quite often in the old DOS days, and applications haven't become smaller since then. Most PC applications today are bigger than 32 kbytes, so you need 32 bit address offsets (or ugly memory segmenting). With 16-bit instruction words, you will need at least 5 instruction words for all instructions that access static memory. This means complicated instruction-length decoding. This is one of the reasons for my proposed compromise of a 32-bit instruction word size, and allowing two tiny instructions in a 32-bit instruction word.

Load/store multiple registers instruction: I am imagining that this instruction will be decoded into multiple micro-ops. The only purpose is to save code space.

Add with carry: Your proposal removes the carry output but not the carry input. It will be very complicated if you also remove the carry input: add A+B, generate carry out, add carry in, generate another carry out, add the two carry outs. This is 5 instructions instead of one. Add-with-carry is typically used in high precision math with long chains of add-with-carry. The latency of such a chain will be much longer if you don't have an ADC instruction. Most contemporary instruction sets have two outputs anyway: target register and flags.

Separate register files: My proposal does not have separate registers for integer and floating point - it has separate registers for scalars and vectors. Both can handle integer and floating point. Do you want to split it into four register sets: integer scalar, float scalar, integer vector, float vector? This will require a lot of cross couplings and extra instructions for converting between these. I think that I have more focus on vector instructions (SIMD) than you have. Performance-critical applications are increasingly using vector instructions because this is an efficient way to boost performance. I agree that typical non-vector code has few couplings between integer registers and floating point registers, except for address pointers. But vectorized software has more such couplings, especially for masks, but also manipulation of sign bits etc. Many instructions are the same for integer vectors and floating point vectors, as I have argued before: move, broadcast, blend, permute, gather, scatter.

Predicated instructions: I agree that conditional jumps are faster than predicated instructions if the jump is predicted correctly (but good branch prediction is very expensive in terms of hardware and power consumption). I included predication mainly for the sake of orthogonality between scalar and vector instructions. Masking is indispensable in vector code because you cannot make branches on a per-element basis. Predication is the scalar equivalent of vector masking.

8-bit and 16-bit ALU instructions: These are necessary in vector code, and so they are automatically included in scalar code as well. It would be a waste of power to use 64-bit ALU instructions for everything.

2*ALU instructions: Good idea. We already have multiply-and-add instructions. Double-add instructions and shift-and-add instructions would be quite useful as well. Most x86 compilers are actually doing all kinds of tricks with the LEA instruction (intended for address calculation) to do two or three things with one instruction.
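
For example, gcc and clang typically compile this C++ function to a single LEA doing a shift and two adds at once (the instruction in the comment is typical output, shown for illustration):

long address_calc(long base, long index)
{
    return base + index * 8 + 16;   // lea rax, [rdi + rsi*8 + 16]
}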

Exceptions: Yes, exceptions are a bad thing. They require a lot of complicated machinery in both hardware and software. We can avoid the need for floating point exceptions by propagating INF and NAN values from the point of error to the final result of a calculation. The IEEE floating point standard includes an error code in the NAN which is propagated through the calculations. (The error codes should be OR'ed when two NAN values are added. Unfortunately, many microprocessors today fail to do this and only propagate one of the two NAN codes.) It would be nice to have a mechanism for detecting errors in integer code as well, but I don't know how to do it. Most systems today generate an exception for integer division overflow, but not for overflow in integer addition and multiplication. This is illogical. We also need an efficient way of detecting whether an array index is out of bounds.

   
Proposal for an ideal extensible instruction set
Author: Hubert Lamontagne Date: 2016-02-22 13:48
Regarding 16-bit instruction size:
I admit, 16 bit instructions mostly made sense on RISCs that had to contend with having no instruction cache - in other words, ARM and SuperH. On a 32/16bit mixed size architecture, they'd exist mostly for series of arithmetic instructions operating on registers or very small immediates, and having 2-in-1 ALU operations solves this problem and has other benefits. But still, if you're going to go to the length of having a prefetch queue and barrel shifter for 4/8/12 byte variable instruction size, to me it doesn't seem like adding 2 byte increments is that much more work.

Load/store multiple registers instruction:
I guess that one depends on just how many microcoded instructions you have to deal with. To me, the #1 priority is making out-of-order C++ run as fast as possible, and setting a "single result register per instruction" limit helps a lot with this goal, because it removes a lot of degenerate cases like "4 two-result instructions in one cycle" (which means 8x register rename at the front-end and 8x register writeback at the back-end - this is bad).

Add with Carry:
It's true that if you want to do lots of BIGNUM computation, then you'll definitely want a flags register (and 64x64->128 multiplies), whereas if you're doing general purpose C++ on an out-of-order cpu, you never need flags, and multi-result instructions just aren't worth the extra trouble (which is why MIPS never had them). I have a weird suggestion here: BIGNUM computation will probably always happen on the SIMD unit, so you probably only want flags on the SIMD unit. Or perhaps you could use scalar integer registers as flags on SIMD operations designed for BIGNUMs.

Separate Register Files:
For SIMD code you could definitely have both float and integer vectors in the same register file, and a lot of instruction sets do this. In fact, you could have a register set for SIMD integer+SIMD float+scalar float (this is what ARM does) and it would be usable. I guess I was arguing specifically for not mixing scalar integers and scalar floats.

Predicated instructions / 8-bit and 16-bit ALU instructions:
I think that integer scalar and SIMD operations shouldn't be orthogonal. They don't really need to be, and they're optimized for different things (C++ compiler code and heavy out-of-order execution for integer scalar, maximum throughput at the cost of increased latency for SIMD). So predication and 8/16 bit data should probably be restricted to SIMD units.

Exceptions:
This is the kind of stuff that does pop up occasionally: MIPS and PA-RISC have versions of ADD that trigger overflow interrupts, and x86 has the BOUND instruction (but CPUs typically don't optimize for it - for instance it issues on the vector path on the Athlon and has a 6 cycle latency). The problem is that C++ doesn't use these (+ - * are expected to wrap, and the way C++ conflates arrays and pointers prevents bounds checking most of the time), and higher-level languages typically have to do fancy fallbacks (try {} catch() and so forth), which precludes something as blunt as an interrupt. BOUND is essentially racing against an easy-to-predict conditional branch (which becomes free if the CPU isn't issuing full IPC at any point in the loop).

   
Proposal for an ideal extensible instruction set
Author: Agner Date: 2016-02-23 02:12
Hubert Lamontagne wrote:
Add with Carry:
It's true that if you want to do lots of BIGNUM computation, then you'll definitely want a flags registers [...] Or perhaps you could use scalar integer registers as flags on SIMD operations designed for BIGNUMs.

Most CPUs today have a flags register which is updated at every ALU instruction, even if it is rarely used. So most ALU instructions have two result registers in current designs. The flags register also needs renaming. My proposal has fewer two-result instructions than most current designs because you only have a flags output when you explicitly ask for it, and you don't need flags for conditional jumps.

If you want to eliminate two-result instructions completely, then we have a problem with add-with-carry. I am not sure I understand your proposal. Do you want to do a bignum computation using the whole vector register as one huge integer? This would be tempting indeed, but it can't be done in a single clock cycle unless you implement some very heavy carry-look-ahead circuitry. And you will still have a two-result instruction.

Some kind of software pipelining might be the most efficient solution to the add-with-carry problem, but I am not sure we have found a software pipelining model that is sufficiently flexible and not too complicated. Maybe the vector registers can be used as software pipelines so that the data are shifted one vector position at each clock tick.

Regarding 16-bit instruction size:
Instruction length decoding is a serious bottleneck in x86 processors. We should have as few different instruction lengths as possible.

Separate Register Files:
The vector registers cannot have callee-save status because the length is variable. This is a problem if you want to reserve the scalar registers for integer instructions only and do all floating point operations in vector registers. Then you have no floating point register with callee-save status, unless you define callee-save status for part of the register only (x64 Windows does this).

Exceptions:
The x86 BOUND instruction was removed in 64-bit mode, apparently because it was inefficient and rarely used (the code byte has been reused for something else). Checking array bounds can be done quite simply with a compare-and-conditional-jump instruction, which is already included in my proposed instruction set. If you are using an unsigned compare then you don't need to check the lower bound (assuming that it is zero).
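
In C++ terms, the unsigned-compare trick looks like this (an illustrative sketch, assuming the lower bound is zero):

#include <cstdint>

bool in_bounds(int64_t index, uint64_t size)
{
    // One unsigned compare covers both bounds: a negative index
    // converts to a huge unsigned value and fails the test too.
    return (uint64_t)index < size;
}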

Checking for integer overflow is very complicated in high-level languages (see http://stackoverflow.com/questions/199333/how-to-detect-integer-overflow-in-c-c). We could make it easier by improving support for checking integer overflow in the instruction set. I have already proposed to use the predicate/mask register for flags also. We could add two more flags bits which are accumulating, i.e. the signed and unsigned overflow condition is ORed with the previous value of the flag bit. The program can then check these accumulating overflow flags after a series of instructions. The compiler may use multiple registers for overflow flags in cases where the extra dependencies would prevent out-of-order execution. These flags registers are then ORed together in the end when you need to check for overflow. This mechanism might also be used for floating point errors as an alternative to the NAN-propagation mechanism. High level language support can easily be implemented with try-catch statements:

try {
  ... a long series of calculations ...
}
catch (signed_integer_overflow e) {
  ... error message ...
}
I don't know if this is sufficiently useful to justify the cost of having more two-result instructions.
   
Proposal for an ideal extensible instruction set
Author: Hubert Lamontagne Date: 2016-02-23 14:47
Add with carry:
SIMD operations can be run with higher instruction latency than integer instructions (which have to run in 1 cycle or else they tend to bottleneck everything else). For instance, VADD has a 3~4 cycle latency on ARM. BIGNUM processing tends to have other longer latency operations like large 64x64->128 multiplications, so you could live with a 3~4+ cycle vector ADC as well. ADC could be a 4 input instruction: operand_a, operand_b, operand_a_of_previous_computation, output_of_previous_computation (instructions with lots of inputs are relatively common on SIMD instruction sets). This can even be chained, and the CPU's register renaming engine can totally take care of adc-to-adc dependencies.

Among 'modern' architectures (which I'd define as 'architectures that have at least 1 fast out-of-order implementation'):
- MIPS doesn't have flags at all.
- DEC Alpha doesn't have flags at all.
- PA-RISC has a couple of bits in the processor status word, but conditional branches don't use them (a carry flag strictly for adc/sbc/add*/sub*, a multi-step division flag, and a nullify flag that skips over the next instruction).
- ARM has flags, but only some instructions set them (CMPS, SUBS, arm32 instructions with the 'S' bit), and it doesn't have partial flag updates.
- x86 infamously has partial flag updates on every ALU instruction (which means it needs multiple aggressive rename units).
- POWER has an 8-field condition register, with ALU ops optionally updating field 0 and CMP updating a selected field (in addition to a count register).
- Itanium has the 64 single-bit predicate registers (supposedly the one thing that prevented an Intel team from making an out-of-order Itanium!).

So I guess it's a bit of a wash but I don't think flag registers make cpus faster (Alpha didn't need flags to be fast!).

Regarding 16-bit instruction size:
Agreed, multiple instruction sizes are bad unless you have no choice. I'd still argue for a single 4-byte instruction format: lots of fast architectures use it (Alpha, MIPS, PA-RISC, POWER, ARM64), and instructions with large immediates are rare and generally easy to split into multiple instructions. Adding 8-byte and 12-byte instructions doesn't sound like a large increase in complexity, but it is: it means instructions can span more than one cache line (= you need a prefetch buffer = your pipeline becomes at least 1 or 2 cycles longer); the second instruction of an issue group can be located in multiple different positions, which means you need more multiplexers (+0, +4, +8 bytes), and this problem grows with every successive instruction (the 4th instruction can be at +0, +4, +8, +12, +16, +20 or +24); and it adds pipeline stall checks for cases where there are simply too many large instructions and the icache can't keep up.

Separate Register Files:
Then it's probably best to have an integer register file, floating point register file, and vector register file yes.

Exceptions:
I still think that's spending an awful lot of silicon in the parts of the cpu that are the most sensitive to timing, for something that I think isn't going to see any use: it isn't even in C++ aside from intrinsics, it prevents the compiler from reordering SSA (it makes + non-associative!), and it can be simulated with a couple of extra MIPS-style ops.

   
Proposal for an ideal extensible instruction set
Author: Agner Date: 2016-02-24 00:58
Add with carry:
Your idea of a 4-input add with carry is interesting. I see a few problems, though:
* You will need more space in the opcode to contain 4 registers.
* Handling multiple inputs is just as difficult as handling multiple outputs. For many years, Intel had a limitation of 2 inputs per micro-operation, and they had to split add-with-carry and several other instructions into 2 micro-operations for this reason.
* The instruction might have a latency of 2 clock cycles unless you can implement it with two double-speed adders. Mixing instructions with different latencies is a problem because a 2-clock instruction may need the result bus at the same time as a subsequent 1-clock instruction. It is best to standardize instruction latencies and have as few different latencies as possible.
* It is problematic to make a hardware design that cannot handle instructions with two outputs. You also need two outputs for integer division (output quotient and remainder) and for full length integer multiplication.

I can see the following possibilities for implementing 2-output functions:
1. Use one instruction with two output registers.
2. Use two separate instructions, possibly executed simultaneously, one for each output.
3. Use two elements of a vector register.

Method 1 would certainly be the most efficient and straightforward solution. We just have to weigh the hardware costs versus the benefits.
Method 2 is less efficient. For example, for integer division, you cannot make two divisions simultaneously, or even pipelined, unless you double the hardware.
Method 3 might be an efficient solution for a scalar add-with-carry chain, but for other purposes, it will complicate the software when you have to split and join data into vector elements. It becomes more complicated when you want to vectorize: you may use even-numbered vector elements for addend and sum, and odd-numbered vector elements for carry. This will complicate both hardware and software, and the throughput will be half of what you would get with method 1.

Instruction length:
As I argued before, you need 32 bits for address offset, so you must allow instructions of two 32-bit words.

Many instruction sets don't allow big immediate constants. For example, to load a 32-bit constant you need a memory operand with a 32-bit offset. My argument is that it is more efficient to have a 32-bit immediate operand than a 32-bit offset to a 32-bit memory operand. This will reduce the loading on the data cache. Remember that cache misses are very expensive.

In my analysis I found that you may need instructions of three 32-bit words to accommodate all the bells and whistles of vector instructions with a memory operand, variable vector length, mask, etc. There may be more needs for long instructions in the future as new features are invented. At least the current trend goes towards putting more features into a single instruction to get higher overall performance. This is the reason for my decision to allow instruction lengths of one, two and three 32-bit words. This is certainly a compromise, since instruction length decoding becomes more expensive the more different instruction lengths you have.

If we decide to allow an instruction length of 3*32 bits then we can afford the luxury of allowing immediate constants of 64 bits, for example a double precision float or a 64-bit absolute address.

On the other hand, if we limit the instruction length to 2*32 bits, then there will be certain instructions that cannot have a memory operand with a 32-bit offset, and we will need two instructions to load a 64-bit immediate constant. This would still be a viable solution, but I suspect that a patch would be added in the future when the need for more instruction bits arises as more features are added. Remember how many patches have been added to old instruction sets.

   
Proposal for an ideal extensible instruction set
Author: asdf Date: 2016-02-24 04:08
> Checking for integer overflow is very complicated in high-level languages

How about supporting 2 kinds of arithmetic - saturating and modular? To check whether an arithmetic operation overflowed, do this:

c = a modulo_op b;
d = a sat_op b;
if (c != d) { overflow }

   
Proposal for an ideal extensible instruction set
Author: Agner Date: 2016-02-24 13:04
asdf wrote:
How about supporting 2 kinds of arithmetic - saturating and modular? To check whether an arithmetic operation overflowed, do this:
c = a modulo_op b;
d = a sat_op b;
if (c != d) { overflow }
This can easily be implemented. x86-SSE2 has saturating addition.

It is less efficient than my proposal, though, because you need almost 3 times as many instructions.

The method doesn't work for multiplication, though. It can happen that c = d after an overflowed multiplication.
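
A concrete demonstration in C++ (illustrative numbers, chosen so that the true product is 2^33 - 1 and both results come out as 0xFFFFFFFF):

#include <cassert>
#include <cstdint>

int main()
{
    uint32_t a = 7, b = 1227133513u;   // true product = 8589934591 = 2^33 - 1
    uint32_t wrapped = a * b;          // modular result wraps to 0xFFFFFFFF

    uint64_t wide = (uint64_t)a * b;
    uint32_t saturated = (wide > 0xFFFFFFFFull) ? 0xFFFFFFFFu : (uint32_t)wide;

    assert(wrapped == saturated);      // equal, although the multiply overflowed
    return 0;
}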

   
Proposal for an ideal extensible instruction set
Author: Agner Date: 2016-02-25 02:13
Agner wrote:
We could add two more flags bits which are accumulating, i.e. the signed and unsigned overflow condition is ORed with the previous value of the flag bit. The program can then check these accumulating overflow flags after a series of instructions.
I wonder if we can get rid of the floating point control word and floating point exceptions with this method.

First, the rounding mode should be specified in the instruction that needs it, not in a global control word. Only a few instructions actually need a specified rounding mode, most importantly float-to-integer conversion. In the rare case where you need a specified rounding mode for addition, multiplication, etc., you will have to use a long version of the instruction with all the option bits.

Floating point errors can be detected in most cases by the INF and NAN propagation mechanism. In the cases where a more detailed error detection is needed, we will specify a combined predication/flags register in the instruction. Bit zero of this register is the predicate. A few more bits are the traditional flags: zero, carry, sign, overflow. And then I want to add a few accumulating error bits which indicate the error condition ORed with the previous value of the same bit. This mechanism also works with vectors, where the flags register is a vector register.

The advantage of this proposal is that we get rid of the floating point control/status register and floating point exceptions. The disadvantage is that we get an extra input and output dependence when a predication/mask/flags register is specified.

   
limit instruction length to power of 2
Author: A-11 Date: 2016-02-24 12:49
How about limiting instruction lengths to powers of 2 - 1 byte, 2 bytes (not 3 bytes), 4 bytes (not 5, 6 or 7 bytes), 8 bytes, 16 bytes and so on?

We know that although current x86 instruction lengths are a nightmare - varying from 1 byte to 15 bytes in 1-byte increments - aligning instructions to a 16 or 32 byte boundary helps the fetch unit of x86 processors.
The data types of the C language already obey this rule:
there are 1 byte ("char"), 2 byte (typically "short int", "wchar"), 4 byte (typically "int", "float") and 8 byte (typically "long int", "double") primitive types, but no 3 byte ones.
These data are then aligned on boundaries of their own size.
If instructions were aligned on power-of-2 boundaries too, we could use an efficient fetcher (and maybe decoder).

As long as processors use binary notation, powers of 2 are their primitive sizes.
So extending instructions by powers of 2 costs little efficiency.
I think 12 byte instructions will be the same disaster in the future as the 3 byte instructions Intel introduced are today.

   
limit instruction length to power of 2
Author: Agner Date: 2016-02-24 12:57
A-11 wrote:
How about limiting instruction length to power of 2
If you have, for example, 8B - 2B - 8B, then the second 8B instruction will be misaligned, and the advantage disappears.
   
Any techniques for more than 2 loads per cycle?
Author: Hubert Lamontagne Date: 2016-02-24 17:50
Going to go on a tangent here but how could gather/scatter be implemented in hardware? The 'traditional' way to implement data cache seems to be to have 2 read ports and 1 write port with banking (and if your 2 loads fall on the same bank you only get 1 load), with aggressive reordering, but obviously this limits scatter/gather issue width a lot (probably making anything more than 4 or 8-way gather/scatter useless). Increasing the number of L1 ports causes tons of problems:

- It makes bank selection for loads more complex, potentially increasing load latency by a cycle (presumably from 3 cycles to 4 with address calculation included): there are more multiplexers on the address inputs of each bank, more multiplexers on writebacks, more different stalling scenarios, and probably an increase in the number of banks.

- It makes load/store address conflict detection harder since you need to check even more reads against pending writes in the write buffer, and deal with more scenarios like multiple reads trying to access forwarded store values.

I've played around with various concepts to deal with this but I'm not sure I've found anything really interesting yet:

- An L0 cache could be introduced. Probably something very small, single-way, duplicated multiple times, loading whole cache lines from L1 on every miss, and probably only used when there are too many loads per cycle to be satisfied by the L1. Problems: this is still limited to 1 store per cycle, filling values from L1 competes with stores for the single write port, and it doesn't simplify address conflict detection with the store queue. (If I'm not mistaken, GPUs use something like this?)

- Pointers could be stored in special registers, and when a pointer register is updated, data from nearby addresses (say, possibly something like adr+0 to adr+63) are automatically pre-read into registers, and there is an automatic check that none of the other pointer registers are pointing to the same data with data modifications. You would possibly also have load/store instructions that bypass these special pointer registers (but with address conflict checking). This is very complex (especially the address checking, which is unfortunately necessary for C++ compilers), and it doesn't help you at all if your data is widely spaced or uses indexed offsets (register+register*n). But on the other hand, data accesses that do fall into this pattern (like loading/storing a whole bunch of contiguous stack addresses or object member variables) become register accesses, they can be renamed, reordered willy-nilly, pretty much every instruction can load/store a value, misaligned addresses don't matter anymore (except when changing a pointer register), and if the address is divisible by 64 you can conceivably load/store a whole cache line in one go.

   
Any techniques for more than 2 loads per cycle?
Author: Agner Date: 2016-02-25 01:40
Hubert Lamontagne wrote:
how could gather/scatter be implemented in hardware?
I don't know any microprocessor that can gather/scatter all vector elements simultaneously. It will use multiple clock cycles and gather one - or at most two - vector elements per clock cycle.

I am using a trick when the data to gather are not too distant from each other in memory: read contiguous data into the largest available vector register, and then use a permute instruction to get the data into the desired positions in the vector. We should of course have efficient permute instructions that can move data from any vector position to any other vector position. The indexes for the permutation are provided in another vector register. An index out of range should produce a zero, so that larger permutes can be composed by OR'ing the results of multiple permute instructions.
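
A scalar sketch of the intended semantics in C++ (mirroring the proposed permute-with-zeroing, not any existing instruction):

#include <cstddef>
#include <cstdint>

// out[i] = block[idx[i]], with out-of-range indexes producing zero
// so that two partial permutes can be OR'ed together.
void permute_gather(const int32_t* block, size_t block_len,
                    const uint32_t* idx, int32_t* out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (idx[i] < block_len) ? block[idx[i]] : 0;
}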

   
limit instruction length to power of 2
Author: A-11 Date: 2016-02-25 07:20
Agner wrote:
If you have, for example, 8B - 2B - 8B, then the second 8B instruction will be misaligned, and the advantage disappears.
The same thing happens with data structures.
typedef struct {
    double a /* 8B */;
    short b /* 2B */;
    double c /* 8B */;
} foo_t;
For this structure, some C compilers would arrange "8B(a) : 2B(b) : 8B(c - misaligned!)".
But most compilers generate "8B(a) : 2B(b) : 6B(padding) : 8B(c)" for this data block.
So my answer for the instruction case is "8B - 2B - 2B(NOP) - 4B(NOP) - 8B".
Unlike C structures, where the language prohibits rearranging member order, compilers can reorder instructions to bury the NOP padding.
For example, from "8B - 2B - 2B(NOP) - 4B(NOP) - 8B - 4B - 2B" to "8B - 8B - 2B - 2B - 4B".
So we only have to worry about the 6B of NOPs when the block is broken up by jumps, where we would pad with NOPs anyway for instruction alignment.

I think we must worry not only about the explicit NOPs above, but also about the implicit unused bits in an instruction.
MIPS-I has 4 fields of 5 bits each in an instruction.
But because it is a 3-operand architecture, it tends to use only 3 of those fields, leaving 1 field unused.
In my proposal, your 12B instructions have to bloat to 16B, which implies wasting 4B.
According to data compression theory, the more predictable these bits are, the less information they hold.
I guess this is one reason why RISCs have lower code density.
It's the trade-off between code density and fetch speed.
Also from that theory, the more information we stuff into the bits, the less predictable (= more randomized) they become.
The nightmare of the random x86 format might be proof of its high code compression ability.

My understanding for "extensible instruction set" is how to keep old instruction set in the future which is new instruction set we are designing today.
Like you, I also can't imagine the day when 32bit alignment is too small.
But Intel also could not too, and they supposed 8bit alignment is enough, resulting today's nightmare.
So I think we must not imagine enough scale of alignment.
Exponential size is scale-free like fractal.

   
limit instruction length to power of 2
Author: Hubert Lamontagne Date: 2016-02-25 10:23
Agner:
Doing one large load then using parts of it is cool yes, although I kinda expect that it's hard to end up with a net speed gain over simple scalar code when doing that sort of thing.

A-11:
That adds some hard decisions in C++ compilers: it forces the compiler to potentially reorder things (for instance, grouping 2B instructions together), and possibly encourages the compiler to heavily favor small instructions (maybe even breaking down large 8B instructions into two or three 2B instructions). Though it also has some benefits: it gives you multiple instruction sizes without needing a prefetch buffer, and it's easy to decode. And I guess you could design it so that 2B instructions generate a single micro-op, 4B instructions can generate two micro-ops, and 8B instructions can generate 4 micro-ops, which would let you align instruction queue inputs to instruction cache outputs, and multi-output instructions could be forced to use larger encodings to ration register write ports.

   
More ideas
Author: Agner Date: 2016-03-04 11:16
I have an idea that would make it very easy to optimize array loops:

Define an addressing mode [register1 - register2]
Specify vector length (in bytes) in a register. If the specified value is higher than the maximum vector length supported by the processor then the maximum length is used.

Now we can loop through an array in this way:

P = address of array
J = size of array (in bytes)
L = maximum vector length (depends on processor)
X = a vector register
P += J;   // point to end of array
while (J > 0) {
   X = whatever_operation[P-J]{vectorlength J}
   J -= L
}
Here, J has the triple function of loop counter, array index, and vector length. The array size does not have to be a multiple of the vector size: The last iteration of the loop will automatically use a lower vector length if required, and no extra instructions are required to calculate the remaining size. We have completely got rid of the extra code that is typically needed to handle the remaining array elements when the size is not a multiple of the vector length. The code will work optimally on different processors with different maximum vector lengths. There is no need to recompile the code when a new processor with a different vector length appears on the market. Obviously, we can read and write any number of arrays inside the loop, using the same method.
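
An illustrative C++ rendering of the same loop, with the hardware clamp written out explicitly (process_block is a stand-in for the vector operation):

#include <cstddef>

void process_block(char* p, ptrdiff_t len) { /* the vector operation */ }

void process_array(char* array, ptrdiff_t size)
{
    const ptrdiff_t L = 64;               // maximum vector length of this CPU
    char* P = array + size;               // P += J: point to the end of the array
    for (ptrdiff_t J = size; J > 0; J -= L) {
        ptrdiff_t len = (J < L) ? J : L;  // the clamp implied by {vectorlength J}
        process_block(P - J, len);        // operates on len bytes at P - J
    }
}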

If we don't want to have too many different addressing modes, we can maybe ditch the addressing mode with a scaled index register, assuming that the above method will be used for most loops.

And one more proposal. There is a trend to add more and more feature bits to instruction codes, such as rounding mode, exception control, broadcasting, permutation, shifting, zeroing. This makes instruction codes longer, and it is a waste of code cache size because most of these bits are rarely used. I will propose to put some of these extra feature bits into a register. I have already specified a predicate/mask register in my initial proposal. The extra feature bits will be specified in the same register. Now, we will have only one "enable-features" bit in the instruction code, which enables the extra feature bits in the register. If the "enable-features" bit is zero then all but the predicate/mask bit in the register are ignored.

All unused bits in the features/predicate/mask register are reserved for future use, and must be zero.

Some of the bits in the features/predicate/mask register can be output bits. They can be used for flags (carry, zero, sign, overflow) and accumulating error flags as I proposed in a previous post. The output bits will be unchanged when the "enable-features" bit is zero in order to save a register renaming.

Only features that influence the scheduler and renamer need to be hard-coded into the instruction. Some of the feature bits will not be available for vectors with 8-bit granularity if we don't have enough bits.

   
More ideas
Author: Hubert Lamontagne Date: 2016-03-07 10:57
I love your idea of using the remaining number of iterations, clamping it to the SIMD width and using that as a per-iteration width. I have to admit that's how a lot of my block processing code looks:

for(int i=0; i<nb_samples_to_do;)
{
int block_samples = nb_samples_to_do - i;
if(block_samples > 64) { block_samples = 64; }
[process block_samples items];
i += block_samples;
}

I think the scaled indexed addressing mode is mainly there for another reason: reading look-up tables, and other cases where the array index is calculated on the fly inside the loop (2D texture mapping, audio resampling and so forth). This addressing mode mostly makes sense for scalar integer and floating-point operations though (and scatter/gather operations if you end up having those).

For SIMD code, you tend to have free leftover cycles on the integer scalar part of the cpu so I don't feel that addressing modes are all that important - on the ARM NEON code that I did, I could simply do pointer updates and recalculations for the addressing types that the NEON didn't allow because performance was limited by the NEON unit anyways. On the other hand, you can also allow fancier addressing modes like post-increments in SIMD code because they don't use the same register files (MIPS has this: integer loads/stores are register+offset ONLY, but floating point loads/stores also allow register+register*4 since it doesn't create the case where store operations need 3 input registers; ARM NEON has a post-increment addressing mode where the increment is the SIMD load width).

Come to think of it, would it make sense to adapt the code for different maximum SIMD vector length in the relocation pass? (ie when correcting all jump offsets when loading a DLL or doing address layout randomization to prevent hacking when loading executables)

   
More ideas
Author: Agner Date: 2016-03-08 01:52
Hubert Lamontagne wrote:
would it make sense to adapt the code for different maximum SIMD vector length in the relocation pass? (ie when correcting all jump offsets when loading a DLL or doing address layout randomization to prevent hacking when loading executables)
The Gnu loader actually has this feature, called gnu indirect function. It can make entries in the procedure linkage table (PLT) point to different versions of a function depending on, e.g., which instruction set is supported. This feature is useful, but poorly documented. My idea is that you need only one version of the code and it will work optimally on all processors regardless of their maximum vector size. With current systems, you have to make a new version of the software every time a new processor with an improved instruction set comes on the market.
   
More ideas
Author: Agner Date: 2016-03-09 10:47
The idea of supporting vector registers with variable length has important consequences for the instruction set architecture as well as for the entire ecosystem of compilers, function libraries, etc. I will discuss my thoughts about this here.

First, the register set. We have discussed whether there should be different registers for integers and floating point numbers, and for scalars and vectors. So far, the following solutions have been proposed:

1. One universal register set for everything.
2. Two register sets, one for scalars and one for vectors. Same registers are used for integers and floating point.
3. Two register sets, one for integer scalars and one for everything else: floating point scalars, integer vectors and floating point vectors.
4. Three register sets, one for integer scalars, one for floating point scalars and one for vectors of all types.

The reason for using the same vector registers for integers and floating point numbers is that they share many of the same instructions, as mentioned in a previous post.

If we assume that a lot of floating point code involves arrays and loops, then we must prioritize easy vectorization of floating point code. If we assume, furthermore, that a lot of floating point code contains calls to mathematical function libraries, then we must make these library calls vectorizable. Mathematical library functions such as sine or logarithm should have a variable-size vector as input and a similar variable-size vector as output. It will be simpler to use the same functions for scalars by specifying a vector length of one, rather than having separate function versions for scalars and vectors. This will make it easier for an optimizing compiler to convert function calls in scalar code to vector code. A consequence of this is that we should use the same register set for floating point scalars and floating point vectors.

A drawback of using vector registers for scalars is that vector registers cannot have callee-save status because the vector length is variable with no theoretical upper limit. We must find out if scalar non-vectorizable floating point code is sufficiently common to justify having a separate register set for floating point scalars. For integers, on the other hand, there is no doubt that scalar code is common. We need scalar integer registers for pointers, loop control, and all kinds of general code. This leaves us with option 3 above as probably the optimal solution: one register set for integer scalars, and another register set for floating point numbers and vectors.

The priority on vector support, variable-length vectors, and variable length vector functions has important consequences of the whole ecosystem of compilers, function libraries, etc. We must define an ABI standard that supports functions with variable-length vectors. If registers r9 - r15 are used for specifying vector length, as proposed, then it will be natural to use these registers also to specify the vector length of function parameters and function returns. If multiple vector parameters have the same length (in bytes), then they should use the same vector length register, r9. If multiple vector parameters have different length then they will use r9, r10, etc. If there are more than 9 scalar integer parameters before one or more variable-length vector parameters, then the vector length will have precedence over the scalar integer parameters for the use of r9 - r15.

The vector length is specified in bytes in the vector registers and in assembly code because this makes loops more efficient (we can use the same register for loop counter, array index and vector length as explained in my previous post). High level code differs from this by specifying the vector length as the number of vector elements. The compiler can easily translate this to bytes, like it is already doing for array indexes. A function that uses multiple vectors of different kinds should preferably have the same element size for all vectors, i.e. 64-bit integers if you have double precision floats.

This system also needs special support in compilers. As a minimum, we need a way of defining functions with variable-length vectors as parameters and as return value. Many contemporary compilers already have a way of specifying fixed-length vector registers as parameters and variables.

The problem that vector registers cannot have callee-save status can be met by making an addition to the object file format that allows a library function to specify which registers it modifies. A compiler that supports whole-program optimization can use this information at the register allocation stage to avoid the need to save registers across calls to library functions with static linking.

   
Proposal for an ideal extensible instruction set
Author: Joe Duarte Date: 2016-03-07 15:18
Very interesting Dr. Fog. I love clean-sheet approaches to ISAs and microarchitectures, since we've been stuck with the same dated monoculture for so long.

Regarding your proposal, what do you think about a transport-triggered architecture? Could your framework benefit from elements of a TTA? Instruction and code size could be reduced if certain registers automatically squared their contents, for example, or if designated pairs of registers automatically summed their contents. There are many possibilities. (Is there already an incrementer (by 1) register in existing architectures? Someone told me there was, but I've not seen it. That would be handy for some loops.)

Relatedly, do you think it would be worthwhile to have certain constants hard-coded into the processor? I'm thinking of things like pi, Euler's number, sqrt(2), and others. The actual constants to include would be best determined empirically, based on which are most used by programs in this era. Might there be useful speed-ups if a given register multiplied its contents by pi, for example? I have no idea what the physical engineering in the silicon would entail -- perhaps TTAs present difficulties?

Cheers,

Joe Duarte

   
Proposal for an ideal extensible instruction set
Author: Agner Date: 2016-03-08 02:11
Joe Duarte wrote:
what do you think about a transport-triggered architecture? Could your framework benefit from elements of a TTA? Instruction and code size could be reduced if certain registers automatically squared their contents, for example, or if designated pairs of registers automatically summed their contents. There are many possibilities. (Is there already an incrementer (by 1) register in existing architectures?
A full transport-triggered architecture, as defined by Wikipedia, is very dependent on timing, and the software needs to be recompiled for different CPUs with different latencies in the execution units. I don't see how it can handle out-of-order execution. I think the explicit parallelism will suffer every time there is a cache miss. Your idea is not as radical as this, but it would be very application specific. You cannot use a register that is tied to a specific ALU function in applications that don't need this function. Instructions that read and increment a pointer are available in many architectures.

Hard-coded constants like pi etc. are available for the x87-style floating point registers in x86 processors, but these registers are now mainly obsolete and replaced by the vector registers, which don't have this feature. Apparently, the advantage was too small. My proposed instruction set allows immediate floating point constants of both single and double precision. I think this will serve the same need.

   
Proposal for an ideal extensible instruction set
Author: Hubert Lamontagne Date: 2016-03-08 12:39
Joe Duarte: Looking up transport-triggered architectures... They are interesting, but I think they share most of their problems with VLIW:

Programs typically look kinda like this:
- Load from memory #1
- Long chain of dependent math #1
- Store to memory #1
- Load from memory #2
- Long chain of dependent math #2
- Store to memory #2

To run in parallel, you have to run the math chain #1 and #2 at the same time. Since chain #2 ops depend on load #2, you have to move load #2 up before store #1. This forces the compiler to prove that load #2 and store #1 can't possibly fall on the same memory address (this is what the LLVM alias analysis does), which turns out to be a hard problem often requiring global analysis and often fails. VLIW architectures often have software alias detection to deal with this: the Transmeta Crusoe had Load-Lock, Store-Check, Commit(+jump to fallback if commit fails); Itanium infamously had the ALAT where you'd do a ld.a (advanced load), then later on a ld.c or chk.a to confirm that the data from the ld.a isn't baloney and branch to fallback code if it is.

Out-of-order architectures are popular because they do this automatically for you (the store operations calculate the target address on the spot but can wait for the value for many cycles). Also, what if a load doesn't fall into L1 cache? On a VLIW (and, presumably, TTA, unless you made all your transport use queues), this is a hard stall. Out-of-order architectures can at least somewhat reorder operations around this - with some luck, hopefully it can find enough operations to do until it can get an L2 cache result.

Other problem is, as Agner said, that it's hard to adapt code written for a gen 1 TTA cpu to some presumably larger gen 2 TTA cpu - you'd probably need to more or less dynamically recompile it to the new wider cpu, which is easily as complex as current out-of-order RISCs and ARMs and x86s.

That being said, supposedly NVidia's Denver pulls off VLIW correctly and gives good perf, so I guess it is possible to make this work.

---

For built-in constants, that's not so useful because a lot of constants also have some scaling built-in (for instance result = sin(2.f * 3.141592653f / 256.f * i); ) and most of the time if your calculation involves pi or e, it involves some very slow op like sin() or exp() so having to load one more constant from cache won't slow down things appreciably.

   
Proposal for an ideal extensible instruction set
Author: Joe Duarte Date: 2016-03-09 19:58
Thanks Agner and Hubert. That's good to know about constants – my intuition there was wrong I guess.

Agner, why 32 registers? That's an old norm, and thus seems somewhat arbitrary. Is there good empirical research on optimal register count for general purpose modern and foreseeable computing? All I've seen is stuff like this, focused on embedded systems: arxiv.org/ftp/arxiv/papers/1205/1205.1871.pdf By their measures, 80 registers looks good, but it doesn't make a big difference. I don't know if their method is valid, though, since it's not my field. 32 registers seems a bit low in your case since they serve as unified integer/float registers.

Papers I've read recently that you might find interesting:

1. The Idempotent Architecture, which eliminates recovery from mis-speculation. research.cs.wisc.edu/vertical/papers/2011/micro11-idem.pdf

2. ISA extension for hashing, with huge speedups: www.adms-conf.org/2014/adms14_arnold.pdf

3. Logarithmic number system processor as alternative to floating point unit: https://www.ece.ucsb.edu/~parhami/pubs_folder/parh13-asilo-log-arith-as-alt-to-flp.pdf

I like that you're proposing real progress in ISAs. I've been so disappointed in the laziness and lack of innovation in the industry with respect to instruction set architectures, operating systems, and systems programming languages. We've been in an x86, POSIX/Windows, and C rut for a very long time.

John Regehr had some interesting thoughts about what instructions we'd *discover* if we started from first principles and generated optimal instruction sets based on some starting assumptions about what humans need computers to do: blog.regehr.org/archives/669

Regarding your variable vector size, I wonder also about variable data bit-length. The classic doubling values of 8, 16, 32, 64, 128-bit, etc. seem arbitrary to me, and I wonder if a careful empirical investigation might tell us to use different sizes. Or perhaps variable bit lengths could be implemented as easily as variable vector sizes (which usually specify only a couple of allowed field lengths). I don't know what the processor hardware engineering implications of this are. We might discover an energy and performance sweet spot of, say, 24-bit integers or 40-bit floats for lots of applications, for example.

   
Proposal for an ideal extensible instruction set
Author: Agner Date: 2016-03-10 04:51
Joe Duarte wrote:
why 32 registers?
Thanks for the references. The article by Alipour seems to be about the physical register file used for renaming, not the number of logical registers.

If the compiler can do global register allocation then it can avoid spilling registers to memory when function1 calls function2, by using different registers in the two functions. The more registers we have, the deeper the levels of nested function calls we can have without spilling registers to memory. But we should not forget that a typical program spends more than 99% of its execution time in the innermost loop. The innermost loop should not have more than 2 - 3 levels of function nesting in a well designed program. Register spilling outside the innermost loop is pretty irrelevant for performance. If we assume that each function uses a handful of registers and that some of these are used for short-lived variables that do not need to be saved across a function call, then the optimal number of registers might be something like 16.

If a typical instruction code has three register fields then we will need to use three more bits of instruction code for each time the number of registers is doubled. With 32 registers and three register fields, we will use 15 bits of the 32-bit code word only for specifying register operands.

The proposed register set includes vector registers of variable size, and the size may grow indefinitely in future implementations. Saving and restoring a vector register of variable size is quite complicated: first you have to detect the maximum vector size, and then you have to allocate a corresponding space on the stack. It is good to have many vector registers in order to minimize the need for this procedure. Therefore, I think that 32 vector registers is reasonable. It will be difficult to extend the number of registers in the future if the need should arise, so it is better to settle for too many than too few.

The number of scalar registers may be the same because scalar and vector instructions should be coded in the same way, according to my proposal.

   
Proposal for an ideal extensible instruction set
Author: Hubert Lamontagne Date: 2016-03-11 01:58
Joe Duarte:

32 registers kinda balances the need for smaller instructions and the fact that smaller register files are faster, smaller and soak up less power, with the fact that memory accesses are slow and complex. With register renaming, physical regfiles generally have at least 64 registers (MIPS R10k) if not way more (88*3 on Athlon, many more on hyper-threaded CPUs), so anything less than that only saves instruction bits. You also see stack-like register windows (SPARC, i960, Am29k) and rotating register files (Itanium) but the general consensus seems to be that this is overdesigned and MIPS does just as well with 32 ordinary registers. 16 registers is almost as good in typical code (see: ARM, x64) but the "cost" of increasing from 16 to 32 is low enough that architectures tend to go with 32.

Extremely wide in-order CPUs (ie VLIWs) might need more registers to keep all the values generated by software pipelining (Itanium illustrates this) but for "mainstream" designs this isn't considered to be a good plan (ie if you want to make a very wide core, you'll probably have to make it out-of-order to make it any faster than 2-instructions-per-cycle anyways).

Also note that it's very common to have different numbers of float and SIMD registers. For instance, ARM has 16 registers, but its FPU has 32 registers (for the Arm A8/A9/A15/etc fpu, shared with SIMD).

2^N variable sizes exist because you want to be able to calculate array memory addresses with a bitshift. If you allow 24bit integers for instance, then your memory calculation becomes [pointer + (index<<1) + index], not so convenient. And DRAM tends to come in multiples of 8 bits or 9 bits (for parity). Some DSP architectures use 24bit, 48bit and other unusual integer sizes.
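
To make the cost concrete, here is that address arithmetic written out in C++ (my sketch of what a compiler would emit, not from Hubert's post):

    #include <cstddef>

    // Element address for a power-of-2 size: a single shift.
    char* elem32(char* base, std::size_t i) { return base + (i << 2); }

    // Element address for 24bit (3-byte) elements: shift plus an extra add.
    char* elem24(char* base, std::size_t i) { return base + (i << 1) + i; }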

The idea of idempotent instruction groups is interesting, and somewhat complementary to another different instruction grouping conceptual scheme I'm playing with (grouping chains of dependent instructions so that only the last instruction of the group writes to a register).

---

"We've been in an x86, POSIX/Windows, and C rut for a very long time."

This is for a good reason. The ~4-instruction-per-cycle out-of-order CPU is pretty hard to beat in terms of practicality and speed, and attempts to beat it face some pretty daunting challenges. Itanium was a valiant effort, but it failed and never really was faster than x86.

One big problem is that the L1 data cache will, at best, have 2 read ports and 1 write port, and that typical code often has 30% of memory loads/stores. This means that it's hard to get a speed gain when making a cpu that runs more than about 4 instructions per cycle.

The last DEC Alpha design was going to do 8 instructions per cycle, but it just couldn't do it for typical programs and they had to run multiple threads on the core to be able to keep the pipeline full. Part of the reason why Intel is top-of-the-game now is that they're top-of-the-memory-access-game.

In C++, the program basically specifies the exact order of memory loads/stores, and it takes huge efforts to escape this ordering (compiler alias analysis, out-of-order cpus, weird speculative loads/stores in VLIWs). Multi-threading, SIMD and even GPUs can be viewed as basically mechanisms to make this ordering more flexible.

Higher level languages like Python typically do even more loads/stores/jumps than C++, which makes them even less optimizable (since it's very likely that they are essentially serial, and they let you do crazy tricks that force you to do everything serially). If there's any hope to get a language that's more efficient than C++, I'd say that IMHO it's probably a language that forces a limitation of "absolutely no pointer aliasing" - so probably with no pointers, no references, no side-effects (and probably copy-on-write objects).

   
Proposal for an ideal extensible instruction set
Author: Agner Date: 2016-03-11 03:56
Hubert Lamontagne wrote:
You also see stack-like register windows (SPARC, i960, Am29k) and rotating register files (Itanium) but the general consensus seems to be that this is overdesigned and MIPS does just as well with 32 ordinary registers.
A problem with rotating register windows is that the register wheel or stack will overflow when functions are too deeply nested (assuming that you rotate one frame at each function call). You have to keep track of the function nesting level, which may be impossible when you call a DLL that calls another DLL, etc.

If there's any hope to get a language that's more efficient than C++, I'd say that IMHO it's probably a language that forces a limitation of "absolutely no pointer aliasing" - so probably with no pointers, no references, no side-effects (and probably copy-on-write objects).
Is that possible? If you have to copy every array to avoid pointer aliasing, then you lose a lot of efficiency.
   
Proposal for an ideal extensible instruction set
Author: anon2718 Date: 2016-03-13 23:13
One thing I don't tend to see is hardware management of rotating register windows. Or, to look at it slightly differently: instead of a straight register-based machine or a straight stack-based machine, you have a machine where you can freely access the top <k> elements of the stack for some k.

You have a stack in memory. The contents of said stack area in memory are normally undefined (!), except that instructions can read / write the top, say, 32 elements. Or whatever the window size is. Call / return adjust the stack - although I am not sure how much the stack should be adjusted. There are advantages and disadvantages to fixed and variable-sized adjustments, as well as the question of whether returns must be matched with calls. There may also be instructions to force an explicit and up-to-date read / write of any element of the stack.

To the processor, treat it as a ring-buffer (or potentially tagged-memory, to allow for easier context switching) cache of the top part of the stack. It speculatively saves anything "below" the current window to the stack to make room for calls, and speculatively loads from the stack to keep the buffer full on returns. As usual here, there is the question of how speculative it should be.

Some architectures have a separate instruction cache - this one has a separate stack cache.

   
Proposal for an ideal extensible instruction set
Author: Agner Date: 2016-03-14 02:19
anon2718 wrote:
instructions can read / write the top, say, 32 elements. Or whatever the window size is. Call / return adjust the stack
If it is a rotating register window then it will get filled up when function nesting is too deep. If it is an unlimited stack of register windows then it has to be spilled to memory, which would make calls and returns very slow and would require a separate stack for this purpose only.
   
A design without a TLB
Author: Agner Date: 2016-03-11 05:50
I wonder if it is possible to design a microprocessor without a Translation Look-aside Buffer (TLB). A TLB is a cache that is used for virtual address translation. The TLB is quite big and complicated in many modern processors. Some processors even have two TLB levels. It costs a lot of silicon space and causes a performance loss because of TLB misses.

One of the most compelling reasons for having virtual address translation in current systems is, as I understand it, that you can run multiple instances of the same program. The multiple instances share the same code segment in order to save memory, but they cannot share writeable data segments. The code segment contains references to the data segment. If multiple running instances share the same code segment then they will also share the same data addresses. The only way to keep the data of different running instances separate is to have the same virtual addresses but different physical addresses for each process.

We can avoid most of the costs of the TLB and virtual address translation by having a special register that points to the data segment of the process. The program code should access its data segment through this pointer. Multiple instances of the same program will have different values in their data segment pointer. This allows them to share the same code segment while having different data segments.
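
In C++ terms, the idea could be sketched like this (purely illustrative; the names are mine, and the real mechanism would be a dedicated register and addressing mode rather than a global variable):

    // All static read/write data of the program gathered into one block.
    struct DataSegment { int counter; char buffer[1024]; };

    // Stands in for the data segment pointer register. The loader gives each
    // running instance of the program a different value here, so instances
    // share code but not data.
    DataSegment* data_segment_ptr;

    int next_id() {
        return ++data_segment_ptr->counter; // access relative to the pointer
    }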

Another use of the TLB is to manage the situation where we are out of physical memory or the memory has become too fragmented. The virtual address translation allows the memory segments to be moved or swapped to a disk. This problem can hopefully be reduced. RAM is cheap today. A well-designed application typically uses 1 MB of memory while a state-of-the-art PC today has 16 GB of RAM or more. Nobody runs 16,000 different applications, not even on a server. Unfortunately, some applications are wildly bloated. This should of course be discouraged. The need for byte-code interpreters, just-in-time compilation and other RAM-hungry frameworks will hopefully be reduced when the instruction set is standardized so that the compiled code will be compatible with many platforms.

We may still need some virtual address translation if we run out of RAM or for the sake of making virtual machines, but it will be a coarse-grained translation with one or a few large contiguous blocks of memory for each process, rather than the fine-grained translation of current systems with a large number of fixed-size memory blocks.

To see how this can be implemented, we first need to get an overview of the different kinds of data used in a running program and how they can be addressed. Traditionally, we have the following kinds of data:

1. Program code (TEXT). This is executable and read-only. Can be shared between multiple processes.
2. Read-only program data (CONST). This contains constants and tables used by the program. It may be shared by multiple processes unless it contains pointers that need to be relocated at load time.
3. Static read/write program data (DATA+BSS). This is used for global data and for static data inside functions. It needs multiple instances if multiple processes are running the same code.
4. Stack data (STACK). This is used for non-static data inside functions. This is the most common and most efficient way of storing program data. Each process or thread has its own stack, addressed relative to the stack pointer.
5. Local heap. Used for dynamic memory allocation by an application program.
6. Global heap. Used for dynamic memory allocation by the operating system and device drivers.
7. Thread data. Allocated when a thread is created and used for thread-local static data. Rarely used.

Now, I will discuss how each of these types of data are addressed in current systems and how they can be managed in a system without a TLB.

1. Program code. In current systems, program code may contain absolute addresses. These addresses are modified (relocated) by the loader if the code is loaded at a different address than expected by the linker. The relocation is often avoided by using virtual address translation. Multiple programs that do not need to call each other can be loaded at the same virtual address.
My proposal is to avoid the need for relocation by using relative addresses as much as possible. All addresses within the same code segment are addressed relative to the instruction pointer.

2. Read-only program data. Current systems use either relative or absolute addresses to address read-only data. These addresses often need to be relocated by the loader.
My proposal is to make the read-only data segment contiguous with the program code segment, and access it with addressing relative to the instruction pointer. This needs relocation at the link stage, but not at the load stage.
This segment may contain pointers. This is typically needed in switch/case jump tables, virtual function tables, function pointers and data pointers. Current systems often use absolute addresses in these cases, needing relocation by the loader. Some compilers use self-relative pointer tables or pointers relative to an arbitrary reference point (used in 64-bit Windows and Mac OS).
My proposal for jump tables, virtual function tables and code pointers is to use 32-bit self-relative addresses or addresses relative to the code base.
Tables of constant pointers to data are a problem because - in order to use relative pointers - we need to know whether the pointer target is in the read-only data segment or the read/write data segment. Preferably, such a table should be placed in the same segment as its targets, either the read-only data segment or the read/write data segment. This makes it possible to use self-relative pointers. If it contains a mixture of both, then it should be placed in the read/write data segment, and any targets in the read-only data segment should be moved to the read/write data segment.
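
A minimal sketch of such a self-relative pointer in C++ (my illustration of the technique, not code from the proposal):

    #include <cstdint>

    struct SelfRelPtr {
        std::int32_t offset;  // target address minus the address of 'offset'
        void* get() const {
            // The table can be mapped at any address: the target is always
            // found relative to the pointer's own location.
            return (char*)&offset + offset;
        }
    };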

3. Static read/write program data. Current systems use absolute or relative addresses to access read/write data in the same way as read-only data. If multiple processes are running the same program then there will be one instance of the read/write data segment for each process. The multiple instances will typically share the same virtual address, while having different physical addresses. This requires virtual address translation. We want to get rid of this translation.
My proposal is to have a dedicated register for pointing to the read/write data segment. All data in the read/write data segment are addressed relative to the value in this register. We may implement a special addressing mode for this or, alternatively just let the application copy the data segment register to a general purpose register which is used as pointer. Read-only data may optionally be stored here rather than in a separate segment.
The data segment pointer register needs to be saved and restored when one program calls another program. It does not need to be saved when calling a DLL, which I will explain below.

4. Stack. Each process and each thread has its own stack which is addressed by the stack pointer. No problem here.

5 - 6. Heap. You get a pointer when allocating data on a heap. Heap data are addressed through this pointer. No problem here.

7. Thread-local data. Current systems may have a "thread environment block" which contains various information about the thread and a pointer to the thread-local data segment. In x86, it is addressed through a special segment register. It also contains information about stack size, exception handler, process environment, etc.
My proposal is to preserve this system. We may need a dedicated register to point to the thread environment block or to a thread-local data segment.

Dynamic link libraries (DLLs). My proposal is to use Windows-style dynamic linking rather than Linux-style shared objects, because the latter have the rarely used feature of symbol interposition which makes everything less efficient (see www.macieira.org/blog/2012/01/sorry-state-of-dynamic-libraries-on-linux/ )

I propose that a DLL cannot have a per-process read/write data segment (this might not be thread-safe anyway). If a library needs writeable data, for example for some initializations, then there are three possible solutions:
(1) use static linking, (2) use data supplied by the caller through a pointer, or (3) use global heap data allocated at load time with an absolute address relocated by the loader. This data block will be shared by all processes.

The same applies to device drivers. A device driver may need a writeable data segment for a mutex and for storing information about the device. This data area is shared between all processes. My proposal is that this data block is allocated at load time. It is accessed through an absolute 64-bit address which is relocated by the loader. In case the driver later needs more data, for example if there are many network printers using the same driver, then the device driver can allocate additional space on the global heap and store a pointer to this allocated memory in the data segment it got by the loader. If the driver needs per-process data then it will use data space provided by the caller through a pointer.

These methods will make it possible to replace a TLB with its many small memory segments of fixed size by a memory map with a few large memory segments of variable size. Each process has its own little memory map which is cached in the CPU. The memory map should indicate the type of each memory segment, but not necessarily any address translation. Memory segments of the same type should be joined together and made contiguous as far as possible in order to make the memory mapping as simple as possible.

The memory map should not distinguish between executable code and read-only data for DLLs. This will make it possible to join all DLLs together into one big memory segment. Any unused space between the DLLs can be filled with an error code. The same goes for device drivers. Each process will have its own memory map listing the memory areas it is allowed to access. This will normally have only one segment of each type: TEXT, CONST, DATA+BSS, STACK, HEAP+THREADDATA. A segment for DLLs and their constant data can be shared between multiple processes.

Each process will be able to see DLLs that belong to other processes, but these are read-only and contain no writeable data, so I assume that there is no serious security risk here. A process may use static linking instead if it wants to hide which libraries it is using. The reason why I don't want to hide DLLs from processes that are not using them is that this would split the memory into many small pieces, one for each DLL. This would require many entries in the memory map.

It is my goal to keep the memory map small by keeping similar memory blocks together rather than splitting the memory space into many small pieces of different types.

This may cause problems for programming languages with just-in-time compilation. We will discourage systems and script languages that compile a little piece of code at a time. In fact, the justification for just-in-time compiling disappears when the instruction set is standardized. The code can adapt to different processors with different vector sizes at run time. We only need to (re-)compile code in case a new processor version has new advantageous instructions. The compiler/interpreter should preferably compile the code or script all at once. Piecemeal compilation also causes unpredictable response times which is annoying to the user.

I am undecided about how to implement system calls. It could use absolute addresses or a table of pointers in the read-only data segment or in the thread environment block, or a special system call instruction.

Self-modifying code is discouraged. If an application needs to generate executable code then it should preferably make a DLL and load it before executing it.

Many script languages allow self-modifying scripts. Such scripts should preferably be interpreted rather than compiled. If it turns out that there is a serious need for supporting self-modifying code for applications such as compiling self-modifying scripts, compiling user-supplied macros, or debugging applications, then we may decide to support a memory type that allows both write access and execute access. This write/execute memory will be allocated on a special heap dedicated to this purpose only. Access to use this feature must be restricted in order to avoid abuse by hackers.

Memory model:
The system proposed here gives immediate access to up to 8 GB of code (Jumps and calls use a 32-bit signed offset multiplied by the code word size), 2 GB of read-only data, 2 GB of static read/write data, 2 GB of thread-local data, almost unlimited stack size with 2 GB for each stack frame, and almost unlimited heap space. With such a huge address space, we need not support more than one standard memory model. Everything is accessed through pointers with a 32-bit relative offset.

In the rare case that there is more than 2GB distance between read-only data and the code that reads it, we will use a pointer to access it. This pointer can be stored in the thread environment block.

There is no addressing mode for absolute addresses. In the few cases where we need an absolute address (e.g. data for a device driver), we will load the 64-bit address into a register and use this as pointer. The 64-bit address is inserted into the code by the loader.

Problems:
There may be security problems if we are using a global heap. One process may be able to read and modify data belonging to another process. We should probably avoid using a global heap.

What can we do if the local heap or stack of an application overflows? If the heap overflows, we may make an extra heap that is bigger. This requires an extra entry in the memory map. If the stack overflows, then we need to move it to a different physical address and use virtual address translation. This still requires only one entry in the memory map, but with virtual address translation. The cost is that we have to copy the entire contents to a new physical address. The alternative to copying the entire stack contents is fragmented memory at the cost of having more entries in the memory map.

For these reasons, we cannot completely get rid of virtual address translation, but we can still keep the memory map much smaller than the TLBs of current systems.

   
A design without a TLB
Author: Hubert Lamontagne Date: 2016-03-11 11:06
Afaik, one of the main functions of the TLB is to assist in heap allocation. For allocations that go through the page allocator (>15kb on OSX for instance), it will simply get enough 4k RAM pages to hold the data (all it needs is a large enough contiguous address range in the program's address space to map them to). Remapping memory pages is needed to keep that system working.
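
For illustration, roughly what the large-allocation path looks like on a POSIX system (a C++ sketch; real allocators add bookkeeping on top of this):

    #include <sys/mman.h>
    #include <cstddef>

    // For big requests the allocator just asks the kernel for whole pages.
    // The kernel only needs a free range of *virtual* addresses; the page
    // tables map it onto whatever physical 4k pages happen to be free.
    void* bigAlloc(std::size_t bytes) {
        void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return (p == MAP_FAILED) ? nullptr : p;
    }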

Paging also removes the need for segmentation, which is why it's so popular - as far as I know it's still a net gain in simplicity.

I guess the potential avenues for simplification are:

- By making the pages very large, you could perhaps make the page table small enough to fit in on-chip static RAM, which would make the state machine for loading the TLB simpler. Probably not worth the trouble but still an interesting concept.

- You could debate removing page fault exceptions. This would keep the "heap memory management" aspect of paging (and probably the "security" aspect by mapping unauthorized accesses to a dummy page), but would make it impossible to implement virtual memory (aka disk swapping) and other similar tricks (file memory mapping etc). The benefit is that instructions following a load/store are no longer speculative, which could probably be beneficial on some semi-out-of-order architectures (for instance, having load addresses generated on the FPU with an FPU running super late in the pipeline).

   
A design without a TLB
Author: Agner Date: 2016-03-11 12:32
Heap and stack overflow is indeed a problem as I have written above.

If we make an effort to minimize fragmentation, then we can still use a memory map with a few variable-size memory blocks instead of a TLB with a high number of small fixed-size blocks. Modern TLBs are very complicated with multi-level lookup. I am sure there are possibilities for simplification.

   
A design without a TLB
Author: Agner Date: 2016-03-12 00:45
One more suggestion for reducing memory fragmentation: the operating system could keep statistics of how much stack and heap space each application uses, and allocate as much space as the statistics predict, plus a little more, when an application is started. The first time the program runs, it will use the stack size and heap size specified in the executable file header.

I don't think this is a serious burden to put on the operating system, compared to the complicated work of maintaining the large and complicated multi-level tables required by contemporary systems.

   
A design without a TLB
Author: Bigos Date: 2016-03-13 07:35
Hi.

There is another way to reduce the TLB cost, which is used by the Mill architecture [1].

The TLB can be moved from the critical path of L1 cache reads to DRAM reads. Since DRAM reads are already slow, the TLB doesn't have to be fast, which simplifies its design. However, it means that all data on-chip are virtually addressed. Similarly to your proposal, all processes live in a single virtual address space, but the virtual/physical translation is retained.

The security problem is solved by using a PLB (Protection Lookaside Buffer) which is placed where TLB currently is. Since protection data is only needed to occasionally trigger an exception, it's not on a critical path of L1 read. Mill also employs so called well known regions, which are similar to per-thread/per-process segments and reduce the need to use the PLB in most cases.

Since many operating systems implement memory mapping commands like Linux's mmap, removing the virtual-to-physical translation would make it very difficult to port such OSs and their applications.

[1] millcomputing.com/docs/memory/ (circa 60th minute)

   
A design without a TLB
Author: Agner Date: 2016-03-28 05:13
Ideas for preventing stack overflow:

In most cases, it is possible to calculate exactly how much stack space an application needs. The compiler knows how much stack space it has allocated in each function. We only have to make the compiler save this information. This can be accomplished in the following way. If a function A calls a function B then we want the compiler to save information about the difference between the value of the stack pointer when A is called and the stack pointer when B is called. These values can then be summed up for the whole chain of nested function calls. If function A can call both function B and function C then each branch of the call tree is analyzed and the value for the branch that uses most stack space is used. If function A is compiled separately into its own object file, then the information must be stored in the object file.

The amount of stack space that a function uses will depend on the maximum vector length if full vectors are saved on the stack. All values for required stack space are linear functions of the vector length: Stack_frame_size = Constant + Factor * Max_vector_length. Thus, there are two values to save for each function and branch: Constant and Factor. We need separate calculations for each thread and possibly also information about the number of threads.
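
As a sketch of this calculation (the data structures here are my own invention), each branch of the call tree is evaluated and the deepest branch wins:

    #include <algorithm>
    #include <vector>

    // Per-function stack information recorded by the compiler.
    struct FuncInfo {
        long constant, factor;                // frame = constant + factor * VL
        std::vector<const FuncInfo*> callees;
    };

    // Worst-case stack use of a call tree for a given maximum vector length.
    // (Recursive calls would make the tree infinite - that complication is
    // discussed below.)
    long maxStack(const FuncInfo& f, long maxVectorLength) {
        long worst = 0;
        for (const FuncInfo* c : f.callees)
            worst = std::max(worst, maxStack(*c, maxVectorLength));
        return f.constant + f.factor * maxVectorLength + worst;
    }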

The linker will add up all this information and store it in the header of the executable file. The maximum vector length is known when the program is loaded, so the loader can finish the calculations and allocate a stack of the calculated size before the program is loaded. This will prevent stack overflow and fragmentation of the stack memory. We may also store information about how many threads the program will create. Some programs will use as many threads as there are CPU cores, for optimal performance. It is not essential, though, to know how many threads will be created because each stack can be placed anywhere in memory, but it will make the memory map simpler if all thread stacks can be kept together.

In theory, it is possible to avoid the need for virtual address translation if the following four conditions are met:

  1. The required stack size can be predicted and sufficient stack space is allocated when a program is loaded and when additional threads are created.
  2. Static variables are addressed relative to the data section pointer. Multiple running instances of the same program have different values in the data section pointer.
  3. The heap manager can handle fragmented physical memory in case of heap overflow.
  4. There is sufficient memory so that no application needs to be swapped to a hard disk.

Before we rely on this mechanism, we should discuss what can possibly go wrong. Things that can cause problems are:

  • Recursive functions can use unlimited stack space. We may require that the programmer specifies a maximum recursion level in a pragma.
  • Allocation of variable-size arrays on the stack using the alloca function in C. We may require that the programmer specifies a maximum size.
  • Run-time dynamic linking. Dynamic link libraries (DLLs) are usually linked at load time and the loader will be able to include these in the calculation of stack requirements. But a program can need to load and call a DLL at run-time if the choice of DLL depends on user input or if the DLL is called from a script. We may need to guess the required stack size, perhaps based on statistics.
  • Lazy loading. A large program may have certain code units that are rarely used and loaded only when needed. Lazy loading can be useful to save memory, but it may require virtual memory translation and it may cause memory fragmentation. A straightforward solution is to implement such code units as separate executable programs, but this can complicate the exchange of data between mother program and subunits.
  • Script interpreters. Some programming languages are implemented as scripts which are interpreted at run-time rather than compiled. We cannot calculate the required stack size in advance for interpreted scripts. Obviously, it will be more efficient to compile the script if a compiler is available. Self-modifying scripts cannot be compiled.
  • User-defined macros. Macros are similar to small scripts. Depending on the implementation, macros may use heap space or stack space or both, but usually the memory requirement is limited.
  • Many programs running. The memory can become fragmented when many programs of different sizes are loaded and unloaded randomly.

A possible alternative to calculating the stack space is to measure the actual stack use the first time a program is run, and then rely on statistics to predict the stack use in subsequent runs. The same method can be used for heap space. This method is simpler, but less reliable. The calculation of stack requirements based on the compiler is sure to cover all branches of a program, while a statistical method will only include branches that have actually been used.

We may implement a hardware register that measures the stack use. This stack_measurement register is updated every time the stack grows. We can reset this stack_measurement register when a program starts and read it when the program finishes. We don't need a hardware register to measure heap size. This information can be retrieved from the heap manager.

These proposals can eliminate or reduce memory fragmentation in many cases so that we only need a relatively small memory map which can be stored in the CPU chip (Each process will have its own memory map). However, we cannot completely eliminate memory fragmentation and the need for virtual memory translation because of the complications discussed above.

   
Proposal now published
Author: Agner Date: 2016-03-22 10:51
Thank you everybody for all your inspiring comments to my "Proposal for an ideal extensible instruction set". I have now worked everything together and made a more detailed proposal. It is published at www.agner.org/optimize/instructionset.pdf

I have designed a consistent code structure where everything fits nicely. All instruction forms and addressing modes fit into the same template. All immediate constants and address offsets have power-of-2 sizes and proper alignment. The code word size is 32 bits, and each instruction can use one, two or three words. Each instruction can be coded in many different versions with different operand types, addressing modes, options and features. Simple common instructions can be packed in a tiny format with two tiny instructions stuffed into one 32-bit code word, but the 4-byte alignment of the code is maintained.

The idea of variable-length vector registers fits excellently with the design goals. The same executable program can run optimally on different microprocessors from small office computers and tablets to large scientific supercomputers with very long vector registers, without the need for separate compilation for each platform.

The instruction set has no name yet. I have considered calling it CRISC, because it combines the best from RISC and CISC. The modular format with easy detection of instruction length makes decoding simple and fast. Instructions have a moderate degree of complexity. An instruction can do multiple things, but only if it fits into the pipeline structure so that it does one thing at each pipeline stage. This will assure a throughput of one instruction per clock cycle per pipeline lane (except for division and cache misses). There is no need to split instructions into micro-operations or to use microcode. My ambition is to design a system that can outperform the best existing microprocessor designs.

My proposal includes standardization of the entire ecosystem of ABI standard, binary file format, function libraries, compiler support and OS support. With open standards for the entire ecosystem we would be able to combine different programming languages in the same program and use the same function libraries with all compilers.

Isn't that a nice vision? I am looking forward to your intelligent comments.

   
Proposal now published
Author: Hubert Lamontagne Date: 2016-03-23 20:03
Really nice doc! A question and a couple suggestions:

I kinda wonder how the OS would handle the following case:
- Load program 1 to offset 0010 0000h
- Load program 2 to offset 0020 0000h
- Load program 3 to offset 0030 0000h
- The user unexpectedly loads a 300mb file in program 2, which causes a surprise 300mb allocation on the heap. Where does the OS place this allocation?

--------

One suggestion: remove "Indexing into the register file".

Rationale:

The implementation of register file indexing on an out-of-order CPU looks like this:

- Option #1: Stall hard at the register rename stage until the indexing value becomes valid and can be read. Then read the value through a special feedback path from the register file/bypass network/data cache up to the renamer. If the index is loaded from memory, this will stall for at least 3 or 4 cycles even in the best case. Resume execution.

- Option #2: Speculatively assume that the indexed register will not be the same as any other subsequent operation. Record the previous value of every single rename in a special queue or register file after this until the indexing value can be read (through the special feedback path), or alternatively backup the whole renaming register file. When the indexing resolves, check that no reads/writes have been done to the resolved register. If there are any reads, trigger a branch prediction fail and restore the CPU state to the last known valid state.

- Option #3: Have a specialized predictor that remembers which indexed register reads/writes cause branch prediction fails. Use option #2 by default, except for indexed-register-reads/writes that trigger fails which use option #1.

Even on in-order CPUs, this is a problem because it defeats the bypass network for register reads - you can't bypass a register when you can't yet tell which register it's going to be! Having to stall multiple cycles to avoid potential hazards makes register-file indexing very costly, which defeats the purpose.

--------

2nd suggestion: Make operand size for tiny format 32bits, not 64bits (except mov). Consider trading tiny setbit/clearbit/xor/reversesubtract/andnot for arithmetic shift right and 64bit add/sub. Consider adding the option for 32bit signed indexes in address calculations or sign extending instead of zero extending 32bit operations to 64bit by default.

Rationale:

32bit integer operations are extremely common in C/C++ code, due to how the LLP64 (win64) and LP64 (OSX+Linux) models work. Almost only pointers will be 64bits in a typical program (the remaining cases are when size_t is used instead of int for loop indexes), and typically the only operations done on pointers are add, sub and compare. This also affects indexing - some_pointer[signed_int32_offset] tends to show up very often in code, and you probably don't want the compilers to add a sign extension operation every time.
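
A trivial C++ example of the pattern in question (my example):

    double sum(const double* p, int n) {
        double s = 0;
        for (int i = 0; i < n; ++i)  // 'int' is 32 bits under LP64/LLP64
            s += p[i];               // p[i] needs i sign-extended to 64 bits,
        return s;                    // unless the compiler can widen i
    }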

   
Proposal now published
Author: Agner Date: 2016-03-24 01:46
Hubert Lamontagne wrote:
- The user unexpectedly loads a 300mb file in program 2, which causes a surprise 300mb allocation on the heap. Where does the OS place this allocation?
The heap doesn't have to be contiguous - the stack does. In case the heap overflows, there are three options:
1. Allocate more heap space somewhere else. Make an extra entry in the memory map for it.
2. Same as 1. Use virtual address translation to keep it contiguous in order to make the heap manager simpler. This requires that you allocate a lot of unused virtual address space for each program as it starts. This method works also for the stack.
3. As a last resort, when memory space has become hopelessly fragmented and you are out of memory map entries. Swap the data of the least used program to disk. Reorganize the fragmented data by actually moving the values to make them contiguous in physical memory (assuming that they were already contiguous in virtual address space). This will cause an annoying delay to the user, but we already have such delays in current systems when you run out of memory. You may signal a warning to the user: Memory low, please close some programs.

One suggestion: remove "Indexing into the register file".
Thank you for pointing out the difficulties here. Several people have proposed things like a reconfigurable register space with arbitrary allocation of vectors in this space, and I thought that a register index was simpler.

I want to avoid complex instructions for saving and restoring all or many registers. Any suggestions for an alternative to writing 64 consecutive save instructions or having a complex microcoded save-all instruction?

2nd suggestion: Make operand size for tiny format 32bits, not 64bits (except mov). Consider trading tiny setbit/clearbit/xor/reversesubtract/andnot for arithmetic shift right and 64bit add/sub. Consider adding the option for 32bit signed indexes in address calculations or sign extending instead of zero extending 32bit operations to 64bit by default.
I am not sure if you are trying to save power by doing a 32-bit addition instead of 64 bits. The hardware may actually disable the upper part of the heavy carry-lookahead circuit if it can detect quickly that the values are small. Otherwise, I don't see the problem of using 64-bit addition on 32-bit values. The value will simply be truncated if it is later used in a 32-bit (not-tiny) instruction. The compiler will recognize the need to sign-extend a 32-bit signed index or optimize the program by replacing it with a 64-bit index variable. Or you may use an unsigned 32-bit index and use the carry flag for detecting end of loop if the index is counting down. I would rather keep array indexes 64-bit because setting a 2 GB size limit to arrays is a problem for the programming language standard, and the value may cross zero when a count-down loop ends.
   
Proposal now published
Author: Hubert Lamontagne Date: 2016-03-24 12:20
| 1. Allocate more heap space somewhere else. Make an extra entry in the memory map for it.

That would work, although then the memory map would potentially need to be somewhat complex and would probably
have to be cached - this is almost a TLB already; the only missing feature is the ability to remap memory blocks.

| 2. Same as 1. Use virtual address translation to keep it contiguous in order to make the heap manager simpler.
| This requires that you allocate a lot of unused virtual address space for each program as it starts. This method
| works also for the stack.

This is exactly what TLBs do!

| 3. As a last resort, when memory space has become hopelessly fragmented and you are out of memory map entries.
| Swap the data of the least used program to disk. Reorganize the fragmented data by actually moving the values to
| make them contiguous in physical memory (assuming that they were already contiguous in virtual address space).
| This will cause an annoying delay to the user, but we already have such delays in current systems when you run out
| of memory. You may signal a warning to the user: Memory low, please close some programs.

Amazingly, this approach exists on real hardware: pre-OSX Mac OS works this way. It has a memory compactor and
all memory allocations must use memory handles instead of pointers. When RAM runs out, the OS recopies the whole
RAM, removes all dead allocations and updates all memory handle addresses. The reason for this is that the first
Macs had no MMU (!) so this was the only approach that could work, but this created long lasting damage to the
platform: there was a whole system for locking down handles (when doing system calls and so forth), and the whole
thing was only fixed with OSX. 16 bit Windows also has this issue (they figured out a way to fix it in Win32).

Java also typically has memory compaction and doesn't require an MMU, but it pays a price for it: it has unavoidable
garbage collector pauses, often taking upwards of 100ms (which is one of the reasons why Minecraft is a bit jerky).
Then again, Java is typically used for server software and enterprise/government stuff, where pauses don't really matter
as much as dealing with second rate programmers that can't use malloc()/free() properly.

------

| I want to avoid complex instructions for saving and restoring all or many registers. Any suggestions for an alternative to
| writing 64 consecutive save instructions or having a complex microcoded save-all instruction?

For general purpose registers, you're going to do 32 consecutive loads/stores so it's always going to take 'some' time.
I'd suggest mandating 16-byte stack alignment in the ABI, and having store-dual and load-dual instructions that can
save/restore two registers in a single whole 128bit memory access. This is what ARM64 does. ARM already has to deal
with other instructions with 3 sources or 2 destinations anyways.

Other RISCs simply use individual instructions for each register, because normally you do register saving/restoring
when doing branches and interrupts and other moderately slow things, so the pipeline is likely to stall for all sorts of
other reasons (branch prediction miss, data cache miss, instruction cache miss, chains of indirect loads, having all
downstream operations depend on a single load so that the whole pipeline is latency-limited, etc), so the cost of
saving/restoring the regfile is lost in the wash.

For vector register save/restore, you're dealing with large vector registers which is going to take multiple memory cycles,
so it almost doesn't matter how you do it! The cost of keeping the data cache busy for multiple cycles for each vector
register load/store completely dwarfs the cost of issuing 32 loads/stores instead of less. It's also possible to have
instructions that load/store to multiple contiguous SIMD registers - Arm NEON has tons of this, including strange stuff
like interleaved loads.

------

| I am not sure if you are trying to save power by doing a 32-bit addition instead of 64 bits. The hardware may actually
| disable the upper part of the heavy carry-lookahead circuit if it can detect quickly that the values are small. Otherwise,
| I don't see the problem of using 64-bit addition on 32-bit values. The value will simply be truncated if it is later used
| in a 32-bit (not-tiny) instruction. The compiler will recognize the need to sign-extend a 32-bit signed index or optimize
| the program by replacing it with a 64-bit index variable. Or you may use an unsigned 32-bit index and use the carry
| flag for detecting end of loop if the index is counting down. I would rather keep array indexes 64-bit because setting
| a 2 GB size limit to arrays is a problem for the programming language standard, and the value may cross zero when
| a count-down loop ends.

I'm not suggesting it for power-saving reasons, I'm suggesting it for C++ compatibility reasons!

You're right: 64bit MOV, ADD, SUB, AND, OR, XOR, SHL can stand in for 32bit versions. This does not work for SHR,
arithmetic SAR and comparisons though. MUL is a bit of a wash - 64bit MUL can stand in for 32bit MUL, BUT 64bit
MUL is kinda slow to implement in hardware due to the large number of bits involved and large number of partial sums,
so a 32bit MUL still makes sense. This is why ARM ended up with a 16x16->32 MUL in its instruction set - it's strictly
there because it can run faster than 32x32->32 MUL.

The compiler will probably have to add an instruction to sign-extend all 32bit indexes. It cannot replace 32bit variables
with 64bit variables because that changes behavior - instead of 'int' being "32bits", it becomes "32bits except if the
compiler decided to make it 64bits but it can still revert to 32bits at any time depending on if the value is in a register
or in RAM". Technically you could make int 64bit, and some early 64bit cpus did this, but this is bad because it eats
twice as much data cache for no good reason.

IRL, >2GB arrays are as rare as hen's teeth and if they do happen, they probably already need complex management
for other reasons. This is not really due to constraints of current memory sizes... it's more of a result of the fact that
you practically never need "2 billion of something".

   
Proposal now published
Author: Agner Date: 2016-03-24 14:50
Thanks for your comments. I know that I can't completely get rid of memory fragmentation and virtual address translation. I am just hoping to keep the number of memory sections and fragments so small that we can get rid of the fixed size memory pages and have a limited number of variable size memory blocks instead.

The instruction set makes sure that the data segment can be placed anywhere. It doesn't have to be adjacent to the code segment. Another thing is getting rid of the many DLLs or shared objects, each with their own code and data segments, and instead joining all DLL code segments into one.

Hubert Lamontagne wrote:

Any suggestions for an alternative to writing 64 consecutive save instructions or having a complex microcoded save-all instruction?
For general purpose registers, you're going to do 32 consecutive loads/stores so it's always going to take 'some' time.
I'd suggest mandating 16-byte stack alignment in the ABI, and having store-dual and load-dual instructions that can save/restore two registers in a single whole 128bit memory access.
I did consider 16-byte stack alignment for the sake of better alignment of vectors. But it will waste stack space on every call. Double push/pop instructions would have to be optional because I am trying to keep complexity down.

One solution is to have "save register and increment pointer" instructions in tiny form for both integer registers and full length vectors. Same for restore. Then you can save everything with 64 tiny instructions = 128 bytes of code.

About 32-bit index.

The compiler will probably have to add an instruction to sign-extend all 32bit indexes. It cannot replace 32bit variables with 64bit variables because that changes behavior - instead of 'int' being "32bits", it becomes "32bits except if the compiler decided to make it 64bits but it can still revert to 32bits at any time depending on if the value is in a register.
I just made an experiment to see what compilers actually do with a signed 32-bit index variable in 64-bit mode (x86-64). Both the MS and Gnu compilers simply replaced my 32-bit integer with a 64-bit integer. I don't think it is worth the effort to make a special addressing mode with sign-extended 32-bit index.
   
Proposal now published
Author: Hubert Lamontagne Date: 2016-03-24 18:45
| I am just hoping to keep the number of memory sections and fragments so small that we can get rid of the fixed size
| memory pages and have a limited number of variable size memory blocks instead.

I'm really curious about how the hardware would be able to do that mapping in a very short time (ideally in a single
cycle, and in a way that the L1 cache data lines that get read don't depend on virtual address translation).

------

| Double push/pop instructions would have to be optional because I am trying to keep complexity down.

I feel that your proposal isn't that aggressive in terms of keeping complexity down (compared to a MIPS). It has many
other parts that would be more complex or would require microcode, so I'm not sure that the extra complexity in that
one place would be that salient.


| One solution is to have "save register and increment pointer" instructions in tiny form for both integer registers and full
| length vectors. Same for restore. Then you can save everything with 64 tiny instructions = 128 bytes of code.

Well, "load register and increment pointer" writes to 2 registers so it's similar in complexity to "load dual". Squeezing
a few very common types of memory loads into tiny instructions still makes sense though.

------

| I just made an experiment to see what compilers actually do with a signed 32-bit index variable in 64-bit mode (x86-64).
| Both Ms and Gnu compilers simply replaced my 32-bit integer with a 64-bit integer. I don't think it is worth the effort to
| make a special addressing mode with sign-extended 32-bit index.

I just checked msvc's x64 output, and I admit I'm very spooked. It does stuff like loading an index into eax, then using
rax as an array index, even though the variable is clearly signed in the code. Some code locations seem to use movsx for
sign extension, but most don't. The compiler seems to guess when the variable can never go negative, somehow. In one
place, it seems to convert from eax to rax near the beginning of a function, then use rax throughout the body of the
function. Maybe it's using the part of the C/C++ standard that says that integer overflow is "undefined".

This also affects Java (which strongly defines "int" as 32bits and signed, and all its operations as 32bit signed operations)
and there is at least one paper about figuring out how to remove as many sign extensions as possible in 64bit mode.

   
Proposal now published
Author: Agner Date: 2016-03-25 05:04
Hubert Lamontagne wrote:
I'm really curious about how the hardware would be able to do that mapping in a very short time (ideally in a single
cycle, and in a way that the L1 cache data lines that get read don't depend on virtual address translation).
The virtual address translation is just an adder in the memory map that I am envisaging, rather than the multi-level table lookup of a traditional TLB. You may put virtual address translation after the L1 cache to make cache access faster.

The memory map is saved and restored on a task switch since there will be a separate memory map for each process.
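
To make the comparison concrete, one entry in the memory map I am envisaging could look roughly like this (the field names and sizes are only a guess for illustration):

    #include <cstdint>

    struct MemoryMapEntry {
        std::uint64_t virtualBase; // start of a variable-size block
        std::uint64_t size;        // variable size, no fixed page granularity
        std::int64_t  addend;      // physical = virtual + addend (the adder)
        std::uint8_t  type;        // TEXT, CONST, DATA+BSS, STACK, heap, ...
        std::uint8_t  access;      // read/write/execute permission bits
    };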

It has many other parts that would be more complex or would require microcode.
I hope not. I would rather have more complexity in the pipeline and perhaps dedicated state machines for things like interrupts and system calls rather than using microcode. Microcode seems to be incredibly slow in the processors I have tested, though I don't know exactly why.

Well, "load register and increment pointer" writes to 2 registers so it's similar in complexity to "load dual".
Pop dual will write 3 registers, including the stack pointer. I think a fixed limit of two output registers is fair. We need that for flags output anyway.

I just checked msvc's x64 output, and I'm admit I'm very spooked. It does stuff like loading an index into eax, then using rax as an array index, even though the variable is clearly signed in the code. Some code locations seem to use movsx for sign extension, but most don't.
In my experiment, the MS compiler sign-extended the index outside a loop, the Gnu compiler used zero-extension. Gcc is (in-)famous for interpreting standards in a very pedantic way. There is probably some C standard saying that a negative index to a pointer or array is undefined.
   
Proposal now published
Author: Hubert Lamontagne Date: 2016-03-28 19:20
The multi-level TLB looks complex, but it has fewer corner cases than having an extra adder in the address translation: you can run the TLB lookup in parallel with the L1 cache lookup as long as each cache way is no larger than a page, so the index bits fall within the page offset - this is why CPUs with a 2-way 8kb L1 (4kb per way) use 4kb MMU pages on x86!

For task switching, it's probably easier to change the master page table address (CR3 on x86) and clear the tlb than to do a real task switch. Or alternatively, you can have a tlb tagged per process and then you just need to change the page table base address + process ID register!

The way MIPS stays simple is that they disallow any sort of multi step situational thing with lots of changing registers and multiple memory accesses... No push/pop, no call/return, no automatic loading of selector offsets and segment size like on 286, and especially no task switch instruction. They even have a software tlb! That's also why it has the separate multiply result register - not for the high/low 32x32->64 thing but rather to avoid dealing with the really long latency result.

This is also why RISC-V is designed that way: the idea is that by having no multi-result instructions, but having higher throughput because it's easier to make a CPU with out-of-order execution and lots of execution units, you still come out ahead in overall speed.

Push/pop has the disadvantage of reupdating the stack pointer every time, instead of once for the whole group of loads/stores (the MIPS way). So I was suggesting dual load/store, not dual push/pop.

I think the voodoo GCC uses is that the C++ specs say that overflowing integers are "undefined" - if you promote your int32s to int64 but never overflow them, the result is the same!

   
Proposal now published
Author: Agner Date: 2016-03-29 02:11
Hubert Lamontagne wrote:
The way MIPS stays simple is that they disallow any sort of multi step situational thing with lots of changing registers and multiple memory accesses... No push/pop, no call/return, no automatic loading of selector offsets and segment size like on 286, and especially no task switch instruction.
This is also why RISC-V is designed that way: the idea is that by having no multi result instructions, but having higher throughput because it's easier to make a CPU with out of order execution and lots of execution units, you still come ahead in overall speed.
Why is it so much more difficult to have two-result instructions? We need two-result instructions for: add-with-carry, overflow detection, pop register, return, read-and-increment-pointer, ALU-and-conditional-jump.
If we want an efficient implementation of add-with-carry with a single register, we may have an extra bit in the register for this purpose. Or we may use every second element in a vector for carry bit, at the cost of getting half the work done. The same for overflow detection.

Function calling also becomes complicated if we do not allow multiple results. The call of a non-leaf function will be: (1) copy instruction pointer to link register, (2) jump to target, (3) subtract from SP, (4) save link register on stack. And the return will be: (5) recall link register from stack, (6) add to SP, (7) jump to link register. That is seven instructions instead of two. Does the increase in speed really make up for that?

   
Proposal now published
Author: Hubert Lamontagne Date: 2016-03-30 02:46
Agner wrote:
Hubert Lamontagne wrote:
The way MIPS stays simple is that they disallow any sort of multi step situational thing with lots of changing registers and multiple memory accesses... No push/pop, no call/return, no automatic loading of selector offsets and segment size like on 286, and especially no task switch instruction.
This is also why RISC-V is designed that way: the idea is that by having no multi result instructions, but having higher throughput because it's easier to make a CPU with out of order execution and lots of execution units, you still come ahead in overall speed.
Why is it so much more difficult to have two-result instructions?
I admit, it's not that much more difficult to have two-result instructions. But it has a cost: a 4-issue CPU ends
up potentially writing to 8 registers per cycle. This means that you need a register file with 8 write ports, which
takes up more space and has higher latency. Your register renamer also needs 8 write ports instead of 4,
and the potential number of conflict scenarios that have to be broken down at the issue stage goes up. If you
have a Pentium Pro-style pipeline where results are committed to a permanent register file in order, that file also
goes up from 4 write ports to 8. If you have an R10000-style pipeline where the permanent register file is merged
with the renaming registers, then you have a different but similar problem: you have to queue 8 now-reusable
registers to the register renamer per cycle instead of 4.

You could also limit the number of instructions you issue if they take up too many write ports - for instance, if
you have 6 write ports, you can check on each cycle if you're going to use up too many write ports and only
issue 3 instructions to prevent the 7-or-8-writes-on-a-single-cycle scenario described above. But then you
need more arbitration circuitry and your potential benefit from multi-result instructions goes down.

I guess it all comes down to what your limiting factor is:
- If your limiting factor is issue width (how many different instructions per cycle you can issue), then multi-result
instructions are good because you're doing more work with your few available instructions. x86 tends to
fall in this case due to the whole instruction-length business, and good x86 designs tend to do lots of work
from few instructions (AMD's 3-issue Athlon is a perfect example of this - and a good example of "fast CISC").
- If your limiting factor is register-file and rename ports, then multi-result instructions are bad because they
won't be faster than multiple-instruction sequences and they make the pipeline more complex overall (since
the multiple results have to be committed together and so forth).
- If your limiting factor is L1 cache read and write ports, then it's all a wash.

We need two-result instructions for: add-with-carry,
I don't think ADC is used often enough to warrant being included in a general-purpose instruction set. It makes
perfect sense for 8-bit and 16-bit processors, but on 32-bit processors you rarely - if ever - do 64-bit calculations.
This counts double for 64-bit processors: ADC only ever appears if you want to do 128-bit calculations (or larger),
which is even less common, and it only saves 2 instructions and 1 cycle of latency over the equivalent MIPS sequence
(using a compare-and-set-to-1-if-the-result-is-smaller-than-a-source, i.e. SLTU, to generate the carry and adding it in separately).
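
Here is that MIPS-style sequence sketched in portable C++ (add128 and its two-word operand layout are my invention for illustration): the carry comes from comparing the result with a source, then gets added in separately:

#include <cstdint>

void add128(uint64_t r[2], const uint64_t a[2], const uint64_t b[2]) {
    uint64_t lo = a[0] + b[0];    // low-word add
    uint64_t carry = lo < a[0];   // set-to-1-if-smaller generates the carry (SLTU)
    r[1] = a[1] + b[1] + carry;   // high-word add, carry added in separately
    r[0] = lo;
}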

overflow detection,
Careful use of comparison instructions handles this case adequately, as far as I can tell. For instance, when doing
unsigned addition, you only have to compare the result with a source operand: if the result is smaller, you have a
100% guaranteed wrap. Or, since most integer operations happen on 32-bit ints, you can generate an oversized
64-bit result and check for overflow separately. Furthermore, these overflow checks happen outside of the critical
path, so unless the instruction stream is saturating the CPU's ALUs, these checks are basically free.
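
Both checks, sketched in C++ (the function names are mine; the narrowing cast in the second check is the usual does-it-fit-back test):

#include <cstdint>

bool uadd_wraps(uint32_t a, uint32_t b) {
    return a + b < a;               // wrapped iff the result is smaller than a source
}

bool sadd32_overflows(int32_t a, int32_t b) {
    int64_t wide = (int64_t)a + b;  // oversized 64-bit result cannot overflow
    return wide != (int32_t)wide;   // overflow iff it no longer fits in 32 bits
}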

The other option is to generate trap interrupts on overflows, but this generally can't be used in high-level-language
interpreters (too coarse-grained, hard to recover from an interrupt), or in C++ (no support, and programmers expect
int32_t calculations to wrap), and it's not particularly useful in ASM either (ASM being more speed-oriented than security-oriented).

pop register,
Ok, this one is actually fairly common in general purpose code. The MIPS equivalent is pretty okay too though: load
register + increment sp. This is especially OK if you pop multiple registers at the same time, in which case all the
sp increments can be combined together into one large increment at the end and the CPU doesn't have to keep
track of the intermediate values of sp. It generates a sequence like this:
ld r4 [sp + #0]
ld r5 [sp + #4]
ld r6 [sp + #8]
ld r7 [sp + #12]
add sp #16

On out-of-order CPUs, this might run faster than 4 pop instructions because it typically generates 5 micro-ops
instead of 8. Other times, the execution speed of that sequence is limited by L1 data cache ports so there's no
speed difference.

If you only have to pop a single value, then the splitting hurts more, but then your function is likely to be inlinable or
a leaf-function.

return
Return falls into more or less the same case as pop register: it's fairly common, enough to warrant special
consideration, but the MIPS sequence (ld r31 [sp + #x], add sp #4, jmp r31) also handles it well (since
loading/storing the link register can be combined with the rest of the loads/stores to the stack, so it often
doesn't generate any extra sp updates). Doing it with a single complex instruction rather than multiple simple
ones often doesn't generate much of a win, since the number of overall memory operations and state changes
doesn't change ('return' afaik is always a 2 or 3 micro-op instruction on x86).

read-and-increment-pointer,
That case is similar to pop register: on one hand, you get two results for the price of one if your front-end can
only generate a limited number of instructions (which is why ARM has this instruction), but on the other hand,
separating reading and pointer-increment lets you combine a whole bunch of updates to the same pointer
together, which is often good since it reduces the number of intermediary results for the pointer value (and
removes the false dependency between multiple consecutive reads to the same incremented pointer).

If you look at later ARM CPUs, read-and-increment instructions often have speed penalties since the
underlying CPU can't really handle the extra generated values (too few register write ports etc), so there's
often very little gain over using separate load and increment instructions.

ALU-and-conditional-jump.
That one is more interesting because it's not really a multiple-result instruction... The ALU result and jump
go to different parts of the retirement unit (regfile writeback and branching respectively), which is why
combined compare+branch appear in many RISC instruction sets (including MIPS).

If we want an efficient implementation of add-with-carry with a single register, we may have an extra bit in the register for this purpose. Or we may use every second element in a vector for carry bit, at the cost of getting half the work done. The same for overflow detection.

Function calling also becomes complicated if we do not allow multiple results. The call of a non-leaf function will be: (1) copy instruction pointer to link register, (2) jump to target, (3) subtract from SP, (4) save link register on stack. And the return will be: (5) recall link register from stack, (6) add to SP, (7) jump to link register. That is seven instructions instead of two. Does the increase in speed really make up for that?

Steps (1) and (2) are typically a single instruction (since one retires to the regfile and one retires to the
branch unit), so that's not a problem. Function calls are likely to generate a whole bunch of memory
loads/stores - typically to object member variables in C++ - so it's very likely there will be a free pipeline
ALU slot for the SP updates (3) and (6), and as stated above you only need a single SP update no
matter how many registers you're loading/storing. If a cache miss or branch misprediction happens
(which is probably most likely near function starts and ends), then you'll likely have dozens of free ALU
cycles that you can use to deal with SP. Another common case is that function calls/returns often have
series of instructions limited by data cache latency (ex: loading a pointer, then using it to read from RAM),
in which case you also have many 'free' ALU cycles.
   
Proposal now published
Author: Agner Date: 2016-03-30 11:31
Hubert Lamontagne wrote:
it all comes down to what's your limiting factor:
- If your limiting factor is issue width (how many different instructions per cycle you can issue), then multi-result
instructions are good because you're doing more work out of your few available instructions. x86 tends to
fall in this case due to the whole instruction length business, and good x86 designs tend to do lots of work
from few instructions (AMD's 3-issue Athlon is a perfect example of this - and a good example of "fast CISC").
- If your limiting factor is register-file and rename ports, then multi-result instructions are bad because they
won't be faster than multiple instruction sequences and they make the pipeline more complex overall (since
the multiple results have to be committed together and so forth).
- If your limiting factor is L1 cache read and write ports, then it's all a wash.
I believe the limiting factor will most likely be cache bandwidth and memory bandwidth. That's why I want moderately complex and compact instructions.

We need two-result instructions for: add-with-carry,
I don't think ADC is used often enough to warrant being included in a general purpose instruction set.
Add-with-carry is used mainly for high-precision math. That typically means big-number-crunching algorithms where performance is critical.
overflow detection,
Careful use of comparison instructions handle this case adequately, as far as I can tell. For instance, when doing unsigned addition, you only have to compare the result with a source operand.
Overflow detection with signed integers is a mess.
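For example, a portable signed check has to dodge the overflow itself, because the overflowing add would be undefined behaviour (a sketch; GCC and Clang provide __builtin_add_overflow to hide exactly this):

#include <cstdint>
#include <limits>

bool signed_add_overflows(int32_t a, int32_t b) {
    if (b > 0) return a > std::numeric_limits<int32_t>::max() - b;
    return a < std::numeric_limits<int32_t>::min() - b;   // b <= 0 case
}
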
pop register,
Ok, this one is actually fairly common in general purpose code.
Well, I have changed my opinion on this one :-)
Push and pop will rarely be used in critical parts of the code if my ABI proposals are met. We don't need push for function parameters because we have assigned 16 integer registers and 16 FP/vector registers for function parameters. We could even assign more registers, if needed. Push and pop for register spilling can be kept out of the critical innermost loops and functions if we implement my idea that the register use of functions should be reported in object files.

The problem with push and pop is that a push or pop that is waiting for an operand can delay all subsequent stack operations unless the instruction is split into micro-operations or a special stack prediction mechanism is implemented. So I think that we don't need push and pop instructions at all, but I still want to have call and return instructions that use the stack (for the reasons I have explained in my document). A direct call and return cannot delay subsequent stack operations, but an indirect call can, if it is not split into micro-operations.

read-and-increment-pointer,
That case is similar to pop register: on one hand, you get two results for the price of one if your front-end can
only generate a limited number of instructions (which is why ARM has this instruction), but on the other hand,
separating reading and pointer-increment lets you combine a whole bunch of updates to the same pointer
together, which is often good since it reduces the number of intermediary results for the pointer value
The only reason I want read-and-increment-pointer is to make it easy to save all registers. A "read-and-increment-pointer" can fit into a tiny instruction, while you will need a doubleword (64 bit) size instruction to save or restore integer registers with a non-moving pointer, and you will need two instructions to save or restore each variable-size vector register. So this is for compactness, not for speed. It may be bad for the speed if a compiler uses the read-and-increment-pointer instruction for ordinary loops, unless the instruction is split into micro-operations. Other forms of loop will be faster than read-and-increment-pointer.
   
Do we need instructions with two outputs?
Author: Agner Date: 2016-03-31 01:55
An alternative to two register outputs for add-with-carry is to have extra carry bits in the register. We could, for example, have one extra flag bit for each 32 bits of vector registers. The extra bit can be used for carry, overflow, masks, etc. In this way we can implement add-with-carry with just two inputs and one output. Nobody needs add-with-carry for 8-bit and 16-bit integers. Maybe it is worth the cost of having 3% more bits of vector storage if we can avoid instructions with two outputs? But now we have a problem with saving the vector to memory. We will need two different save instructions - with and without the extra flag bits. This will cause complications in the compiler. And, do we need an extra bit on integer registers as well for the sake of orthogonality?

We can live without pop instructions. If we can handle the problems of an extra flag bit, then the only need for instructions with two outputs is read-and-increment-pointer. Without this, how do we save all registers in a task switch in an efficient and compact way? The extra bits will make it still more complicated to save registers for task switch. Any suggestions?

BTW, if we can accept extra bits in a vector register, we can also have extra bits for vector length. This will remove the input dependency on a vector-length register.

   
Do we need instructions with two outputs?
Author: Hubert Lamontagne Date: 2016-04-01 02:46
I've thought about the "assigning a flag register to each register" thing a bit, and its problem is not only the saving/restoring by the OS on interrupts, but also that it makes all callee-saved registers in the calling convention doubly expensive - because the function not only has to save the register value, but also its matching flags. You get a conflict between the principle that "flags aren't saved on function calls" (because they get wiped too easily and you pretty much never want to preserve them anyways) and the principle that "flags should have the same calling convention as their matching registers" (since they presumably come from the same register rename and update together).

It also does all sorts of weird things, like change 'nops' into different operations (mov r0, r0; add r0, #0; and r0, #-1; and the st r0, [sp+0] + ld r0, [sp+0] sequence all become different due to flag effects). Mov preserves flags (so that you can make 'mov' happen 'for free' by doing it at the register rename stage) but mov immediate doesn't. This is bad for late compiler optimization passes (because they have to take flags into account). So I think having flags be a part of value registers creates more problems than it solves.

If bignums are really important and we need fast ADC for stuff like RSA encryption, I suggest we make ADC/SBC vector-unit only, and have specific flags registers attached to the vector unit (to reduce the number of dependency paths from the SIMD unit to the main unit). Also, I'd separate the flags that are a result of SIMD operations (i.e. carry) from the flags that control SIMD operations (zeroing, denormal-zeroing, etc), so that SIMD operations that update flags can simply wipe the previous value - partial flag register updates are bad since they require a separate flag-rename engine for every part that doesn't update together!

The exception to this is vector length (for variable vector length), which has to be on the integer-scalar unit because it has to be known by load/store instructions.

-

Ideally, it's best if SIMD instructions cannot cause interrupts and can't affect program flow. On most architectures, SIMD operations execute on a different unit with a different issue queue, so it's best if non-SIMD and SIMD operations can be separated as quickly and easily as possible - basically right after determining instruction size, since the operations go to different queues and compete for different register ports and so forth. In theory, you could even design a CPU with a separate IP and instruction cache for the SIMD unit, and do all SIMD loads/stores through a queue (the PS3's Cell is sorta like this, in a way).

For instance, on the ARM Cortex-A8, NEON/FPU instructions literally CANNOT cause interrupts, since the SIMD and non-SIMD results don't even commit at the same time: by the time a SIMD instruction executes, the results of subsequent non-SIMD instructions have already been committed and cannot be undone, so the SIMD instruction (and everything after it) has to be 100% sure to run. The benefit of this is that the non-SIMD commit unit doesn't even have to know that the SIMD unit exists, except for receiving store values in a queue, and the SIMD unit likewise knows nothing about the non-SIMD unit except that instructions, values loaded from memory, and values forwarded from GPRs arrive in a queue, and that store values go in a queue.

x86 enforces this less strongly (committing has to be synchronized between the general-purpose unit and the SIMD unit), but even then, there's a reason why, on the Athlon, COMISS (SSE float compare that updates the main unit's flags register) runs on the Vector Path (microcode!) - and it's not because AMD engineers thought people wouldn't use COMISS.

Basically, I don't think orthogonality between non-SIMD instructions and SIMD instructions is a good idea, since they have different goals: non-SIMD instructions have to have as few weird side effects as possible and retire as fast as possible, so that they can be renamed and reordered and jumbled and rolled-back if prediction has failed (or a load/store caused a page fault). SIMD instructions just have to do as much math as possible per cycle - they're all about throughput, so it doesn't matter if they take 4 or 5 cycles to complete, if they can't be reordered and so forth - which is why VLIW is popular in cpus designed for DSP (they don't have to run C++!).

SIMD-oriented code also tends to consist of simple short loops, so I don't think it really has to be particularly compact. Also, the memory bandwidth used for instructions will probably be totally dwarfed by the memory bandwidth used for data in SIMD code anyways.

-

I don't think saving/restoring the register file on task switch using 32 consecutive loads/stores (+sp update) is THAT much of a problem because task switches cause other much slower side effects - for instance, you're likely to get a whole bunch of instruction cache misses and data cache misses and TLB evictions and TLB misses and branch prediction misses and the cache prefetcher getting confused - those are many times more costly.

To handle interrupts, you do need a few scratchpad registers that are only accessible to the OS, for saving the previous values of SP + IP + OS/user mode + a couple extra registers. This is to get work space to "bootstrap" the interrupt handler state save/restore. Early MIPS had the problem that it didn't really have those system-reserved special registers, so you unfortunately lost a couple of general purpose registers instead.

You also probably need the hardware TLB to have different memory mappings for OS and user and switch automatically between those. Another way to deal with this is having a few banked registers (typically SP and the last few registers - just enough to initiate state saving). Even though this makes interrupt handler prologues kinda ugly, it also removes the need for microcoded interrupt handlers (which are often somewhat bypassed by the OS anyways).

   
Do we need instructions with two outputs?
Author: Agner Date: 2016-04-01 03:49
Thank you for your comments. It is nice to have sparring partners to discuss with. We are really getting somewhere.

I think there are many advantages to storing the vector length in the vector register itself. This makes it much cheaper to save callee-save registers: you only have to save the part of the register that is actually used. I plan to make special instructions for save/restore. These instructions will save or restore the length and as much of the data as is actually used. The save instruction can increment a pointer by the actual amount of space used. The restore instruction will have to be followed by an extra instruction to increment or decrement a pointer by the amount of space used. The same instruction can be used to adjust the stack pointer before saving. This will save a lot of data cache space. If a very long vector register actually contains only a floating point scalar, then we need not save any more than that. If the vector register is unused, which will probably happen more than 50% of the time, then we only need to save the length, which is zero. Also, we don't need a complicated lazy save mechanism for task switches. And we get one less input dependence because we don't need a separate register for vector length.

The disadvantage is, of course, that the compiler needs to distinguish between saving the data of a vector and saving everything for a later complete restore. Once we have this distinction, however, there is little extra cost to also saving a few carry flags. You want to keep the carry bits separate from the control flags. I agree. My idea is to use the carry bits for overflow detection (so that we can avoid interrupts in vector instructions), and use a flags register, as already present in my initial proposal, for masking and for a lot of option bits that we don't have room for in the instruction code. A few of these option bits will determine the use of the carry flag, i.e. whether it is unused or it detects signed or unsigned integer overflow, floating point overflow or invalid operations or both, and whether it is propagated from input vector operands to output operands. All of these features are only activated if the software actually has a try/catch clause to detect overflows. Otherwise, the compiler doesn't need to care. But you want to avoid instructions with two outputs, and this is the best solution to that problem I can think of :-) We can separate the carry/overflow bits from the control flags without generating any extra output dependence.

We can have an instruction to extract the carry bits to an integer bitfield. This will make it possible to do a fast carry-lookahead for a BIGNUM addition in a large vector register with just a few integer instructions.

I agree that integer registers should not have these extra bits.

The idea of having separate registers for system code only, sounds good. Would eight 64-bit registers be a suitable number?

I have fiddled a little with the instruction format. The coding gets simpler when we get rid of the vector length register and a few option bits. I think this makes it possible to make the instructions shorter so that we can avoid triple-size instructions. Of course, we will have to use two instructions for loading a 64-bit immediate value then, but I think we can live with that.

   
Do we need instructions with two outputs?
Author: Joe Duarte Date: 2016-04-02 02:09
Hubert and Agner, I have a related but slightly tangential question, since we're talking about TLBs.

In our applications, within a virtual memory space, why do we still use conventional -- and enormous -- memory addresses?

We use 64-bit addresses (or 48 actual bits, I think, on x64). Since it's a program's exclusive virtual memory space, a universe of our own, why can't we use arbitrary and much, much smaller addresses? Like 1, 2, 3, etc., basically a byte or two. If the TLB is going to have to translate anyway, is it a problem to translate small addresses into conventional physical ones? Processes could have unique IDs - one byte would be ample most of the time - so the TLB would have a unique identifier for all addresses (process ID + the process's internal small memory addresses), and it would often take only two or three bytes total.

You could do a variety of things with pages and their sizes in that scenario. What am I missing here? I usually stick to managed languages.

Separately, I think it would be fruitful to think about how an ISA could be designed to help JITs, and common parsing workloads like the web. If we designed an ISA to increase the performance of browser JS engines, .NET and Java VMs, etc., what would that look like? Would it be instruction selection that mattered most, or other aspects of the architecture? Intel rolled out some great instructions for string parsing in SSE 4.2, but my impression is that hardly any developers know about them or use them (I couldn't find them in nginx or Apache source, or the browsers I checked, and they could really benefit.) That raises a separate issue in getting developers to pay attention to ISAs and optimization wins...

   
Do we need instructions with two outputs?
Author: Agner Date: 2016-04-02 03:03
Joe Duarte wrote:
why do we still use conventional -- and enormous -- memory addresses?

We use 64-bit (or 48 actual I think on x64). Since it's a program's exclusive virtual memory space, a universe of our own, why can't we use arbitrary and much, much smaller addresses? Like 1, 2, 3, etc., basically a byte or two.

Most operating systems have now switched to 64-bit addresses. It is true that most applications can do with a private 32-bit address space, but not all. A video editing program, for example, may need more than 4 gigabytes of data, and future needs may be still more. Better have 64-bit addresses to fit future needs than using complicated memory bank swapping and the like.

Separately, I think it would be fruitful to think about how an ISA could be designed to help JITs, and common parsing workloads like the web. If we designed an ISA to increase the performance of browser JS engines, .NET and Java VMs, etc., what would that look like? Would it be instruction selection that mattered most, or other aspects of the architecture?
I don't know if a JIT compiler needs anything special. Maybe string compare, but we can do that with vectors of 8-bit elements (this will work with UTF-8 strings). Anything else you have in mind for JIT compilers?

My proposal includes variable-length vector registers that enable the software to adapt automatically to the different vector sizes of different processors without recompiling. If one compiled executable file fits all variants of the processor, why do we need JIT compilers at all?

Interpreters for byte code languages have a lot of jump tables or tables of function pointers. My proposal has an instruction for efficient handling of jump/call tables of 32-bit pointers relative to an arbitrary reference point.
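
In C++ terms, the dispatch looks something like this (a sketch with invented names; my proposal would fold the table load, the reference+offset add, and the indirect jump together, and the data-to-function-pointer cast shown here is implementation-defined but works on mainstream platforms):

#include <cstdint>

typedef void (*Handler)();
extern const int32_t offset_table[];  // hypothetical: 32-bit offsets, half the size of full pointers
extern const char reference_point[];  // hypothetical: an arbitrary reference symbol

void dispatch(uint32_t opcode) {
    Handler h = (Handler)(reference_point + offset_table[opcode]);
    h();
}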

Intel rolled out some great instructions for string parsing in SSE 4.2, but my impression is that hardly any developers know about them or use them
The SSE4.2 instructions are ingenious, but very complicated for programmers to use. Most text strings intended for human reading are not so long that the speed of text processing really matters. The SSE4.2 instructions may be useful for other purposes, e.g. DNA sequence analysis.
   
Do we need instructions with two outputs?
Author: Joe Duarte Date: 2016-04-02 17:09
Agner wrote:
Most operating systems have now switched to 64-bit addresses. It is true that most applications can do with a private 32-bit address space, but not all. A video editing program, for example, may need more than 4 gigabytes of data, and future needs may be still more. Better have 64-bit addresses to fit future needs than using complicated memory bank swapping and the like.
Am I correct in recalling that x86-64 doesn't actually expose a 64-bit address space, but rather a 48-bit one? See stackoverflow.com/questions/6716946/why-do-64-bit-systems-have-only-a-48-bit-address-space

However, this doesn't matter for my purposes. I'm asking why, in virtual memory, we mirror the physical memory addressing scheme. Why does a process use an address like 0x7fffd1510060 when it could use an address like 1 or 2 or D4? It's the process's exclusive virtual memory space – wouldn't it save a lot of memory if it could use one-byte pointers? I imagine that the TLB or MMU can translate these virtual addresses just as easily as it translates 0x7fffd1510060.

I've also wondered if we could use time stamps as virtual memory addresses – start the clock at zero nanoseconds for each process, every allocation marked by the nanoseconds transpired since the start of the process, each process having its own little Unix epoch if you will. This would also be more compact than the status quo, given some simple truncation and compression techniques. Time-stamped allocations might also be useful for a capabilities-based system, like the CHERI ISA at Cambridge or the Barrelfish OS that Microsoft and ETH Zurich have worked on.

Agner wrote:

I don't know if a JIT compiler needs anything special. Maybe string compare, but we can do that with vectors of 8-bit elements (this will work with UTF-8 strings). Anything else you have in mind for JIT compilers?
Parsing, parsing, and more parsing (and lexing). I'm not sure that processor and ISA designers have thoroughly explored how parsing performance might be improved. And of course the actual compilation to machine code. JITs have to make hard trade-offs with respect to generating maximally optimized code vs. generating code quickly. They have to forsake some of the optimizations that a static C/C++ compiler would provide. (Relatedly, you might find this interesting: Apple recently added LLVM as the last-stage compiler for their WebKit/Safari JavaScript JIT, but more recently replaced it with a completely new compiler. Very interesting deep dive here: https://webkit.org/blog/5852/introducing-the-b3-jit-compiler/)

The other big thing JITs and many non-JIT runtimes have to do is garbage collection. I think it's well worth thinking about how an ISA could be designed to optimize garbage collection. There are some papers out there on hardware-accelerated garbage collection, but I haven't seen anyone model how an ISA's design decisions could help (or hurt) garbage collection.

Agner wrote:

My proposal includes variable-length vector registers that enable the software to adapt automatically to the different vector sizes of different processors without recompiling. If one compiled executable file fits all variants of the processor, why do we need JIT compilers at all?
We need them for the web, for JavaScript. This will be the case for many years to come. And for REPL and interpreter-like execution environments. Julia comes to mind the most.

WebAssembly is rolling out later this year. It looks excellent and thankfully everyone is behind it: Microsoft, Mozilla, Google, and perhaps Apple. It will be a partly compiled bytecode that browsers can execute much, much faster than JavaScript. However it won't replace JavaScript, as it's meant more for games, multimedia, and compute-intensive workloads. For now, the only source languages supported are C and C++, but more are expected. https://github.com/WebAssembly/design

Now that I think about it, you really ought to be involved in that project. They would benefit from your input. Relatedly, the web standards authorities and browser makers have been working on SIMD.JS, which I think would also benefit from your insights. I'm surprised they haven't asked for your help (if in fact they haven't). https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/SIMD


Agner wrote:

The SSE4.2 instructions are ingenious, but very complicated for programmers to use. Most text strings intended for human reading are not so long that the speed of text processing really matters. The SSE4.2 instructions may be useful for other purposes, e.g. DNA sequence analysis.
I don't think text length is the issue. The new instructions are mostly designed to parse XML (and would work just as well for parsing any kind of structured text, HTML, even bytecode depending on some particulars.) From one of the Intel papers:

XML documents are made up of storage units called entities, like Character Data, Element, Comment, CDATA Section, etc. Each type of entity has its own well-formed definition that is a series of character range rules.[1] The main work of Intel XML parsing is to recognize these entities and their logic structures.

From Intel XML Parsing Accelerator, we found that character checking loop occupies more than 60% CPU cycles of the whole parsing process, depending on the property of benchmark. There are two kinds of important behavior in this loop, read bytes and check whether it is legal for its corresponding entity type. Without any parallel instructions for string comparison, this process must be implemented in serializing mode.

(From https://software.intel.com/en-us/articles/xml-parsing-accelerator-with-intel-streaming-simd-extensions-4-intel-sse4)

I think the instructions would be useful for superfast user agent detection in web servers. I think PCMPESTRI and the other instructions work with 16-byte strings, and you could probably take a 16-byte chunk of a certain area of the user agent string that would uniquely identify the key factors you cared about across all user agents, like mobile or not, specific browser and version (which could, for example, tell you if you could use the vectors in SIMD.JS because you'd know which browsers support it.) The web is too slow, and I think common web servers, applications, and databases would be much faster if they used modern CPU instructions in their code.

(Example user agent string: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36)

Cheers,

Joe D.

   
Do we need instructions with two outputs?
Author: Agner Date: 2016-04-02 04:26
I have found a way to do addition of very large integers without the need for special carry bits:
  1. Put the two big numbers into vector registers A and B, or as much of the numbers that will fit into the maximum vector length
  2. Calculate the sums of all 64-bit vector elements: C = A + B
  3. Make a vector of the carries, one bit in each 64-bit element: D = (C < A)
  4. Find elements with all ones: E = (C == -1)
  5. Combine bit 0 of each element of D into an integer bitfield: D1 = convert_boolean_vector_to_bitfield(D)
  6. Do the same with E: E1 = convert_boolean_vector_to_bitfield(E)
  7. Use integer operations to do the carry look-ahead. D1 is the elements that generate carry, E1 is the elements that propagate carry. F = E1 xor (E1 + 2D1 + carry_in)
  8. Convert back from bitfield to vector: F1 = convert_bitfield_to_boolean_vector(F)
  9. Add the propagated carries: Sum = C + F1
  10. Carry out if more rounds needed: Carry_out = F >> number_of_vector_elements
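
A scalar C++ sketch of these ten steps, with a uint64_t array standing in for one vector register (names are mine; I assume at most 63 elements so the bitfields fit in one integer):

#include <cstdint>
#include <cstddef>

uint64_t bignum_add(uint64_t sum[], const uint64_t a[], const uint64_t b[],
                    size_t n, uint64_t carry_in) {    // n <= 63
    uint64_t d1 = 0, e1 = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t c = a[i] + b[i];                     // step 2: element sums
        d1 |= (uint64_t)(c < a[i]) << i;              // steps 3+5: generate bits
        e1 |= (uint64_t)(c == ~(uint64_t)0) << i;     // steps 4+6: propagate bits
        sum[i] = c;
    }
    uint64_t f = e1 ^ (e1 + 2 * d1 + carry_in);       // step 7: carry look-ahead
    for (size_t i = 0; i < n; i++)
        sum[i] += (f >> i) & 1;                       // steps 8+9: add propagated carries
    return (f >> n) & 1;                              // step 10: carry out
}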

If we don't need extra carry bits for high precision arithmetics, then the only need for these extra bits is for detecting and propagating integer overflow.

If we don't have the extra carry flags, then we will need special instructions for checking if an addition, multiplication, etc. overflows. After each addition or other arithmetic instruction, issue another instruction with the same inputs just to check if the operation overflows. The outputs of all the overflow checks for a series of calculations should then be OR'ed together. The total number of instructions will be three times the number of instructions needed without overflow check.
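
A sketch of that three-instructions-per-operation pattern (using unsigned arithmetic internally so the wrap itself stays well-defined C++; the sign-bit expression is the standard two's-complement overflow test):

#include <cstdint>

int32_t sum3_checked(int32_t a, int32_t b, int32_t c, bool* overflow) {
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b, uc = (uint32_t)c;
    uint32_t t = ua + ub;                        // operation 1
    uint32_t o1 = ((ua ^ t) & (ub ^ t)) >> 31;   // check 1: both inputs differ in sign from result
    uint32_t u = t + uc;                         // operation 2
    uint32_t o2 = ((t ^ u) & (uc ^ u)) >> 31;    // check 2
    *overflow = (o1 | o2) != 0;                  // OR all the checks together
    return (int32_t)u;
}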

So what are the costs and benefits? Without the extra flag bits we need three times as many instructions if we want to check for integer overflow. We can avoid overflow in many cases by using 64-bit integers, but even 64-bit integers can easily overflow if the calculation has many multiplications. Compiler support for checking integer overflow is rare, unfortunately, so any mechanism for detecting integer overflow will perhaps not be used much. On the other hand, if an efficient method were available, we would probably see compilers that support it and programmers that use it. In fact, many programmers are frustrated over how difficult it is to detect signed integer overflow. Traps for integer overflow exceptions are not good for vector code. One solution is to use floating point calculations instead of integer, but I don't see that solution used much. Floating point calculations have longer latencies, of course. No solution seems to be really good here.

   
Do we need instructions with two outputs?
Author: Hubert Lamontagne Date: 2016-04-02 14:13
Agner wrote:
Thank you for your comments. It is nice to have sparring partners to discuss with. We are really getting somewhere.
Ha yeah. "Sparring partner" is a great way to put it. :3

I think there are many advantages to storing the vector length in the vector register itself. This makes it much cheaper to save callee-save registers: you only have to save the part of the register that is actually used. I plan to make special instructions for save/restore. These instructions will save or restore the length and as much of the data as is actually used. The save instruction can increment a pointer by the actual amount of space used. The restore instruction will have to be followed by an extra instruction to increment or decrement a pointer by the amount of space used. The same instruction can be used to adjust the stack pointer before saving. This will save a lot of data cache space. If a very long vector register actually contains only a floating point scalar, then we need not save any more than that. If the vector register is unused, which will probably happen more than 50% of the time, then we only need to save the length, which is zero. Also, we don't need a complicated lazy save mechanism for task switches. And we get one less input dependence because we don't need a separate register for vector length.

One catch is that vector registers need special alignment. For instance,
if your vector regs are 512bits and your DCache width is 512bits, you
want your vectors to be 512bit aligned when saved to the stack, so you
need to 512-align your stack pointer. Also, as the vector register size
increases, the stack size allocated to applications has to grow because
an app that was tested and known to work with 128bit vector saves/restores
might cause stack overflows if vectors become 512bits!

In light of this, I'd suggest:
- Scalar floats should get their own 32/64bit registers. In some C++
programs (games, audio, etc) they see lots of use and are often passed
to/from functions, so they need to be easy to save/restore. Since they have
the same 4/8byte alignment as 32/64bit ints, this is very easy to do if
you mandate an 8byte aligned SP in the calling convention.

- In fact, I'd even suggest a second, different stack for vector register
saves/restore. This way, the CPU can keep the VSP (vector stack pointer)
in perfect alignment all the time without having to do dynamic realignment
on the regular SP, and it can have special instructions that adjust the VSP
the right way for each vector, and the OS can grow the vector stack to make
sure that a program that expects 128-bit vector alignment will never generate
stack overflows when alignment grows to 256-bit+ etc...

I see the whole sequence for saving multiple vectors going something like this:

SUB sp, 12
; special instruction that gets the number of empty bytes to adjust vsp so that
; the next vector save is aligned and can be done in a single DCache cycle
GETVECTORSTOREPREPAD r0, vsp, v0.size
ST r0, [sp + 0]
ADD vsp, r0
ST v0, [vsp]
GETVECTORSTOREPREPAD r1, vsp, v1.size
ST r1, [sp + 4]
ADD vsp, r1
ST v1, [vsp]
GETVECTORSTOREPREPAD r2, vsp, v2.size
ST r2, [sp + 8]
ADD vsp, r2
ST v2, [vsp]

The whole sequence can probably be streamlined somewhat, but I
hope you get what I mean here.

The disadvantage is, of course, that the compiler needs to distinguish between saving the data of a vector and saving everything for a later complete restore. Once we have this distinction, however, there is little extra cost to also saving a few carry flags.

There's an extra cost, it's just in a different, non-obvious place:
it forces the compiler to figure out whether the carry bits are relevant
for each operation in a chain, and if the compiler can't figure it
out, it will output less efficient code. Whereas if carry flags are in their
own register, and only written/read by some operations (on ARM,
when SUB generates flags it is called SUBS and is a different
instruction), then the compiler only ever has to worry about carry
flags for instructions that expressly read/write them (ADDS / ADC /
ADCS / SUBS / SBC / SBCS / CMPS etc), and then it just
becomes one extra register pool in the compiler's register allocator.

The use of separate flag registers that only get written to by a limited
subset of operations is also good for CPUs, because then the CPU needs
a much, much less aggressive unit for handling flag register writes/reads,
since it only needs to handle 1 write/read per cycle, instead of
3-4+ like on x86.

The idea of having separate registers for system code only, sounds good. Would eight 64-bit registers be a suitable number?

8 registers for general-purpose OS usage sounds good. They can probably
all be combined into one instruction for reading/writing CPU control registers,
like CR0..CR15, DR0..DR15, MXCSR, MSW, GDTR, IDTR, LDTR, TR,
CS, SS, DS, ES, FS, GS, and the floating point control registers on x64.

Some system operations could also be implemented as reads/writes to system
registers: CPUID could be replaced by a bunch of read-only system registers
that give the CPU model for instance.

Having lots of potential system registers available would probably help with
limiting the need to add new system opcodes as new features are added.

Joe Duarte wrote:

Hubert and Agner, I have a related but slightly tangential question, since we're talking about TLBs.

In our applications, within a virtual memory space, why do we still use conventional -- and enormous -- memory addresses?

We use 64-bit addresses (or 48 actual bits, I think, on x64). Since it's a program's exclusive virtual memory space, a universe of our own, why can't we use arbitrary and much, much smaller addresses? Like 1, 2, 3, etc., basically a byte or two. If the TLB is going to have to translate anyway, is it a problem to translate small addresses into conventional physical ones? Processes could have unique IDs - one byte would be ample most of the time - so the TLB would have a unique identifier for all addresses (process ID + the process's internal small memory addresses), and it would often take only two or three bytes total.

You could do a variety of things with pages and their sizes in that scenario. What am I missing here? I usually stick to managed languages.

ARM and x86 handle this by having a 32-bit mode. :3

You could definitely argue for going for a 32-bit
architecture with a standard 64-bit extension, like what
RISC-V does. This makes total sense for an architecture
that's going to be used in embedded devices.

Even MIPS falls in this case: ultimately, it was used in way
more PS2's, set-top boxes and so forth than in servers.
Because of this, its 32bit support was a good thing
(and you can even argue that the infamous branch-delay-slot
was actually GOOD for MIPS).

----

Anyhow, the issue here I guess is that if your app has pointers
smaller than 4 bytes, well:

- 3 bytes can't be aligned to cache lines. If you have an array
of pointers, eventually one of the pointers has to straddle
2 cache lines, and that's a much larger penalty than some
wasted DCache due to using 4-byte pointers.

- 3 bytes is only a 16 MB address space, and few languages
want to be limited to something that small. You cannot start
with a 3-byte-pointer program and then dynamically upgrade
everything to 4 bytes if you ever run out of space. Might as
well make everything 4 bytes from the start.

- 2 bytes is only 64 KB. Sure, you could have 2-byte pointers,
then have all sorts of data zone selectors to grow out of that:
this is exactly how 16-bit x86 works, and programs just totally
outgrew it. Even the late generation of DOS games are way
too large for this and do everything in 32 bits.

Anyhow, this is also why the page table has 2 levels on 32-bit x86: a lot
of apps are only going to use a few hundred kilobytes of RAM,
so it makes sense to only have page tables for small sections
of RAM. In the top-level page table, all the unused 4 MB
blocks-of-pages are simply marked as invalid, and extra page
tables are allocated as the program's memory usage grows.

Separately, I think it would be fruitful to think about how an ISA could be designed to help JITs, and common parsing workloads like the web. If we designed an ISA to increase the performance of browser JS engines, .NET and Java VMs, etc., what would that look like? Would it be instruction selection that mattered most, or other aspects of the architecture? Intel rolled out some great instructions for string parsing in SSE 4.2, but my impression is that hardly any developers know about them or use them (I couldn't find them in nginx or Apache source, or the browsers I checked, and they could really benefit.) That raises a separate issue in getting developers to pay attention to ISAs and optimization wins...

Unfortunately, this kind of higher-level stuff typically has all
sorts of special cases and exceptions that make it
impossible to implement in fast circuits.

For instance, JS interpreters often have the trick of storing
numbers as 64-bit floats, and other types as an invalid
64-bit float with a pointer in it. Well, now instead of just
adding two floats together like in C++, you first have to decide
whether one of your floats is actually a string, and if so you have
to trigger a string-to-double conversion.
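
Here's a sketch of that trick, usually called NaN-boxing (my simplified tagging scheme; real engines such as JavaScriptCore differ in the details and canonicalize genuine NaNs so they can't collide with the tag):

#include <cstdint>
#include <cstring>

struct Value { uint64_t bits; };
const uint64_t TAG = 0xFFF8000000000000ull;   // inside the quiet-NaN encoding space

Value box_double(double d)  { Value v; std::memcpy(&v.bits, &d, 8); return v; }
Value box_pointer(void* p)  { return Value{ TAG | (uint64_t)(uintptr_t)p }; }
bool  is_pointer(Value v)   { return (v.bits & TAG) == TAG; }

double to_number(Value v);  // hypothetical slow path: string-to-double conversion etc.

double add(Value a, Value b) {
    if (is_pointer(a) || is_pointer(b))      // the type check plain C++ never pays for
        return to_number(a) + to_number(b);  // slow path with possible side effects
    double x, y;
    std::memcpy(&x, &a.bits, 8);
    std::memcpy(&y, &b.bits, 8);
    return x + y;                            // fast path: just a float add
}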

Or even worse, in some languages it can be some type with
an overloaded conversion to double, and the conversion code
can call all sorts of functions generating all sorts of side effects
(like printing to the console), which means that the compiler
can't even change the order of two additions to make things
faster, because that could introduce new bugs.

String parsing also falls into this kind of case. For instance,
checking the length of a string is not so obvious: often the
interpreter knows the number of bytes in a string, but some of
those bytes are parts of multi-byte UTF-8 characters, and if you
account for special cases (like ending up with Latin-1 aka CP-1252
text), then there's often really no alternative to just looping through
the string byte by byte, in which case it's impossible to be faster
than x86 / ARM / MIPS, even though those chips have no special
support for this.

This is also why so few higher-level languages support multiple
threads / multiprocessing: there are just so many weird states that
the interpreter can be in, and so many weird possible
side effects, that you can't let a second core go through that data -
it would just create all sorts of nasty race conditions and so forth.

-----

Java and .NET are in a lesser class of badness, in that a Java
float is just a float and you can add it normally, and you can easily
run Java multi-threaded (and it has a nice way of adding mutexes
to objects). But then again, Java benefits way less from special
architectures - if your Java program can't run fast on x86, then it
can't run fast on anything!

The real cost of Java is the Garbage Collector, which makes it
impossible to avoid garbage-collection pauses, which are often
longer than 100ms, or even occasionally longer than 300ms.
This is perfectly fine for server software (and Java is an enterprise
language so it fits this perfectly), but it makes Java unsuitable for
software that can't really have pauses like games and heavy-duty
user-facing GUI software (Photoshop, C++ IDEs, Word, iPhone
apps etc). This cannot be solved by processor design.


Agner wrote:

If we don't need extra carry bits for high precision arithmetics, then the only need for these extra bits is for detecting and propagating integer overflow.

If we don't have the extra carry flags, then we will need special instructions for checking if an addition, multiplication, etc. overflows. After each addition or other arithmetic instruction, issue another instruction with the same inputs just to check if the operation overflows. The outputs of all the overflow checks for a series of calculations should then be OR'ed together. The total number of instructions will be three times the number of instructions needed without overflow check.

So what are the costs and benefits? Without the extra flag bits we need three times as many instructions if we want to check for integer overflow. We can avoid overflow in many cases by using 64-bit integers, but even 64-bit integers can easily overflow if the calculation has many multiplications. Compiler support for checking integer overflow is rare, unfortunately, so any mechanism for detecting integer overflow will perhaps not be used much. On the other hand, if an efficient method were available, we would probably see compilers that support it and programmers that use it. In fact, many programmers are frustrated over how difficult it is to detect signed integer overflow. Traps for integer overflow exceptions are not good for vector code. One solution is to use floating point calculations instead of integer, but I don't see that solution used much. Floating point calculations have longer latencies, of course. No solution seems to be really good here.

Dunno, I write C++ sound code, and for code with lots
of multiplications that could overflow, I use floating point
code _all the time_. :3

Floating point does have higher latency, but that's true of
any code that uses lots of multiplication. Float multiplication
also has the really nice property that you need a lot fewer
shifts and clamps than in fixed-point code, so for algos with tons
of multiplications (say, FFT for instance), floating point
is really, really useful.

If I need to do multiplications in integer code (for instance,
in a FM synthesizer - can't use float because converting
float to int to use as array indexes is slow), then what I
generally do is that I just make sure my * operands are
essentially 16-bits (or, say, one operand is 24-bits and the
other is 8-bits), and carefully use lots of >> bitshifts to keep
everything in range. For sound generation, this is generally
fine because the human ear isn't that 'precise' anyways
(in terms of bits).

Another special case was writing ARM NEON code, which had
lackluster floating point and only a 16x16 multiplier in the hardware
(32x32 multiplies had a significant penalty). So I used a lot of a
particular opcode called VQDMULH.s16 - a 16x16->16 vector
signed doubling multiply that keeps the top part of the result and
clamps the special case of -32768 * -32768 to 32767 - equivalent to this:

int32_t res = (int16_t)a * (int16_t)b;  // 16x16 -> 32-bit signed multiply
res = res >> 15;                        // doubling multiply, keep the high part
if (res == 32768)
    res = 32767;   // clamp the particular case of -32768 x -32768
int16_t out = (int16_t)res;

----

That being said, I have had to use 32x32->64 wide multiplication
in code 2 or 3 times. But I think it's a special case, compared to
the 32bit float multiplication, which I use all over the place!

   
Do we need instructions with two outputs?
Author: Agner Date: 2016-04-03 13:49
Hubert Lamontagne wrote:
One catch is that vector registers need special alignment. For instance, if your vector regs are 512bits and your DCache width is 512bits, you want your vectors to be 512bit aligned when saved to the stack, so you need to 512-align your stack pointer.
The code has to be compatible with different processors with different vector sizes, so we don't want the required alignment to depend on the processor. A separate stack for vectors is a possible solution, but very wasteful. Cache space is a limiting resource, so I don't want to save a possibly very long vector when only a small part of it is used. This is why I think it is smart to save the vector length in the vector register itself. A save instruction will save only as much as is needed. In the caller-save situation, the function knows how much of the register is used, so it can use a normal store instruction. In the callee-save situation, the function will rely on the register length information and use the special save instruction to save only what is needed. This situation will be rare anyway, because the 16 caller-save registers will be enough for most purposes.

Separate registers for floating point scalars would be useful if we had to save the full vector in a callee-save situation, but the length information eliminates this need.

I think we can live with 8-byte alignment of vectors. The hardware can handle this with a simple barrel shifter, but it may have to load an extra cache line, of course. Large arrays should be aligned by the cache line size for optimum performance.

I think the format for saving vector registers with the special save instruction should be implementation dependent. It may use a single byte for the length if the length cannot exceed 128 bytes, or it may use more for longer vectors or for the sake of alignment. It may even compress the data if this can be done fast enough. For example, a boolean mask vector using only one bit of each 64-bit element can obviously be compressed a lot. The format will be padded to fit whatever alignment is optimal on the particular processor. The software should use data in the special "save format" for no other purpose than to restore the register on the same processor.

It is a disadvantage that the saved format may be longer than the maximum vector length when it includes the length information. But I think this is outweighed by the advantage that most saved registers will use less space. Many registers will be unused and store only a zero for the length.

There's an extra cost, it's just in a different non-obvious place: it forces the compiler to figure out if the carry bits are relevant for each operation in a chain, and if the compiler can't figure it out, it will output less efficient code. Whereas if carry flags are in their own register, and only written/read by some operations (on ARM, when SUB generates flags it is called SUBS and is a different instruction), then the compiler only ever has to worry about carry flags for instructions that expressly read/write them (ADDS / ADC / ADCS / SUBS / SBC / SBCS / CMPS etc), and then it just becomes one extra register pool in the compiler's register allocator.
As I wrote, I have found an alternative solution for add with carry. We only have to consider whether we need an efficient way of tracking integer overflow.

CPUID could be replaced by a bunch of read-only system registers that give the CPU model for instance.
Good idea!

Joe Duarte wrote:

Since it's a program's exclusive virtual memory space, a universe of our own, why can't we use arbitrary and much, much smaller addresses?
My priority is the performance of big systems. That's why I have 64-bit address space. All addresses are relative to some pointer (instruction pointer, data section pointer, stack pointer, or an arbitrary pointer) with a signed offset of 8 bits or 32 bits. The instruction size will not be reduced by having a smaller address space, but of course we could save some stack space by having a 32-bit mode. I don't like having two different modes, though. Then we would have problems with stack alignment for doubles, etc. Byte code languages can have their own smaller address space of course.

String parsing also falls into this kind of case. For instance, checking the length of a string is not so obvious: often the interpreter knows the number of bytes in a string, but some of those bytes are parts of multi-byte UTF-8 characters, and if you account for special cases (like ending up with Latin-1 aka CP-1252 text), then there's often really no alternative to just looping through the string byte by byte
I think we should use UTF-8 only. It is possible to search for a terminating zero by loading a full vector and comparing all bytes in the vector with zero. My ABI requires a little extra space at the end of user memory to avoid an access violation when reading past the end of a string that happens to be placed at the very end of user memory. But of course it is more efficient to save the length of the string.
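
As a sketch, the zero-scan looks like this with SSE2 intrinsics (x86-specific, and __builtin_ctz is a GCC/Clang builtin; my proposed instruction set would use its own vector compare, and the extra space at the end of user memory is what makes the over-read safe):

#include <emmintrin.h>
#include <cstddef>

size_t vec_strlen(const char* s) {
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; ; i += 16) {
        __m128i bytes = _mm_loadu_si128((const __m128i*)(s + i));   // load a full vector
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(bytes, zero));  // one bit per zero byte
        if (mask != 0)
            return i + __builtin_ctz(mask);  // position of the first zero byte
    }
}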

I write C++ sound code, and for code with lots of multiplications that could overflow, I use floating point code _all the time_. :3
Hardware multipliers are expensive, and divisors are even more expensive. I wonder if we need to support multiplication and division of all operand sizes, including vectors of 8-bit and 16-bit integers, if programmers are using floating point anyway?
   
Do we need instructions with two outputs?
Author: Joe Duarte Date: 2016-04-03 16:51
Joe Duarte wrote:

Since it's a program's exclusive virtual memory space,
a universe of our own, why can't we use arbitrary and
much, much smaller addresses?

Agner replied:

My priority is the performance of big systems. That's
why I have 64-bit address space. All addresses are
relative to some pointer (instruction pointer, data
section pointer, stack pointer, or an arbitrary
pointer) with a signed offset of 8 bits or 32 bits.
The instruction size will not be reduced by having a
smaller address space, but of course we could save
some stack space by having a 32-bit mode. I don't like
having two different modes, though. Then we would have
problems with stack alignment for doubles, etc. Byte
code languages can have their own smaller address
space of course.

I just don't like all the waste with 64-bit types and pointers. It becomes less of an issue over time, but it still bothers me. Relatedly, I like what these Berkeley people did with Precimonious, a tool that tunes floating point precision and eliminates some of the waste: www.eecs.berkeley.edu/~rubio/includes/sc13.pdf

Question: Does an ISA really need to specify the number of architectural registers? What would the implications be of not doing so, and having an infinite number of architectural registers like LLVM's IR? It seems like the number of registers is a fiction anyway (I was stunned to discover that x86-64 processors from Intel and AMD have nearly 200 physical registers.) This would make the number of registers implementation-dependent, rather than part of the ISA specification. See Vikram Adve's recent talk at Microsoft: research.microsoft.com/apps/video/default.aspx?id=249344 (The Microsoft Research people must have been having a bad day or something – their questions and comments reveal that they thoroughly misunderstood his ideas.)

Question 2: Let's assume that we have registers R0 - R31. Might it be useful to also have an unspecified register – call it Rx – that basically tells the CPU "give me whatever register you have – I don't care which one". I can imagine some scenarios where this might be useful for a compiler. And it seems to fit with the reality of register renaming anyway.

   
Do we need instructions with two outputs?
Author: Hubert Lamontagne Date: 2016-04-03 19:00
Suppose the SP is at 0x02018, the L1 cache lines are 64 bytes in size, and you want to save a vector that's, say, 48 bytes long (twelve 32-bit floats; the original "24 bytes" did not match the addresses below). You need to save the vector data, and then the control word that tells you the size, compression etc. of the vector. Fair enough: the vector data goes to 0x01FE8..0x02017. And then you have to save the vector size control word, which puts you at 0x01FE4 if you assume 32 bits... but this doesn't work, because then your SP is no longer 8-byte aligned for 64-bit integer and floating-point values. So you must save the vector size word to 0x01FE0 instead, with some extra padding (and the CPU either stores the amount of padding in the vector size word, or recalculates the amount of padding from the SP alignment and the vector size when reloading the vector).

Another possibility is that you could add some post-padding, so that the vector data is saved to 0x01FD0..0x01FFF and the control word goes to 0x01FC8, so that the whole thing fits in a single cache line. The amount of post-padding must be saved in the vector size control word.

Yeah, it's doable. But it's a long multicycle instruction, probably microcoded - after all, it writes an unpredictable number of bytes at unpredictable offsets, often spanning 2 different cache lines, updates the SP, and involves multiple address calculations to figure out just how much pre-padding and post-padding you need to keep your stack and your data well aligned. And it's very likely to completely block memory operation reordering (i.e. act like a memory barrier), because it's too difficult for concurrent memory operations to figure out whether they will overlap or not.

Agner wrote:

Hardware multipliers are expensive, and dividers are even more expensive. I wonder if we need to support multiplication and division for all operand sizes, including vectors of 8-bit and 16-bit integers, if programmers are using floating point anyway?
Generally, 8-bit and 16-bit vector multiplications are provided in SIMD instruction sets to do stuff like movie decoding and software rendering (when OpenGL/DirectX are unavailable due to software constraints, such as running as a plugin). For scalars, 32*32->32 multiplies cover everything (and are common in C++ code), but some CPUs also provide 16*16->32 multiplies because they run faster (ARM).
   
Do we need instructions with two outputs?
Author: Agner Date: 2016-04-04 08:50
Joe Duarte wrote:
Does an ISA really need to specify the number of architectural registers? What would the implications be of not doing so, and having an infinite number of architectural registers like LLVM's IR?
What do you mean? If we have 1023 virtual registers like LLVM then we need 10 bits in the opcode for each register. Or do you want a rolling register stack? Then we have a problem when the register stack overflows. That would be quite wasteful if the overflow happens in the innermost loop.

Might it be useful to also have an unspecified register – call it Rx – that basically tells the CPU "give me whatever register you have – I don't care which one".
All of the registers behave like that in a superscalar processor. You ask for a particular architectural register and you get a random physical register - you don't even know which one you have got.

Hubert Lamontagne wrote:

then your SP is no longer 8-byte aligned for 64-bit integer and floating-point values. So you must save the vector size word to 0x01FE0 instead, with some extra padding (and the CPU either stores the amount of padding in the vector size word, or recalculates the amount of padding from the SP alignment and the vector size when reloading the vector).
Yes, this is a complication. I want to handle it in software without any complex instructions. If the size of the saved register image is not guaranteed to be a multiple of the stack word size then I would first calculate the amount of space needed for all the vector registers I want to save, then save the stack pointer to another register, then subtract the necessary size from the stack pointer, then align the stack by 8 ( AND SP,-8 ), then use a temporary register as pointer to the save area, and then save the registers, incrementing the pointer each time. The restore process is easier. There is no problem with saving registers during a task switch because you have a pre-allocated space that is big enough to cover all cases.
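
A rough C++ model of that save sequence (my illustration only; the real thing would be a handful of instructions in the proposed ISA):

#include <cstdint>
#include <cstring>

// Sketch of the save sequence described above. save_sizes[i] is the number
// of bytes needed to save vector register i (not necessarily a multiple
// of 8); reg_images[i] stands in for the register contents.
char* save_vectors(char* sp, const uint32_t* save_sizes,
                   const char* const* reg_images, int n) {
    uint64_t total = 0;
    for (int i = 0; i < n; i++)
        total += save_sizes[i];                     // space needed for all
    char* old_sp = sp;                              // save SP to another register
    sp -= total;                                    // subtract the necessary size
    sp = (char*)((uintptr_t)sp & ~(uintptr_t)7);    // align the stack: AND SP,-8
    char* p = sp;                                   // temporary pointer to save area
    for (int i = 0; i < n; i++) {                   // save the registers,
        memcpy(p, reg_images[i], save_sizes[i]);    // incrementing the pointer
        p += save_sizes[i];                         // each time
    }
    (void)old_sp;   // old_sp is what gets restored on the way out
    return sp;      // the new, aligned stack pointer
}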

The alternative is to give the saved image of each register a size that is a multiple of the stack word size. This will make it easier to spill registers on the stack at the cost of using more space. It will make it simpler for the compiler and also easier for the hardware because of the better alignment. The cost is that it uses more cache space, which is probably more expensive than the extra instructions. If all vector registers are unused, then we will need only one or two bytes for saving each with the first method, versus eight bytes with the second method.

It is difficult to weigh the costs/benefits of these two solutions against each other, but you are right that the first method is very complicated.

Generally, 8-bit and 16-bit vector multiplications are provided in SIMD instruction sets to do stuff like movie decoding and software rendering
x86 does not have a complete set of vector multiply instructions. NEON has 8, 16, and 32 bits. I don't see what you would need 8-bit vector multiply for.
   
Do we need instructions with two outputs?
Author: Hubert Lamontagne Date: 2016-04-04 21:01
Joe Duarte wrote:

Question: Does an ISA really need to specify the number of architectural registers? What would the implications be of not doing so, and having an infinite number of architectural registers like LLVM's IR? It seems like the number of registers is a fiction anyway (I was stunned to discover that x86-64 processors from Intel and AMD have nearly 200 physical registers.) This would make the number of registers implementation-dependent, rather than part of the ISA specification. See Vikram Adve's recent talk at Microsoft: research.microsoft.com/apps/video/default.aspx?id=249344 (The Microsoft Research people must have been having a bad day or something – their questions and comments reveal that they thoroughly misunderstood his ideas.)

This has a cost:
- You can make instructions variable-sized to accommodate different numbers of registers, but this increases the branch mispredict penalty and makes it hard to run multiple instructions at the same time.
- What if your CPU has 32 registers and the program uses the 33rd? You can spill values to memory, but then the CPU has to figure out that it doesn't conflict with any other memory reads/writes.
- More registers = larger instructions. MIPS instructions would take 42 bits instead of 32 (47 bits for three-register instructions) if you had 1024 registers instead of 32.
- Larger register files are slower, which reduces the clock rate or causes more stalls due to the extra latency cycles required to get register values.

Question 2: Let's assume that we have registers R0 - R31. Might it be useful to also have an unspecified register – call it Rx – that basically tells the CPU "give me whatever register you have – I don't care which one". I can imagine some scenarios where this might be useful for a compiler. And it seems to fit with the reality of register renaming anyway.

Okay, you can write a result to "Rx", but how do you find which register "Rx" is once you want to read back your result and do something with it? What if you write to multiple "Rx"es, how do you keep track of what went where?

------------------------------------

Agner wrote:

Yes, this is a complication. I want to handle it in software without any complex instructions. If the size of the saved register image is not guaranteed to be a multiple of the stack word size then I would first calculate the amount of space needed for all the vector registers I want to save, then save the stack pointer to another register, then subtract the necessary size from the stack pointer, then align the stack by 8 ( AND SP,-8 ), then use a temporary register as pointer to the save area, and then save the registers, incrementing the pointer each time. The restore process is easier.
Oh, I see. You'd use some kind of massive "vector store/load (including size prefix byte)" instruction that's basically never aligned to save all the vectors, then reestablish stack alignment. And for the C++ ABI, you'd force all caller functions to save all SIMD vectors and floats and doubles, and use the caller's knowledge of what's in the registers to do a simpler aligned non-variable-sized save (instead of a massive unaligned variable-sized save)... On paper it works, but for some reason I find that rather scary... :3

For instance, if you have a function working on a bunch of scalar floats, and it then calls some sub-function (say, "sqrt()" or something like that), won't it have to spill every single float register that it's working on onto the stack (potentially on every iteration of a loop)?

Generally, 8-bit and 16-bit vector multiplications are provided in SIMD instruction sets to do stuff like movie decoding and software rendering
x86 does not have a complete set of vector multiply instructions. NEON has 8, 16, and 32 bits. I don't see what you would need 8-bit vector multiply for.
I'm thinking of the case of 32bpp RGBA bilinear interpolation/texture mapping/rotozoom, along with alpha blending, for the cases where you can't use OpenGL (for instance in Macromedia Flash). That's a notorious CPU pipeline buster, because the algo eats up a ton of small multiplications.
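
As a small illustration of why packed small multiplies matter here, one color channel of a plain alpha blend (my sketch; a SIMD version does 8 or 16 of these per instruction):

#include <cstdint>

// One color channel of an alpha blend: out = (src*a + dst*(255-a)) / 255,
// rounded. Each pixel needs several such 8x8->16 bit multiplies, which is
// why SIMD instruction sets provide packed 8/16-bit multiplies.
inline uint8_t blend_channel(uint8_t src, uint8_t dst, uint8_t a) {
    uint32_t v = src * a + dst * (255 - a);   // at most 255*255 = 65025
    uint32_t t = v + 128;                     // rounded divide by 255,
    return (uint8_t)((t + (t >> 8)) >> 8);    // exact for this value range
}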
   
Do we need instructions with two outputs?
Author: Joe Duarte Date: 2016-04-06 19:51
Hubert said:
- 3-bytes can't be aligned to cache lines. If you have an array of pointers, eventually one of the pointers has to straddle 2 cache lines and that's a much larger penalty than some wasted DCache due to using 4-byte pointers.

- 3-bytes is only a 16 Mb address space, and few languages want to be limited to something that small. You cannot start with 3-byte pointer program and then dynamically upgrade everything to 4-bytes if you ever run out of space. Might as well make everything 4-byte from the start.

You're assuming byte-addressable memory. I'm assuming that these pointers or references would point to memory objects of arbitrary size, determined by what the variable, object, or function needs. I don't see why a program can't just tag its objects and entities in a virtual memory space with clean and compact pointers (but without garbage collection – just virtual memory.) I feel like there's not enough *virtual* in virtual memory right now – we should be able to abstract more.

Agner wrote:

Joe Duarte wrote:
Does an ISA really need to specify the number of architectural registers? What would the implications be of not doing so, and having an infinite number of architectural registers like LLVM's IR?
What do you mean? If we have 1023 virtual registers like LLVM then we need 10 bits in the opcode for each register. Or do you want a rolling register stack? Then we have a problem when the register stack overflows. That would be quite wasteful if the overflow happens in the innermost loop.

I mean register assignment could be sorted out at install time, or what's commonly called Ahead-of-Time compilation. One way to think of it is that some of what an LLVM back end does could be postponed until the application is installed and the precise characteristics of the CPU are known. If you look at the SPIR-V IR just released from Khronos, I think some of the optimization will happen at install time. And I think the Mill CPU architecture does this as well – the code is partly compiled into a "specification" until it's installed and knows which version of the CPU the user has.

Adve is talking about something similar in his talk. Also see Nuzman et al's "Vapor SIMD: Auto-Vectorize Once, Run Everywhere": https://www.irisa.fr/alf/downloads/rohou/doc/Rohou_CGO11.pdf

On the issue of an Rx register, I think it might be a useful abstraction in some cases. You said that CPUs do this already, that we don't know which register we're getting. Yet we're still naming registers explicitly, and you find it useful to retain named registers in an ISA. There are benefits to having named architectural registers, and I think there would be benefits from anonymous register semantics. Hubert asked how we'd refer back to it. There would be rules about how such register semantics could be used, and how to manage them – they wouldn't be the same as the normal registers. There are a few ways to go about it.

   
Do we need instructions with two outputs?
Author: Hubert Lamontagne Date: 2016-04-07 23:32
Joe Duarte wrote:
Hubert said:
- 3-bytes can't be aligned to cache lines. If you have an array of pointers, eventually one of the pointers has to straddle 2 cache lines and that's a much larger penalty than some wasted DCache due to using 4-byte pointers.

- 3-bytes is only a 16 Mb address space, and few languages want to be limited to something that small. You cannot start with 3-byte pointer program and then dynamically upgrade everything to 4-bytes if you ever run out of space. Might as well make everything 4-byte from the start.

You're assuming byte-addressable memory. I'm assuming that these pointers or references would point to memory objects of arbitrary size, determined by what the variable, object, or function needs. I don't see why a program can't just tag its objects and entities in a virtual memory space with clean and compact pointers (but without garbage collection – just virtual memory.) I feel like there's not enough *virtual* in virtual memory right now – we should be able to abstract more.
Okay, but how do you get the individual fields out of the object you've got the reference to? You then need both the object's handle/tag/reference/id, and an offset from the start of the object data (traditionally in bytes), or at least some kind of field ID (but that adds yet another translation pass to get the real byte offset).

The other issue is that you're probably going to hammer the TLB a lot harder that way: you're making the TLB hold all your real pointers instead of just page remaps. Which means you'll probably need a bigger, more complex TLB.

Third, this doesn't play well with C++ specifically, because objects aren't necessarily housed within their own memory allocation, they can be contained inside a larger object, or inside an array, or on the stack. So you need to save not only the handle, but also the byte offset. This is essentially the same thing as the infamous FAR pointer from 16-bit x86 coding.

Joe Duarte wrote:
https://www.irisa.fr/alf/downloads/rohou/doc/Rohou_CGO11.pdf

On the issue of an Rx register, I think it might be a useful abstraction in some cases. You said that CPUs do this already, that we don't know which register we're getting. Yet we're still naming registers explicitly, and you find it useful to retain named registers in an ISA. There are benefits to having named architectural registers, and I think there would be benefits from anonymous register semantics. Hubert asked how we'd refer back to it. There would be rules about how such register semantics could be used, and how to manage them – they wouldn't be the same as the normal registers. There are a few ways to go about it.

Then you've either got a register stack like the 8087 FPU or some ultra-low-power CPUs (GreenArrays chips, designed to run Forth), or a queue like in the Mill (which it calls the "belt") or the Itanium rotating register file, depending on whether your system forgets the newest values after use or the oldest ones.
   
Do we need instructions with two outputs?
Author: HarryDev Date: 2016-04-08 05:46
Hi, related to the JIT discussion, I think the proposed ISA needs specific instructions for memory copying, initialization, and comparison. In particular, memory copying of anything from a small number of bytes up to big blocks is important in GC-related scenarios. Often a memory copy ends up being implemented as multiple complicated vector instructions, when an ISA instruction would be so much better (the CPU could then handle it in the way that is most optimized for it). Some research has been done on this where "cpblk" and "initblk" instructions were evaluated, and these showed great benefit for code size and speed. I would allow these instructions to have an element size defined, i.e. 1, 2, 4, 8, or 16 bytes, so one can initialize a "double" array quickly; perhaps even a zero-out "zeroblk" instruction, since zeroing is used a lot in managed memory scenarios.
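
To make the suggested semantics concrete, a rough sketch of what an "initblk" with an element size could mean (the names come from the research mentioned above; the loop is just my illustration - a real ISA instruction would let the CPU pick the optimal store width):

#include <cstddef>
#include <cstring>

// Hypothetical semantics of the suggested "initblk" with an element size:
// fill 'count' elements of 'elem_size' bytes each with the same pattern.
void initblk(void* dst, const void* elem, size_t elem_size, size_t count) {
    char* p = static_cast<char*>(dst);
    for (size_t i = 0; i < count; i++) {
        memcpy(p, elem, elem_size);   // e.g. elem_size = 8 for a double array
        p += elem_size;
    }
}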
   
Do we need instructions with two outputs?
Author: Hubert Lamontagne Date: 2016-04-09 00:06
HarryDev wrote:
Hi, related to the JIT discussion, I think the proposed ISA needs specific instructions for memory copying, initialization, and comparison. In particular, memory copying of anything from a small number of bytes up to big blocks is important in GC-related scenarios. Often a memory copy ends up being implemented as multiple complicated vector instructions, when an ISA instruction would be so much better (the CPU could then handle it in the way that is most optimized for it). Some research has been done on this where "cpblk" and "initblk" instructions were evaluated, and these showed great benefit for code size and speed. I would allow these instructions to have an element size defined, i.e. 1, 2, 4, 8, or 16 bytes, so one can initialize a "double" array quickly; perhaps even a zero-out "zeroblk" instruction, since zeroing is used a lot in managed memory scenarios.
That's why x86 has the 'string copy' etc. instructions. And the speed gain depends on the CPU generation:

- On the 8086 / 80286 / 80386, loading instructions was slow (no instruction cache!), so the string instructions were indeed the fastest. Other CPUs of the time also had stuff like this: the Z80 has string copies too, as does the 65816, and the 68000 has move multiple.

- On the 486 and Pentium, designers realized that non-microcoded instructions can be implemented to run a lot faster, but there was only so much space on the chips, so only a few instructions could get the speedup; string instructions were left out and actually ran slower... RISC CPUs don't have any string copy instructions either - it just doesn't fit their design!

- On later out-of-order x86 CPUs, string copies are still microcoded, so they're slow to start, but once started they do wide multi-byte copies. So the end result is similar to the multiple complicated vector instructions anyway (since CPUs still have to deal with alignment). On the latest Ivy Bridge and Haswell, string copy instructions are now the same speed as software loops - but that's a new thing: on preceding architectures, string copy instructions were still slower... RISC architectures like ARM mostly went with vector instructions to fill this need.

So I'm not sure that string copy/zero/compare instructions are still relevant nowadays: memory bandwidth is not all that fast, and the kind of pointer updating that was costly on 8- and 16-bit CPUs is now free, since it runs in the ALUs in parallel while the data cache does its job. And having CPU string copy instructions doesn't remove the need for complex cacheline-aligned loops - it just moves that complexity from the software to the microcode. I think it would be better to look into garbage collection algorithms that don't do compaction and don't recopy everything (and an end to Java's "allocate humongous blocks of memory on startup" approach).

   
How about stack machine ISA?
Author: A-11 Date: 2016-04-10 07:35
It is probably too late, but since I found a piece of research today, I will ask about it.

BOOST: Berkeley's Out-of-Order Stack Thingy
https://www.researchgate.net/publication/228556746_BOOST_Berkeley's_Out-of-Order_Stack_Thingy

Just as register machines use "register renaming", these researchers propose "address renaming" for stack machines to implement out-of-order execution.
So I think there is no performance reason to rule out a stack machine.
Since stack machine instruction sets never bother us with the number of registers, I think a stack machine's ISA is more extensible than a register machine's.
Also, a stack ISA is easier to target for compilers, as the UCSD p-System showed.
This should also hold true for just-in-time compilation.

   
treating stack ISA as CISC architecture
Author: A-11 Date: 2016-04-14 02:55
The BOOST architecture above converts each stack instruction into one corresponding RISC-like micro-op.
Since a stack ISA is even more fine-grained than a load-store RISC ISA, which at least puts the opcode field and the operand fields together in one instruction, BOOST generates more RISC-like micro-ops and requires an even more massive renaming unit than a RISC architecture does.
I think this is the major difficulty of BOOST, just as RISC requires more instruction throughput than CISC.
Since each reservation station entry in a conventional CISC machine handles a "macro-op" which represents a read-modify-write sequence and expands into several RISC-like micro-ops, CISC machines beat RISC machines.
I think the source of CISC's power is such a powerful RS entry, which tends toward being a tiny processor of its own, as in EDGE.
Explicit_Data_Graph_Execution - Wikipedia
en.wikipedia.org/wiki/Explicit_Data_Graph_Execution
This means that more powerful macro-ops put less pressure on the renaming unit, and thus extract more performance.
For example, read-modify-read-modify-read-write macro-ops such as the line below are preferable to plain read-modify-write ops.
read [sp + 2] ; read [sp + 4] ; mul ; read [sp + 5] ; add ; load-from-heap ; write [sp + 2] ;
But you can see that this one macro-op is equivalent to a sequence of 7 instructions written in a stack ISA.
Thus, I think a stack ISA can be treated as a CISC architecture with an instruction border marking unit.
Border marking could use an algorithm similar to matching parentheses.
Note that the instruction borders are determined by the machine implementation, not by the ISA as in x86.
If a machine can't deal with big macro-ops like the one above, its border marking unit can split the stack ISA sequence to fit its macro-ops:
read [sp + 2] ; read [sp + 4] ; mul ; //read-read-modify-(write to stack)
read [sp + 5] ; add ; //(read from stack)-read-modify-(write to stack)
load-from-heap ; write [sp + 2] ; //(read from stack)-read-write
Or, since a simple stack machine (single-issue, in-order) is so tiny that it might fit in an RS entry, you might be able to treat a whole block delimited by jump instructions as one macro-op.
   
treating stack ISA as CISC architecture
Author: Agner Date: 2016-04-14 09:03
Thank you for the reference to Explicit Data Graph Execution (EDGE) https://en.wikipedia.org/wiki/Explicit_data_graph_execution
If the EDGE principle can be implemented smoothly, it might be more efficient than splitting a job into multiple threads running in multiple cores. The thread management, synchronization, and communication between threads are very complicated and inefficient in current systems. An alternative would certainly be worth exploring.

However, I don't quite understand how the Hyperblocks are coded and delimited.

   
treating stack ISA as CISC architecture
Author: A-11 Date: 2016-04-17 01:51
The recipe below is what I have arrived at so far and can present now.

-- begin recipe --
(0)
Prepare an empty stack, the "Parser stack".
Each stack item holds an immediate value or a load from the architectural stack (which must be renamed to the register file).
But the bottom item of this stack may hold a Hyperblock.

(1-arg)
If the next instruction is an immediate or a load from the architectural stack, push it onto the Parser stack.
ex.
Parser stack before : { [sp+3] ; $4 ; add } ; $5 ;
next instruction : [sp+5]
Parser stack after : { [sp+3] ; $4 ; add } ; $5 ; [sp+5] ;

(1-op)
If the next instruction is an N-in-1-out (inc, add, fma3, etc.) or N-in-0-out ("store [sp+2] top-of-stack", etc.) micro-op:
[1] Pop N items from the Parser stack and concatenate them with the next instruction into a Hyperblock.
[2] If items remain in the Parser stack, output each of them as an individual Hyperblock.

[3-1out] If the next instruction is a 1-out op, leave the Hyperblock made in [1] at the bottom of the Parser stack.
ex.
Parser stack before : { [sp+3] ; $4 ; add } ; $5 ; [sp+5] ; $7
next instruction : add(2-in-1-out op)
outputs :
{ [sp+3] ; $4 ; add } ;
$5 ;
Parser stack after : {[sp+5] ; $7 ; add } ;

[3-0out] If the next instruction is a 0-out op, also output the Hyperblock from [1]; the Parser stack is then empty.
ex.
Parser stack before : { [sp+3] ; $4 ; add } ; $5 ; [sp+5] ; $7
next instruction : store [sp+2] top-of-stack (1-in-0-out op)
outputs :
{ [sp+3] ; $4 ; add } ;
$5 ;
[sp+5] ;
$7 ; store [sp+2] top-of-stack ;
Parser stack after : (empty)

(1-other)
In the other cases (a 2-out op like "divmod", reaching the maximum Hyperblock length, a control flow instruction, and so on),
output each item in the Parser stack as an individual Hyperblock, leaving the Parser stack empty.

(2) To iterate, go back to one of the (1-*) cases depending on the next instruction.
-- end recipe --
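
Here is a toy C++ simulation of the recipe (items are kept as plain strings; this is only a sketch of my parsing rules, not a hardware model - it reproduces the (1-op) example above):

#include <iostream>
#include <string>
#include <vector>

// Toy simulation of the Hyperblock-forming recipe above.
// A Hyperblock is represented as a braced string.
struct Parser {
    std::vector<std::string> stack;                  // the "Parser stack"

    static std::string body(const std::string& it) { // strip outer braces so
        if (it.size() > 4 && it.front() == '{' && it.back() == '}')
            return it.substr(2, it.size() - 4);      // nested blocks flatten
        return it;
    }
    void flush() {                    // output every item as its own block
        for (auto& it : stack) std::cout << it << " ;\n";
        stack.clear();
    }
    void value(const std::string& v) { stack.push_back(v); }   // (1-arg)

    void op(const std::string& name, size_t n_in, int n_out) { // (1-op)
        size_t start = stack.size() - n_in;
        std::string block = "{ ";     // [1] pop N items, concatenate with op
        for (size_t i = start; i < stack.size(); i++)
            block += body(stack[i]) + " ; ";
        block += name + " }";
        stack.resize(start);
        flush();                      // [2] output the remaining items
        if (n_out == 1) stack.push_back(block);   // [3-1out] keep at bottom
        else std::cout << block << " ;\n";        // [3-0out] output; empty
    }
};

int main() {                          // reproduces the (1-op) example above
    Parser p;
    p.value("{ [sp+3] ; $4 ; add }"); // pre-existing bottom Hyperblock
    p.value("$5");
    p.value("[sp+5]");
    p.value("$7");
    p.op("add", 2, 1);  // outputs the first two items and leaves
                        // { [sp+5] ; $7 ; add } at the bottom of the stack
}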

Though I said before that an RS entry tends toward being an individual processor, it must be tiny, and so it lacks many units.
In particular, each RS entry in a current conventional CISC processor has no ability to do flow control, "mov" elimination by renaming, or out-of-order execution.
So a Hyperblock should contain neither "mov" instructions nor concurrent flows like the ones below:
($4 $5 add) ($2 $3 shift-left) // these calculations have no dependency on each other, so the machine should detect the concurrency.
swap-top-2-item div // "swap-top-2-item" should be eliminated by renaming.
This is the reason the Parser stack must be emptied every time a micro-op is encountered.
The Parser stack is a parse-time emulation of the stack inside an RS entry.
You may point out that
$4 ($2 $3 shift-left) add
has no concurrent flow.
I ignore such cases to keep the recipe simple enough to implement easily in hardware.

For ops which have 2 or more outputs, I have not yet found an easy way to treat them.

   
treating stack ISA as CISC architecture
Author: Hubert Lamontagne Date: 2016-04-17 14:35
Agner wrote:
Thank you for the reference to Explicit Data Graph Execution (EDGE) https://en.wikipedia.org/wiki/Explicit_data_graph_execution
If the EDGE principle can be implemented smoothly, it might be more efficient than splitting a job into multiple threads running in multiple cores. The thread management, synchronization, and communication between threads are very complicated and inefficient in current systems. An alternative would certainly be worth exploring.

However, I don't quite understand how the Hyperblocks are coded and delimited.

I do have a design in store that does what EDGE does and should be targetable from C++ compilers, but it's kinda weird:

- Registers are the accumulator (ac), and a rotating register file with 4 partitions of 16 registers (a0-a15, b0-b15, c0-c15, d0-d15). The ALLOC instruction shifts down register file names, so for instance ALLOC d10..d15 will move down the previous contents of d15 to d9, the content of d14 to d8, d13->d7, d12->d6, d11->d5, d10->d4, d9->d3, d8->d2, d7->d1, d6->d0, and the contents of d0..d5 are lost. The new values in registers d10..d15 are marked as "uninitialized" and instructions that try to read them will stall until the registers are written to. ALLOC is always in the form of aN..a15, bN..b15, cN..c15, dN..d15 and can allocate from multiple partitions at the same time.

- Each non-accumulator register can only be written to once (!). This means that it's basically a form of hardware SSA. This is why there are multiple partitions: values have multiple classes of life duration: extremely short (accumulator), very short (would probably use d0..d15, typically used for multi-use temporary values and merging 2 branches of calculation), loop counters (c0..c15, get rewritten every loop), and then various long lived values like loop constants and stack pointers and the like (aN and bN).

- Instructions come in groups. Each group is a sequence of serial instructions, and operates on the accumulator: the first operation of the group must not have any dependency on the previous state of the accumulator, but then subsequent instructions modify the accumulator. Every instruction always writes to the accumulator, plus optionally also has a store to a register from the rotating register file. Example of group:
mul ac, d14, b15 ;no dependency on the previous state of the accumulator, start of group
sar ac, 16
add ac, d13, st d12 ;result goes in both accumulator and d12
add ac, d15, st d11 ;result goes to both accumulator and d11, group ends

- Memory loads/stores must use [register+immediate] as addressing mode (it's not possible to use the accumulator as part of address calculations), and are separately issued in-order, independent of ALU instruction grouping.

- The ALLOC instruction is also not part of ALU instruction grouping.

- The reason for this weird design is that it makes register renaming very easy: every time an ALLOC instruction comes up, you simply increase the rotation index for the target register partitions and clear the 'value ready' bit for the new register names (and also check that the physical registers you're using are free). Once that is done, every single following ALU operation group is _guaranteed_ to be parallelizable (aside from the memory loads/stores) because every register name is unique!

- Each new ALU instruction group is queued to a free ALU, and each ALU runs one instruction per cycle from its group. If the instruction's rotating register file operand doesn't have its 'value ready' bit on, then this instruction group stalls until the value becomes ready (in other words, it waits until the value gets stored by an ALU or memory load operation).

- Register file renaming can be done late by the ALUs, since the rotating register file index of each partition is recorded when the group starts. This also means that every ALU can practically have its own ICACHE - they can be _scheduled_ 100% independently from other instructions!

- Multiple ALU groups can be scheduled per cycle, in any order(!). The only thing that has to be done in order is register renaming using the ALLOC instruction and memory loads and stores.

- For the C++ compiler, working from SSA, it has to tag all the operations that are part of a series of operations, to form groups. This is not that hard as far as I can tell: every time a value is only used by the next operation in the sequence, then it can be passed using the accumulator instead. Generally, all single-destination values can be assigned to the accumulator. Since the compiler knows how long each value lives, it can make sure register file renames using ALLOC never wipe out values that are still in use (although it does have to figure out the lifespan of each value).


This design still needs lots of refinement (especially in the area of loads/stores, calling convention...) and is more complex than I'd like (hence the 'this is weird' warning), and probably also fairly out-there, and potentially doesn't gain too much over the traditional out-of-order RISC (if the whole thing is limited by memory performance in particular), and reminds me of the Itanium at times, but it does have the advantage of being relatively implementable and compilable, and in theory there's no limit to the number of ALUs you can run in parallel (I can easily see designs with 4, 8, even more ALUs).

   
stack ISA versus long vectors
Author: Agner Date: 2016-04-18 00:37
Hubert Lamontagne wrote:
I do have a design in store that does what EDGE does, and should be targettable from C++ compilers but it's kinda weird:

- Registers are the accumulator (ac), and a rotating register file with 4 partitions of 16 registers (a0-a15, b0-b15, c0-c15, d0-d15). The ALLOC instruction shifts down register file names, so for instance ALLOC d10..d15 will move down the previous contents of d15 to d9, the content of d14 to d8, d13->d7, d12->d6, d11->d5, d10->d4, d9->d3, d8->d2, d7->d1, d6->d0, and the contents of d0..d5 are lost. The new values in registers d10..d15 are marked as "uninitialized" and instructions that try to read them will stall until the registers are written to.

Thank you for explaining your idea. It might be a problem that you have only one accumulator.

The best candidate for an independent instruction block is a loop iteration with no loop-carried dependency. I think it would be easier for the compiler in this case to just use a very long vector register with variable length to cover multiple iterations of the loop at once.

The main problem with very long vectors is instructions that move data horizontally across a vector. The latency of horizontal data moves may increase with the vector length. I have an idea to mitigate this problem a little. All instructions that involve horizontal data movement across a vector have information about the distance of the move (e.g. index or shift count) in a separate register or an immediate constant. The scheduler wants to know the latency of the instruction as early as possible. It will be allowed to read the "distance register" at an early stage in the pipeline before the other operands are ready. This value will typically be available early anyway thanks to out-of-order execution. There is probably no way to avoid the data transfer delay, but horizontal moves will be rare anyway. Current designs are already reading registers used in address calculation earlier in the pipeline than operand registers. I want to use a similar mechanism for predicting instruction latency.

It will also be an advantage to know the vector length early so that it can clock gate or power down unused parts of the buses and ALUs to save power.

   
stack ISA versus long vectors
Author: Hubert Lamontagne Date: 2016-04-19 23:16
Agner wrote:
Thank you for explaining your idea. It might be a problem that you have only one accumulator.
It's on purpose! If you need a second accumulator, that means that your calculation has two branches to it (either from a split or a join), which means it has some parallelism to it, which means you want it to run on more than one ALU to exploit that parallelism, which means you need some sort of register to send values between the ALUs. That register for sending values between ALUs can be a second accumulator, but then that's a lot harder to rename, because the scheduler has to go through the whole stream of instructions and give out a whole bunch of different names. What I'm suggesting is that this crazy renaming should be done beforehand, and that's what the whole rotating-name single-write register file is there for: it gives different names to all the renamed versions of the second accumulator beforehand, so that the scheduler has to do renames much less often, and it can do a whole bunch of renames at the same time.

Agner wrote:

The best candidate for an independent instruction block is a loop iteration with no loop-carried dependency. I think it would be easier for the compiler in this case to just use a very long vector register with variable length to cover multiple iterations of the loop at once.

The main problem with very long vectors is instructions that move data horizontally across a vector. The latency of horizontal data moves may increase with the vector length. I have an idea to mitigate this problem a little. All instructions that involve horizontal data movement across a vector have information about the distance of the move (e.g. index or shift count) in a separate register or an immediate constant. The scheduler wants to know the latency of the instruction as early as possible. It will be allowed to read the "distance register" at an early stage in the pipeline before the other operands are ready. This value will typically be available early anyway thanks to out-of-order execution. There is probably no way to avoid the data transfer delay, but horizontal moves will be rare anyway. Current designs are already reading registers used in address calculation earlier in the pipeline than operand registers. I want to use a similar mechanism for predicting instruction latency.

It will also be an advantage to know the vector length early so that it can clock gate or power down unused parts of the buses and ALUs to save power.

Hmm, that's not too hard to do if the vector lengths are in the normal register file, and the SIMD unit is running late compared to the scalar integer unit. Maybe you could have multiple swizzles with different granularities and delay times - like a 1-byte granularity scramble, and then maybe a 4-byte scramble, a 16-byte scramble, a 64-byte scramble... Maybe the CPU could look at the swizzle data, figure out how many 0 bits there are, and use the coarsest correct granularity.
   
stack ISA versus long vectors
Author: Agner Date: 2016-04-20 00:08
Hubert Lamontagne wrote:
If you need a second accumulator, that means that your calculation has two branches to it (either from a split or a join), which means it has some parallelism to it
OK, I see.

that's not too hard to do if the vector lengths are in the normal register file, and the SIMD unit is running late compared to the scalar integer unit. Maybe you could have multiple swizzles with different granularities and delay times - like a 1-byte granularity scramble, and then maybe a 4-byte scramble, a 16-byte scramble, a 64-byte scramble... Maybe the CPU could look at the swizzle data, figure out how many 0 bits there are, and use the coarsest correct granularity.
That's indeed very close to what I have in mind. A permute instruction will have the granularity given by the operand type, one vector output, two vector inputs for input data and indexes, and then a "block size" given by an integer register. For example, if the block size is 16 bytes then data can only be moved within each 16-byte block of the vector. There has to be a maximum block size anyway because the complexity of the circuit grows with the square of the block size, and the latency also grows with the block size. The maximum block size may be implementation dependent, and it may be smaller for low granularities.
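
A reference model of that permute in C++, as I understand it (my sketch; T is the granularity given by the operand type):

#include <cstddef>

// Element i may only select data from within its own block of
// 'block_size_bytes'. The index vector is the second input operand.
template <typename T>
void permute_blocked(T* result, const T* data, const int* index,
                     size_t n_elem, size_t block_size_bytes) {
    size_t per_block = block_size_bytes / sizeof(T);  // elements per block
    for (size_t i = 0; i < n_elem; i++) {
        size_t base = (i / per_block) * per_block;    // start of i's block
        result[i] = data[base + (size_t)index[i] % per_block];
    }
}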
   
treating stack ISA as CISC architecture
Author: A-11 Date: 2016-04-18 08:09
Well... I forgot to explain why the bottom item of the Parser stack can hold a Hyperblock while the other items can't.
Forgive me for adding a bit of explanation below.

-- begin --
(1-op)
If the next instruction is an N-in-1-out (inc, add, fma3, etc.) or N-in-0-out ("store [sp+2] top-of-stack", etc.) micro-op:
[1] Pop N items from the Parser stack and concatenate them with the next instruction into a Hyperblock.
ex.
Parser stack before : { [sp+3] ; $4 ; add } ; $5
next instruction : add(2-in-1-out op)
Hyperblock : { [sp+3] ; $4 ; add ; $5 ; add }
Parser stack after [1] : (empty)
[2] If items remain in the Parser stack, output each of them as an individual Hyperblock.
[3-1out] If the next instruction is a 1-out op, leave the Hyperblock made in [1] at the bottom of the Parser stack.
ex.
Parser stack after [3-1out] : { [sp+3] ; $4 ; add ; $5 ; add }
-- end --

Thanks to Reverse Polish Notation, you are free to replace a constant value with any instruction sequence that produces a value.
ex.
$4 ; $3 ; add
can be replaced into
($2 ; $2 ; add) ; ($9 ; $3 ; div) ; add
and more
(($5 ; $3 ; $-13 ; fma3) ; ($1 ; $1 ; add) ; add) ; (($5 ; $4 ; add) ; (return-three-func-addr ; call-from-addr) ; div) ; add
and so on.
But since each instruction sequence has no dependency on the others, replacing more than one value gives a Hyperblock more than one concurrent flow.
I want to avoid concurrent flows in a Hyperblock, so the recipe allows only one item in the Parser stack (the bottom one) to hold a Hyperblock.

   
Proposal for an ideal extensible instruction set
Author: zboson Date: 2016-04-11 03:24
Hi,

I think it might be worth considering other three-operand single-rounding-mode instructions besides FMA.

As I discovered here
stackoverflow.com/questions/30573443/optimize-for-fast-multiplication-but-slow-addition-fma-and-doubledouble
double-double multiplication can be done with

high = a * b;
low = fma(a, b, -high);

However, double-double addition cannot be done so simply. If there were a three-operand single-rounding-mode (a+b+c) instruction, then double-double addition would be much faster (actually (a+b-c) is perhaps the most interesting). I realize this may seem to be a very special case. But double-double is a kind of poor man's quad-precision floating point. So in that sense it's maybe not such a special case for those interested in quad-precision floating point. The only thing holding double-double back (if an instruction set already has FMA) is a single-rounding-mode (a+b-c), and this would be a lot simpler to implement than full quad-precision floating point support, I think.
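
For comparison, the standard branch-free error-free addition (Knuth's TwoSum) costs six dependent floating-point operations; a single-rounding (a+b-c) instruction would collapse the last five of them into one:

#include <utility>

// Knuth's TwoSum: returns (high, low) with high = fl(a+b) and low the
// exact rounding error, so that a + b == high + low exactly.
std::pair<double, double> two_sum(double a, double b) {
    double high = a + b;
    double bv = high - a;     // the part of the sum that came from b
    double av = high - bv;    // the part of the sum that came from a
    double be = b - bv;       // rounding error in b's part
    double ae = a - av;       // rounding error in a's part
    return {high, ae + be};   // low = total rounding error
}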

   
Proposal for an ideal extensible instruction set
Author: Agner Date: 2016-04-11 08:04
HarryDev wrote:
I think the proposed ISA needs specific instructions for memory copying, initialization, and comparison.
My experience with x86 processors is that these microcoded instructions have very long setup times and that software loops with vector registers are faster in most cases. I want to avoid complex microcode instructions. Instead, I have designed instructions for very fast software loops using variable-length vector registers.

A-11 wrote:

BOOST: Berkeley's Out-of-Order Stack Thingy
Thanks for the interesting reference. As I understand the article, it is more difficult to do out-of-order processing on a stack machine than on a conventional machine.

zboson wrote:

double-double multiplication can be done with

high = a * b;
low = fma(a, b, -high);

However, double-double addition cannot be done so simply. If there were a three-operand single-rounding-mode (a+b+c) instruction, then double-double addition would be much faster (actually (a+b-c) is perhaps the most interesting).

Interesting idea. Actually, I once used that fma trick to get extra precision in a multiplication. So as I understand your idea, you want to do:
high = a + b;
low = a + b - high;  // with a single rounding of the exact intermediate
I think this would be quite costly to implement in hardware, considering the rare use. Instead, you can do this:
high = a + b;
if (abs(a) > abs(b)) {
   low = b - (high - a);   // (high - a) is exact when |a| >= |b|
} else {
   low = a - (high - b);   // (high - b) is exact when |b| >= |a|
}
This does not require any extra hardware.
   
Proposal for an ideal extensible instruction set
Author: Hubert Lamontagne Date: 2016-04-11 21:03
There is an algorithm for doubling the precision of floating-point addition:

Kahan summation

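For reference, a minimal C++ version of it (note that aggressive floating-point compiler optimizations can optimize the correction away):

#include <cstddef>

// Kahan (compensated) summation: the running compensation 'c' recovers
// the low-order bits lost by each addition.
double kahan_sum(const double* x, size_t n) {
    double sum = 0.0, c = 0.0;
    for (size_t i = 0; i < n; i++) {
        double y = x[i] - c;   // apply the correction from the last round
        double t = sum + y;    // low-order bits of y may be lost here
        c = (t - sum) - y;     // recover exactly what was lost
        sum = t;
    }
    return sum;
}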

A-11 wrote:

It is probably too late, but since I found a piece of research today, I will ask about it.

BOOST: Berkeley's Out-of-Order Stack Thingy
https://www.researchgate.net/publication/228556746_BOOST_Berkeley's_Out-of-Order_Stack_Thingy

Just as register machines use "register renaming", these researchers propose "address renaming" for stack machines to implement out-of-order execution.
So I think there is no performance reason to rule out a stack machine.
Since stack machine instruction sets never bother us with the number of registers, I think a stack machine's ISA is more extensible than a register machine's.
Also, a stack ISA is easier to target for compilers, as the UCSD p-System showed.
This should also hold true for just-in-time compilation.

That's an interesting architecture. Though I'm not sure I like how it's basically doing 2 or 3 memory accesses per cycle but then removing these accesses using aggressive renaming, and how the stack and function-local memory accesses seem to be not totally consistent with the "real" memory loads/stores through pointers (otherwise it would have to stall every time, no?).

Come to think of it, are there other "not totally consistent" memory models that are useful, performance-wise? (aside from the standard multi-core-with-memory-barriers-and-so-forth stuff)

The one thing I do like about stack based architectures is how they reduce the number of register file reads/writes, and intuitively I'd be more interested in a model that (1) also has registers (i.e. it's not using the stack to store values but rather to keep track of very short lived temporary values - and it has the nice property that the lifespan of values is well known), (2) has a limited stack depth - something like 16 values with no automatic spilling (i.e. just enough for calculations; function calls would still be done using the normal RISC mechanism and explicit memory operations for register/stack spills), (3) you're supposed to keep the stack empty as often as possible so that the CPU can automatically figure out which sections it can run in parallel (each time the size falls to 0, that's a new parallelizable section).

   
Proposal for an ideal extensible instruction set
Author: Agner Date: 2016-04-12 00:12
Hubert Lamontagne wrote:
The one thing I do like about stack based architectures is how they reduce the number of register file reads/writes, and intuitively I'd be more interested in a model that (1) also has registers (i.e. it's not using the stack to store values but rather to keep track of very short lived temporary values - and it has the nice property that the lifespan of values is well known), (2) has a limited stack depth - something like 16 values with no automatic spilling (i.e. just enough for calculations; function calls would still be done using the normal RISC mechanism and explicit memory operations for register/stack spills), (3) you're supposed to keep the stack empty as often as possible so that the CPU can automatically figure out which sections it can run in parallel (each time the size falls to 0, that's a new parallelizable section).
That sounds useful, but isn't this the same as x87? The main problem with x87 is that you have to do a lot of register swapping to get each operand to the top of the stack when you need it. Each value gets swapped to different positions several times through its life time so that it gets difficult to track where each value is. I haven't seen any compiler that can handle register variables nicely and keep them on the stack. If you want to avoid all that swapping then you need to access registers that are not on the top of the stack, and then the stack idea sort-of disappears. Is there a better way of doing this?
   
Proposal for an ideal extensible instruction set
Author: Hubert Lamontagne Date: 2016-04-12 23:59
Agner wrote:
Hubert Lamontagne wrote:
The one thing I do like about stack based architectures is how they reduce the number of register file reads/writes, and intuitively I'd be more interested in a model that (1) also has registers (i.e. it's not using the stack to store values but rather to keep track of very short lived temporary values - and it has the nice property that the lifespan of values is well known), (2) has a limited stack depth - something like 16 values with no automatic spilling (i.e. just enough for calculations; function calls would still be done using the normal RISC mechanism and explicit memory operations for register/stack spills), (3) you're supposed to keep the stack empty as often as possible so that the CPU can automatically figure out which sections it can run in parallel (each time the size falls to 0, that's a new parallelizable section).
That sounds useful, but isn't this the same as x87? The main problem with x87 is that you have to do a lot of register swapping to get each operand to the top of the stack when you need it. Each value gets swapped to different positions several times through its life time so that it gets difficult to track where each value is. I haven't seen any compiler that can handle register variables nicely and keep them on the stack. If you want to avoid all that swapping then you need to access registers that are not on the top of the stack, and then the stack idea sort-of disappears. Is there a better way of doing this?
Not quite... x87 lacks non-stack registers and isn't designed to help the CPU figure out which sections of the calculation are parallelizable. If you start doing register swapping (like the whole fxch+op thing) then it just becomes a sort of register file. Here, the CPU starts a new parallelizable section every time the stack size falls to zero. Each parallelizable section has its own stack, and it can be shown that the sections can't interfere (register file accesses still have to be renamed and memory conflicts still have to be solved though, and a branch prediction failure can invalidate speculative sections). A similar architecture can be designed using an accumulator + a register file.

Example code (linear interpolation of 16bit data):


loop:
push r9
shr 16
push short [r8 + stack*2]
pop r0 ;renames r0

;this section can execute in parallel (once registers have been renamed)
push r9
shr 16
add 1
push short [r8 + stack*2]
sub r0 ;waits after renamed r0
push r9
and 0xffff
mul stack
sar 16
add r0
pop short [r7]

;this section can execute in parallel
push r9
add r10
pop r9 ;renames r9

;this section can execute in parallel
push r7
add 2
dup
pop r7 ;renames r7
cmp r11
branch if lower to loop

   
Version 1.01
Author: Agner Date: 2016-05-10 10:28
The instruction set specification is now updated to version 1.01: www.agner.org/optimize/instructionset.pdf

The instruction set has got the name CRISC1 to indicate the compromise between RISC and CISC.

All user-level instructions are now defined.

The most important change is that the length of a vector is now saved in the vector register itself. This has many advantages. When you need to save a vector register in a function or at a task switch, you only have to save the part of the register that is actually used. If the register is unused then you only have to save a zero for the length. Please see the document for further discussion of the advantages.
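
As a rough sketch of what such a saved register image might look like (my own illustration; the field sizes are assumptions, not taken from the specification):

#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustration only: a saved vector register image where the length is
// stored first and only the used bytes follow. The 64-bit length field
// is my assumption, not a field size from the CRISC1 specification.
// Returns the number of bytes the image occupies.
size_t save_vector_reg(uint8_t* dst, const uint8_t* reg_data,
                       uint64_t reg_length) {
    memcpy(dst, &reg_length, sizeof(reg_length));            // length first
    memcpy(dst + sizeof(reg_length), reg_data, reg_length);  // used part only
    return sizeof(reg_length) + reg_length;   // an unused register costs
}                                             // just the zero length word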

   
Version 1.01
Author: Hubert Lamontagne Date: 2016-05-13 16:05
Nice. Do you have any plans for writing an assembler or a simulator?
(or other more involved stuff like a verilog/vhdl implementation or a c compiler... though that's a lot of work!)
   
Version 1.01
Author: Agner Date: 2016-05-14 05:23
Hubert Lamontagne wrote:
Do you have any plans for writing an assembler or a simulator?
(or other more involved stuff like a verilog/vhdl implementation or a c compiler... though that's a lot of work!)
That is the plan. Right now I am working on the ELF object file format for this architecture.

But I am very busy with other projects too, so progress will be slow as long as this is a one man project.

I have an idea for how to implement system calls. It is a system call instruction which takes as input a 64-bit ID number. The upper 32 bits are a module ID which identifies a system module or device driver (the system core has ID = 0). The lower 32 bits identify a function within this module. System add-on modules and device drivers do not necessarily have fixed ID numbers, because this would require some central authority to assign these ID numbers. Instead, the program will have to ask for the ID number by giving the name of the module. The functions within a module can have fixed or variable ID numbers.

There will be a system function (with a fixed ID number) which takes the names of the module and function as input and returns the ID number. The ID number can be retrieved in this way before the first call to the function.

The ID number can be put into the program in three ways:
1. The most important system functions have fixed ID numbers which can be inserted at compile time.
2. The ID number can be found at load time in the same way as relocation works. The loader will find the ID number and insert it in the code before running the program.
3. The ID number is found at run time before the first call to the desired function.

The system call instruction can get the necessary ID number either in a 64-bit register operand or in an immediate constant with 32 bits for the module ID and 16 bits for the function ID (assuming that the module has no more than 65000 functions).

The calling convention for system functions is the same as for other functions, using registers for parameters and for return value. Any register not used for a parameter to the system function can be used for the ID of the function.

The system call instruction will change the instruction pointer and stack pointer and proceed to system code. The old values of instruction pointer and stack pointer are saved in special registers, to be restored by a system return instruction.
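
A sketch of how a program might use this, in C++ (all names and numbers here are my hypothetical illustrations, not part of the specification):

#include <cstdint>

using SysId = uint64_t;

// The upper 32 bits identify the module, the lower 32 bits the function:
constexpr SysId make_sys_id(uint32_t module_id, uint32_t function_id) {
    return (SysId(module_id) << 32) | function_id;
}

// Placeholder standing in for the system call instruction (the real thing
// takes the ID in a register or immediate and jumps into system code):
uint64_t sys_call(SysId id, ...) { (void)id; return 0; }

// The system core has module ID 0; assume the name-to-ID lookup function
// has a fixed function ID (the number 1 is my placeholder):
const SysId GET_FUNCTION_ID = make_sys_id(0, 1);

uint64_t call_by_name(const char* module, const char* function,
                      uint64_t arg) {
    // Resolve the ID once before the first call, as described above;
    // parameters and the return value use the normal calling convention.
    SysId id = sys_call(GET_FUNCTION_ID, module, function);
    return sys_call(id, arg);
}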

   
Version 1.01
Author: HarryDev Date: 2016-06-02 02:13
progress will be slow as long as this is a one man project.
Could I suggest that you move the development incl. discussions of CRISC to GitHub instead? This would no doubt make collaboration easier, allow for pull requests, etc. And allow for topic focused issue discussions so the many different aspects of CRISC can be discussed in different threads.

Practically, I would change the CRISC document to a set of GitHub-flavored markdown files. These are easily editable and work well with git version control. In time, code, simulators, etc. can be added to this project as well.

I believe this would make the project more available to a wider audience and increase collaboration. You could set up an organization for this on GitHub and also use several repositories if that would be better.

   
Public repository
Author: Agner Date: 2016-06-02 05:16
Harry wrote:
Could I suggest that you move the development incl. discussions of CRISC to GitHub instead?
You are right. This project has developed into more than I initially expected and it is approaching a level where it makes sense to move it to a public repository. What are the pros and cons of the different public repositories, such as GitHub, SourceForge, Savannah, etc.?

Version 1.02 of the CRISC1 document is underway with more detailed specifications of system calls, memory management, object file format, etc. This will no longer fit into a single file.

   
Public repository
Author: HarryDev Date: 2016-06-02 13:54
Not that I have a great amount of experience with other public repositories, but GitHub really has a very good story with regard to issues and the discussions related to them. It is also easy to make pull requests and have discussions around these, which encourages collaboration. Additionally, Markdown is supported by default and rendered on the web page. It also supports tables, which not all markdown flavors do. In addition, the GitHub for Windows install is great for working with git from the "Git Shell" PowerShell command line. Since it is git, you will always have a local copy of everything, and you can use your own custom git repo for backup or similar if you ever want to move to something else.

GitHub is also becoming the go-to place for open source projects from Microsoft e.g. .NET Core etc. and many others.

I would not recommend sourceforge at all. It is terrible in my view.

I would recommend GitHub.

Just to give an example you can see how the .NET coreclr repo looks like here: https://github.com/dotnet/coreclr and how docs are an integrated part of it.

   
Public repository
Author: HarryDev Date: 2016-06-02 13:56
You could use pandoc to convert Word docx (if that is your format) to markdown; see ronn-bundgaard.dk/blog/convert-docx-to-markdown-with-pandoc/
   
Public repository
Author: Agner Date: 2016-06-09 11:53
Harry wrote:
Could I suggest that you move the development incl. discussions of CRISC to GitHub instead?
The name CRISC is taken on GitHub, but there is no activity whatsoever on the name and no contact address. I don't know if it is possible to take over an inactive GitHub account.

Anyway, the world has enough short acronyms. Does anybody have a suggestion for another name for this project?

   
Rethinking DLLs and shared objects
Author: Agner Date: 2016-05-20 05:07
Windows systems use dynamic link libraries (DLLs) and Unix-like systems (Linux, BSD, Mac OS) use shared objects (SOs). Both types have a number of disadvantages. I have an idea for replacing DLLs and SOs with something more efficient in the new CRISC architecture.

Let me first explain how the current systems work. A normal program goes through the steps of compiling, linking, and loading. The linker joins multiple program modules and library functions together and adjusts all addresses in the linked program. Any absolute addresses in the code are adjusted according to the placement of each module, and all relative addresses from one module to another are calculated. This process is called relocation. The loader may do the relocation once again in case the program is placed at a memory address different from the address that was assumed in the link process. A DLL or SO is linked and loaded in basically the same way as an executable program.

The main difference between a DLL and a SO is that shared objects allow "symbol interposition". It is possible to override a symbol in a SO by defining another symbol with the same name in the calling program. This feature is intended to mimic the behavior of a static link library, but symbol interposition is hardly ever used and it comes at a high cost. Every access to a function or a global variable needs to go via a procedure linkage table (PLT) or a global offset table (GOT). This applies even to internal references inside the SO if the symbol is globally visible. It would be easy to bypass this time-consuming mechanism if the linker and loader allowed it, but for unknown reasons, they don't.
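To make the cost concrete, here is a minimal sketch of interposition in C (the file names and greet functions are my own illustration). Because greet is globally visible, even the library's internal call in greet_twice goes through the PLT and is overridden by the main program:

    /* libgreet.c -- build as a shared object:
       gcc -shared -fPIC libgreet.c -o libgreet.so */
    #include <stdio.h>

    void greet(void) { puts("hello from the library"); }

    void greet_twice(void) {
        greet();   /* via the PLT: resolves to main's greet if one is defined */
        greet();
    }

    /* main.c -- build with: gcc main.c ./libgreet.so -o main */
    #include <stdio.h>
    void greet(void) { puts("hello from main"); }   /* interposes the library's greet */
    void greet_twice(void);                         /* from libgreet.so */
    int main(void) { greet_twice(); return 0; }     /* prints "hello from main" twice */

Linking the shared object with -Wl,-Bsymbolic binds such internal references at link time and removes the detour, which suggests the cost is a policy choice rather than a technical necessity.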

The advantage of using a DLL or SO is that the loaded code can be shared between multiple running programs. This rarely saves any memory, however, because the library may contain hundreds of functions while you are only using a few of them. It is not uncommon to load a DLL or SO of one megabyte and use only one kilobyte of it.

Another problem that makes DLLs and SOs less efficient than static libraries is that each is scattered around in its own memory block. Each DLL/SO uses at least two memory pages, one for code and one for data. The scattered memory accesses make caching less efficient.

Now, my proposal for the CRISC architecture is to get rid of DLLs and SOs completely. Instead, we will have only one type of function library, which can be used in three different ways:

  1. Static linking. This will work the same way as today.
  2. Load-time linking. The library is linked with the executable program by the loader when the program is loaded into memory.
  3. Run-time linking. The library is loaded by commands in a running program.

In all three cases, we are loading only those functions from the library that are actually needed.

Load-time linking will be easier with the CRISC system than with existing systems because the CODE and DATA sections are independent of each other in the CRISC system. The CODE (and CONST) sections are addressed relative to the instruction pointer, while the DATA section is addressed relative to a special pointer called the data section pointer (DATAP). CODE and DATA can be placed anywhere in memory independently of each other. If extra library functions need to be linked in at load time, then the CODE and DATA sections of the library functions are simply appended to the CODE and DATA sections of the main program, and any new cross references are resolved. The result will be very efficient because the code and data of the library functions are contiguous with the code and data of the main program, so that caching is improved. There are no intermediate import tables or PLTs to slow down the execution.
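As a rough illustration of how simple the loader's job becomes, here is a sketch in C. All structures and the patch function are invented for illustration; the real object file format is not yet fixed:

    /* Sketch: load-time linking by appending the library's sections. */
    #include <stdint.h>
    #include <string.h>

    typedef struct { uint32_t kind, offset; int32_t addend; } reloc;
    typedef struct {
        uint8_t *code; size_t code_size;   /* ip-relative CODE section */
        uint8_t *data; size_t data_size;   /* DATAP-relative DATA section */
        reloc   *relocs; size_t nrelocs;
    } module;

    void patch(module *prog, const reloc *r);  /* resolve one cross reference (not shown) */

    void link_in(module *prog, const module *lib) {
        /* CODE stays ip-relative and DATA stays DATAP-relative, so both
           sections can simply grow at the end of the main program's. */
        memcpy(prog->code + prog->code_size, lib->code, lib->code_size);
        memcpy(prog->data + prog->data_size, lib->data, lib->data_size);
        prog->code_size += lib->code_size;
        prog->data_size += lib->data_size;
        for (size_t i = 0; i < lib->nrelocs; i++)
            patch(prog, &lib->relocs[i]);      /* resolve new cross references */
    }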

Run-time linking is used less often. It is needed only when the choice of library depends on user input to the running program, or when a library is loaded explicitly from an interpreted (non-compiled) script language. The loader can use several different methods when run-time linking is requested:

  1. The main program may have reserved extra memory space for the library functions. This information is stored in the header of the executable program file. The library function is accessed through a function pointer which is returned by the load_library function. Any DATA section in the library can be addressed through DATAP, using the normal relocation procedure.
  2. If there is no reserved space, or the reserved space is too small, then the loader must place the library function somewhere else in memory. If there is a vacant memory space within a distance of +/- 2 GB from the address in DATAP then the same method as above is used.
  3. If there is no vacant space within 2 GB of DATAP then the loader can insert a stub that changes DATAP to point to the DATA section of the library function. The function is called through this stub, which changes DATAP when called, and restores the old value of DATAP on return. If the function can throw an exception then the exception handler needs to restore DATAP as well.
  4. The library function can be compiled with a compiler option that tells it not to use DATAP. The function will load the absolute address of its DATA section into a general purpose register and access its data with this register as pointer.
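From the application's point of view, run-time linking could look like the following sketch. Only the name load_library appears above; its signature here is an assumption:

    /* Hypothetical run-time linking, seen from the application. */
    void *load_library(const char *lib, const char *func);   /* placeholder API */

    typedef double (*sin_fn)(double);

    void example(void) {
        sin_fn f = (sin_fn) load_library("mathlib", "fast_sin");
        if (f) {
            double y = f(1.0);   /* direct call; a DATAP-switching stub (method 3
                                    above) is only inserted if the library's DATA
                                    section cannot be placed within 2 GB of DATAP */
            (void) y;
        }
    }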

If lazy loading of a program module is desired then use the same method as for run-time linking, or put the lazy module into a separate executable file.

Newer versions of Linux have a feature called GNU indirect functions, which makes it possible to choose between different versions of a function at load time depending on, for example, the microprocessor version. This feature will not be copied in the CRISC system because it relies on a PLT. Instead, we can make a dispatcher system to be used with load-time linking. The library can contain a dispatch function that tells which version of a library function to load. The loader will first load the dispatch function (possibly using run-time linking into itself) and call it. The dispatch function returns the name of the chosen version of the desired function. The loader then unloads the dispatch function and links the chosen function into the main program. The dispatch function must have access to information about the hardware configuration, command line parameters, environment variables, and anything else that it might need to choose which version of the function to use.
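A dispatch function could be as simple as the following sketch. The hardware-info structure and all names are assumptions, since the interface between loader and dispatcher is not yet specified:

    /* Returns the name of the function version the loader should link in. */
    struct hw_info { int max_vector_length; int has_fma; };

    const char *dispatch_fast_sin(const struct hw_info *hw) {
        if (hw->max_vector_length >= 64 && hw->has_fma)
            return "fast_sin_v64";      /* wide-vector version */
        return "fast_sin_generic";      /* fallback for any hardware */
    }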

System functions and device drivers are called by using an ID number rather than a function pointer. This ID number can be resolved at link time, load time or run time just like library functions.

The advantages of my proposal are:

  • There is only one type of function library. The same library can be used with any of the three methods: static linking, load-time linking, and run-time linking.
  • Only the part of the library that is actually needed is loaded.
  • The code and data of the library are contiguous with the code and data of the calling program in most cases. This makes memory management simpler, avoids memory fragmentation, and improves caching.
  • There are no intermediate import tables, procedure linkage tables or global offset tables to reduce performance.
Any comments?
   
Rethinking DLLs and shared objects
Author: cv Date: 2016-05-20 12:33
Hi,

How would you deal with non-reentrant functions (and their global data) belonging to the same library in all three of your types?
And with the non-exported helper functions internal to the library that could potentially modify that global data?

   
Rethinking DLLs and shared objects
Author: Agner Date: 2016-05-20 13:51
cv wrote:
How would you deal with non-reentrant functions (and their global data) belonging to the same library in all three of your types?
And with the non-exported helper functions internal to the library that could potentially modify that global data?
Library functions should be thread-safe by default because we are aiming at high performance which often means multiple CPU cores running multiple threads. Of course it is possible to make functions that are not thread-safe or reentrant, whatever your reason might be for doing so. A function with static data is likely to be thread-unsafe unless you take special precautions. My proposal allows functions to have a static data section that can be used for whatever you want. If you want multiple library functions to share the same static data then you can either put them into the same module or put labels on the data with public visibility.
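For illustration, here is the classic C pattern that separates the two cases (the function names are mine):

    #include <stdio.h>

    /* Not thread-safe: all threads share one static buffer. */
    const char *format_id_unsafe(int id) {
        static char buf[32];                 /* lives in the static data section */
        snprintf(buf, sizeof buf, "id-%d", id);
        return buf;                          /* racy if two threads call this */
    }

    /* Thread-safe and reentrant: the caller supplies the storage. */
    const char *format_id_r(int id, char *buf, size_t n) {
        snprintf(buf, n, "id-%d", id);
        return buf;
    }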
   
Rethinking DLLs and shared objects
Author: Peter Cordes Date: 2016-05-30 08:06

I like the idea of getting rid of the indirection on every call to a library function; that seems like a good plan. Current software is of course designed around today's lazy dynamic linking, so this non-lazy scheme may not work well with some existing software.

GUIs tend to use a lot of library code. I'm worried that copying so much code into private memory for every process will use significant amounts of memory. It won't be touched a lot of the time, but everything that's reachable from the Qt or GTK API functions used by a program has to be read from disk and copied into the process's private memory, not just mmap(MAP_SHARED). This means it can only be paged out to swap space, since the library text isn't backed by a file on disk.

Web browsers build much of their code as libraries. Many browsers use multiple processes instead of just multiple threads, as an extra barrier against vulnerabilities. On Linux, I assume Chrome just fork()s, so memory that isn't modified after that is shared. AFAIK, Windows doesn't have an equivalent system call, so each process might be stuck with its own private copy of all the library code.

In summary, I think the current state of desktop software is too bloated for this to be a good idea.

If we hope to ever see any of this proposal get used for real (at some point in the far future?), I think we might need some kind of lazy dynamic linking, or at least shared text segments for libraries. Vendors will always want to be able to sell cheap machines with the minimum amount of memory.

Non-lazy linking sounds great for long-running processes that don't overdo it with libraries (e.g. a web server process), but I'm also worried about higher startup overhead for commands used from shell-scripts. With lazy dynamic linking, code for functions that are never called (because that command-line option wasn't used) doesn't ever have to be read into memory. OTOH, most programs will use the same libc functions, and having them hot in the pagecache means it's cheap for the dynamic linker to memcpy them.

It would be interesting to run some tests comparing shell-script speed (cut/grep/sed/cat/find) with Linux lazy dynamic linking vs. this proposal. I think this could work on x86, or any other arch. Linux supports custom executable file formats, or for this experiment we could use ELF with a different dynamic linker as a custom ELF-interpreter. IDK if we'd want a different library object-file format, though, to make it easy and fast to recursively find all the functions a library-function could itself call.
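For reference, pointing a test binary at an experimental dynamic linker needs only the standard GNU ld flag that sets the ELF interpreter (PT_INTERP); the loader path below is of course hypothetical:

    /* hello.c -- build with:
     *   gcc hello.c -o hello -Wl,--dynamic-linker=/opt/experiment/ld-nonlazy.so
     * The kernel hands control to the named interpreter, which could then do
     * the copy-based, non-lazy linking discussed above before jumping to main.
     */
    int main(void) { return 0; }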

   
Rethinking DLLs and shared objects
Author: Agner Date: 2016-05-30 11:11
Peter Cordes wrote:
GUIs tend to use a lot of library code. I'm worried that copying so much code into private memory for every process will use significant amounts of memory.
If the GUI library is big, it may run in a separate process. The cost of this is that communication between library and main program will be slower because it needs to switch the memory map.

most programs will use the same libc functions
I often use static linking of libc. It works fine and it doesn't copy too much code into the executable file. If a GUI library has a proper modular design, it should be possible to link it statically without copying everything. It might be possible to test this with some open source GUI library to see how much code will be copied with static linking.

Linux supports custom executable file formats, or for this experiment we could use ELF with a different dynamic linker as a custom ELF-interpreter.
That's interesting. I guess this requires that you make your own loader. I would still use ELF format but add a custom section type.
   
Rethinking DLLs and shared objects
Author: Joe Duarte Date: 2016-06-17 00:41
I don't think bloated GUI libraries will be an issue. Remember that Agner is proposing an entirely new ISA. Windows 8 and Mac OS 10.11 will never be ported to this ISA. Let's imagine that Agner's proposal garnered good reviews and wide industry attention. In an optimal scenario, we might see some version of CRISC implemented in silicon in 2023 or so. Realistically, any platforms deployed on it will be new, possibly built from scratch, like unikernels or Microsoft Research's library OSes. GUIs can be implemented very lightly now with immediate-mode libraries like imgui: https://github.com/ocornut/imgui

Statically-linked executables are gaining momentum with Go and, on the C side, musl. A musl executable is much lighter than a glibc one.

Agner, it seems like your proposal could either eliminate the need for tree-shaking, or require a new kind of tree-shaking, depending on implementation details. Would code be tree-shaken by default, in that nothing is included unless it's used, kind of an opt-in vs opt-out scenario?

You'll have to cover the security angles here or no one will accept it. There are lots of pointer memory management add-ons to C right now that make decisions about which kinds of pointers will be stored where and so forth. How will your ISA integrate with those methods? I'm thinking of things like Code Pointer Integrity and SafeStack: dslab.epfl.ch/proj/cpi/

   
Rethinking DLLs and shared objects
Author: Agner Date: 2016-06-18 00:44
Joe Duarte wrote:
GUIs can be implemented very lightly now with immediate-mode libraries like imgui: https://github.com/ocornut/imgui

Statically-linked executables are gaining momentum with Go and, on the C side, musl. A musl executable is much lighter than a glibc one.

Agner, it seems like your proposal could either eliminate the need for tree-shaking, or require a new kind of tree-shaking, depending on implementation details. Would code be tree-shaken by default, in that nothing is included unless it's used, kind of an opt-in vs opt-out scenario?

All function libraries will work like static libraries (*.lib in Windows, *.a in Unix) in the sense that they include only what is needed.

You'll have to cover the security angles here or no one will accept it. There are lots of pointer memory management add-ons to C right now that make decisions about which kinds of pointers will be stored where and so forth. How will your ISA integrate with those methods? I'm thinking of things like Code Pointer Integrity and SafeStack: dslab.epfl.ch/proj/cpi/
Thanks for the interesting links. You are touching on an important topic. We have to design security into the system right from the beginning.
I have proposed a dual stack system where the call stack is separate from the data stack. A buffer overflow will be unable to compromise the return address of a function. Jump tables for switch/case statements and virtual function pointer tables for C++ polymorphism can be placed in the CONST section which is read-only. What remains is function pointers declared explicitly in the program. These may be placed in a protected static memory area or on the heap at a non-predictable address. An input buffer may be placed in its own sandbox where it is protected from overflow.
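As a small C illustration of the read-only dispatch idea (my example, not part of the spec), a const table of function pointers would land in the CONST section and cannot be overwritten:

    /* Dispatch table in read-only memory, equivalent to a switch jump table. */
    static void op_add(void) { /* ... */ }
    static void op_sub(void) { /* ... */ }

    static void (* const dispatch[2])(void) = { op_add, op_sub };

    void run_op(unsigned n) {
        if (n < 2)          /* bounds check before the indirect call */
            dispatch[n]();
    }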
   
Rethinking DLLs and shared objects
Author:  Date: 2016-06-18 04:40
About security, some time ago I had something interesting (or not) in mind that could be useful for a new ISA.

What about making a special "function entry" instruction that has to be placed as the first instruction of every function? The call instructions would check whether they point to this special instruction and bail out if they do not. This way it is no longer possible to call into the middle of a function. For tail calls, we could make an instruction that behaves like a jump but also checks for this restriction. I am not sure whether the extra instruction (or maybe a bit in the first instruction?) is worth it, but we could combine it with what is usually placed at the start of a function (like stack manipulation to reserve local variables) in order to save space.

We could also place some kind of function metadata there (like which registers are used, which are inputs, which are outputs, etc.).
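Here is how the check might look in an emulator, written as a C sketch; the opcode value and all names are hypothetical, and the helper functions are only declared (implementations not shown):

    #include <stdint.h>

    #define FUNC_ENTRY 0x1234u        /* hypothetical "function entry" opcode */
    #define EX_BAD_CALL_TARGET 13     /* hypothetical exception vector */

    typedef struct { uint64_t ip; /* ... */ } cpu_t;

    uint32_t fetch32(cpu_t *cpu, uint64_t addr);      /* read code memory */
    void push_return_address(cpu_t *cpu);
    void raise_exception(cpu_t *cpu, int vector);

    void emulate_call(cpu_t *cpu, uint64_t target) {
        if (fetch32(cpu, target) != FUNC_ENTRY) {     /* not a valid entry point */
            raise_exception(cpu, EX_BAD_CALL_TARGET);
        } else {
            push_return_address(cpu);
            cpu->ip = target;
        }
    }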

   
Rethinking DLLs and shared objects
Author: Freddie Witherden Date: 2016-06-02 17:37
With regards to static vs dynamic linking it is probably worth considering the following note by Ulrich Drepper:

https://www.akkadia.org/drepper/no_static_linking.html

who was the maintainer of glibc for several years and the author of the excellent series on "What Every Programmer Should Know About Memory". On balance -- and speaking as someone who comes from the HPC side of things -- I am inclined to agree with his points.

With regards to: "It is not uncommon to load a DLL or SO of one megabyte and use only one kilobyte of it." on most (all?) systems .so's are mmap'ed and so benefit from on-demand paging; hence you only pay for cost for what you use viz-a-viz the page size. Further, while an application may call only a small number of functions from an .so those functions inside of the .so may go on to call a large number of additional -- internal -- functions. These 'iceberg' .so's are very common for GUI libraries and the such like. Of course these internal functions need not be exported and hence should not suffer from a performance penalty beyond that associated with being PIC.

   
Rethinking DLLs and shared objects
Author: Agner Date: 2016-06-04 03:12
Freddie Witherden wrote:
With regards to static vs dynamic linking it is probably worth considering the following note by Ulrich Drepper
Dynamic linking relies heavily on memory paging. My proposal for CRISC1 is to avoid memory paging completely. I am replacing dynamic linking with load-time linking which can do the same as dynamic linking with respect to updating a library without updating the calling program. There is even a feature for choosing between different libraries or functions at load time. Very large libraries, such as GUI frameworks, can be loaded as a separate process.

There are significant performance costs to DLLs and SOs which I want to avoid. Each DLL/SO has its own memory pages for code and data. You will be wasting a whole 4kB page even if the function uses only a few hundred bytes. The data section of a DLL/SO cannot be shared. The code section can be shared in most cases, but it is not shared if it contains relocated cross-references. The amount of memory that you are saving by sharing functions between multiple processes is very small compared to the amount of memory that you are wasting by loading a lot of other functions that your program is not using.

The dynamic libraries are scattered around in memory which means poor caching. Many functions are placed at the start of a memory page. This means that cache line sets with low numbers are used disproportionately more, or you need a more complicated cache design to distribute the load on cache line sets more evenly.

If a DLL contains many functions and some of these functions - not the ones you are using - link to other DLLs, then you will load these other DLLs unnecessarily, or suffer the unpredictable response time of lazy loading.

Unix-style shared objects have further performance losses due to the requirement of the so-called position-independent code, which actually involves something completely different, namely symbol interposition. 32-bit x86 shared objects have an additional performance problem because the instruction set has no position-independent addressing mode.

   
Rethinking DLLs and shared objects
Author: Freddie Witherden Date: 2016-06-04 13:49
Out of interest, have you checked out the work of Darek Mihocka, who some years ago wrote a multi-part series on fixing x86 (and, interestingly, proposed eliminating hardware paging):

www.emulators.com/docs/nx03_10fixes.htm
www.emulators.com/docs/nx04_malloc.htm
www.emulators.com/docs/nx05_vx64.htm

With regards to the cost of PIC, I do not believe that it is as high as it used to be. Case in point: the majority of executables shipped by modern distributions are PIC as a means of improving the effectiveness of ASLR. Yet, when this change happened, I do not recall seeing complaints from users or any serious performance degradation.

   
Rethinking DLLs and shared objects
Author: Agner Date: 2016-06-06 14:25
Freddie Witherden wrote:
have you checked out the work of Darek Mihocka, who some years ago wrote a multi-part series on fixing x86 (and, interestingly, proposed eliminating hardware paging)
Thank you for the interesting links. His experiments show that the costs of the complicated memory management and the huge page tables are even higher than I thought. I agree with him that we need a fundamental redesign.

Many of the security problems that Darek discusses can be solved with CRISC1. The memory map in my proposal is so small that it is no problem to give each thread in a process its own memory map and its own private data space that cannot be accessed from other threads. Do you know if any other systems have memory protection between threads? I can't find any (other than emulators). Of course we have thread-local storage, but that does not guarantee that one thread cannot corrupt data belonging to another thread in the same process in case of program bugs or malicious hacking.

I don't see any reason why different threads should have access to each other's private data. Synchronization structures (e. g. semaphores) and resource sharing between threads should go through the shared main memory of the application, not the private memory of one thread.

   
Is it better to have two stacks?
Author: Agner Date: 2016-06-05 13:26
In the beginning of this thread I argued against having a link register. Storing return addresses on the stack is simpler.

Now I wonder if it is better to have two stacks: a call stack for return addresses and a data stack for the local variables of functions. The call stack will be quite small in most cases because the nesting level of function calls is limited in most programs. We can have a small rolling stack on the chip which is used most of the time. If the on-chip stack overflows then it must be spilled to memory. Let's take an example: we have a rolling stack on the chip with 16 entries, and we are running a program where function calls are nested 20 deep. The on-chip stack begins spilling to memory at call level 17. The on-chip stack entries are overwritten one by one (oldest first) on each deeper call, so after call number 20 there will be 4 entries that have been spilled and overwritten. The first 16 returns after the deepest call can take the return addresses from the on-chip stack. We don't have to reload the on-chip stack from memory until we come down to level 4. We will never have to spill the on-chip stack to memory inside a loop more than on the first iteration unless there is very deep function nesting or there are recursive functions inside the loop. In other words, the costs of spilling the on-chip stack to memory are minimal because spilling will not occur repeatedly in a loop except with recursive functions.
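A small software model of this rolling stack may make the mechanism clearer (sizes and names are illustrative; a real implementation would be in hardware):

    #include <stdint.h>
    #define ONCHIP 16

    typedef struct {
        uint64_t ring[ONCHIP]; /* on-chip entries, used as a circular buffer */
        unsigned top;          /* total calls minus returns (logical depth) */
        unsigned onchip;       /* how many of the newest entries are on chip */
        uint64_t *spill;       /* next free slot in the in-memory spill area */
    } callstack;

    void cs_call(callstack *cs, uint64_t ret) {
        if (cs->onchip == ONCHIP) {        /* full: spill the oldest entry */
            *cs->spill++ = cs->ring[(cs->top - cs->onchip) % ONCHIP];
            cs->onchip--;
        }
        cs->ring[cs->top % ONCHIP] = ret;
        cs->top++; cs->onchip++;
    }

    uint64_t cs_return(callstack *cs) {
        if (cs->onchip == 0) {             /* empty on chip: refill one entry */
            cs->ring[(cs->top - 1) % ONCHIP] = *--cs->spill;
            cs->onchip++;
        }
        cs->top--; cs->onchip--;
        return cs->ring[cs->top % ONCHIP];
    }

In the 20-deep example, this model spills exactly four entries (at calls 17 through 20) and does not touch the spill area again until execution returns below level 4, matching the description above.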

In fact, modern microprocessors already have such a rolling call stack on the chip. It is used for predicting return addresses. We might as well use this structure as the genuine call stack rather than just a shadow stack used for prediction. The prediction of return addresses will then be simple and perfect, of course.

There is also a security advantage to having a separate call stack. The return address of a function cannot be overwritten by software bugs or malicious buffer overflow attacks. Legitimate attempts to change the return address of a function will also be prevented, of course, but this is bad programming anyway because it wrecks the prediction mechanism.

The mechanism of the on-chip stack can be hidden in special registers. The application does not need to access the stack pointer of the call stack, except for stack unwinding in the exception handler.

The cost of having two stacks is the complexity of saving the on-chip stack to memory when it overflows. This can be implemented in hardware or software. Memory management will also be a little more complex because there are two stacks that can overflow. The size of both stacks can be predicted by using the method explained in my document, except for recursive functions.

I will propose, tentatively, to allow both principles - one stack or two stacks - in the CRISC1 architecture. It does not matter to the software whether there are one or two stacks, except when function parameters are saved on the stack. A function needs to know the addresses of its parameters relative to the stack pointer, and this depends on whether there is a return address in between. It is rare that we will have parameters on the stack because the first 16 general purpose registers and the first 16 vector registers can be used for function parameters.

If we want the same software to be compatible with both one-stack and two-stack systems then we need to solve the problem of the address of parameters on the stack, however rare it may be. The simplest solution is to put an empty space on the data stack where the return address would be if we have a separate call stack. But I want to suggest a smarter solution: don't put parameters on the stack. If there are more parameters than registers then put the extra parameters in a list and use one register to point to this list. This solution is simple and efficient. We are getting rid of the old discussion of which order of parameters on the stack is better, and whether the stack should be cleaned up by caller or callee.
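In C terms, the proposed convention behaves as if the compiler rewrote the call like the sketch below. To keep it short, pretend only four parameters fit in registers; the struct is the compiler-generated list and all names are illustrative:

    /* A 6-parameter function becomes 4 register arguments plus a pointer
       to a list holding the remaining 2. */
    struct extras { int e, f; };

    int sum6(int a, int b, int c, int d, const struct extras *rest) {
        return a + b + c + d + rest->e + rest->f;
    }

    int caller(void) {
        struct extras rest = { 5, 6 };    /* the list may live anywhere */
        return sum6(1, 2, 3, 4, &rest);   /* the callee never addresses its
                                             parameters relative to the stack
                                             pointer, so it cannot care whether
                                             there are one or two stacks */
    }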

So this is my proposal. Small, simple systems can have one unified stack, and large systems where performance or security is important can have two stacks. The size of the on-chip call stack is implementation dependent. The calling convention is changed so that parameters are never saved on the stack. Application programs don't need to care whether there are one or two stacks, but the stack unwinding mechanism in the exception handler needs to know, of course.

   
Is it better to have two stacks?
Author: Hubert Lamontagne Date: 2016-06-07 16:46
Having a second stack in a small separate memory just for IPs does sound workable. It does have the advantage that CALL and RET aren't memory operations anymore, which means they take a single micro-op, and you get real CALL/RET opcodes instead of having to decompose them into jump-and-link+store LR and load LR+jump LR. So given the appropriate hardware implementation, it could reasonably translate into a small speed gain for code that does a lot of function calling (you're using one less data cache port and micro-op on every CALL and every RET).

This means that it could reasonably be worth the extra hardware complexity - I think it would probably need a write buffer and register-renamed SP so that it can be rolled back in case of a branch prediction failure/page fault/unpredicted reordered store to already loaded memory address/etc, so you have to factor this into the cost.

   
Is it better to have two stacks?
Author: Eden Segal Date: 2016-06-13 22:40
What about debugging? Putting stuff on the CPU is good for performance, but if the table is not open for viewing, there are 16 function calls you don't know about.
While you can say it will be worked around in debug mode, debugging an optimized application is still necessary, and it is really hard on the debugger as it is.
   
Is it better to have two stacks?
Author: Agner Date: 2016-06-13 23:36
Eden Segal wrote:
What about debugging?
No problem. There will be instructions for accessing the on-chip call stack. These instructions must be accessible to the debugger and the exception handler. They may or may not be accessible to the application in user mode.
   
Is it better to have two stacks?
Author: Hubert Lamontagne Date: 2016-06-14 21:28
The other reason I can think of to have a user-accessible stack is supporting coroutines.
   
Is it better to have two stacks?
Author: Agner Date: 2016-06-14 23:59
Hubert Lamontagne wrote:
The other reason I can think of to have a user-accessible stack is supporting coroutines.
If coroutines are implemented with a call stack, it would be filled up. Are you thinking about removing an entry from the call stack before calling another routine? The same functionality can be obtained with a tail call or a state machine.

The next version of my proposal will have a calling convention that makes tail calls possible in all cases. In the rare case that there are more than 16 integer parameters or 16 vector parameters to a function, the remaining parameters will not be stored on the stack but in a list pointed to by a register. The same method applies to a variable arguments list.

   
Is it better to have two stacks?
Author: Hubert Lamontagne Date: 2016-06-15 19:15
Agner wrote:
Hubert Lamontagne wrote:
The other reason I can think of to have a user-accessible stack is supporting coroutines.
If coroutines are implemented with a call stack, it would be filled up. Are you thinking about removing an entry from the call stack before calling another routine? The same functionality can be obtained with a tail call or a state machine.

The next version of my proposal will have a calling convention that makes tail calls possible in all cases. In the rare case that there are more than 16 integer parameters or 16 vector parameters to a function, the remaining parameters will not be stored on the stack but in a list pointed to by a register. The same method applies to a variable arguments list.

Well, within a high-level language engine you can completely bypass normal function calling, which lets you replace all execution flow with a state machine or do tail calls for functional programming, yes. I was thinking more of the context of something like C++ or ASM, where you're stuck with calling conventions and where coroutine-handling libraries often allocate a separate stack for each coroutine using malloc(), then swap the stack pointer around.

As for the cases where you have more than 16 integer or vector/float parameters, how do you allocate that list for extra parameters? Isn't the simplest way to do that to allocate it on the stack? Or would you call malloc()?

   
Is it better to have two stacks?
Author: Agner Date: 2016-06-15 23:43
If you want multiple stacks for coroutines, why not use multiple threads? Anyway, a coroutine switch would work much the same as a thread switch does in the operating system.

Hubert Lamontagne wrote:

As for the cases where you have more than 16 integer or vector/float parameters, how do you allocate that list for extra parameters? Isn't the simplest way to do that to allocate it on the stack? Or would you call malloc()?
The caller can put the list anywhere it wants. Possibly on the stack. My point is that the called function will not address its parameters relative to the stack pointer. We get rid of the old discussion about which order of parameters on the stack is more logical and who should clean up the stack.
   
Is it better to have two stacks?
Author: Hubert Lamontagne Date: 2016-06-16 14:22
Agner wrote:
If you want multiple stacks for coroutines, why not use multiple threads? Anyway, a coroutine switch would work much the same as a thread switch does in the operating system.
Yeah, the catch is that a thread switch goes through the thread scheduler. That's totally fine if you're doing something that doesn't need any tight timing (ex: server software), but it's a problem in very real time apps - I'm thinking of the case of trying to write a video game where every object gets a time slice on every frame, you end up with hundreds of threads and if the OS decides to put even just one of them in the freezer for a few dozen ms, you get frame stutter... Though TBH games aren't normally written this way anyways (plus it makes saving/restoring the game state hard).
   
Is it better to have two stacks?
Author: Agner Date: 2016-06-16 23:44
Hubert Lamontagne wrote:
the catch is that a thread switch goes through the thread scheduler. That's totally fine if you're doing something that doesn't need any tight timing (ex: server software), but it's a problem in very real time apps
You are right. We need to think about real time applications. Is there anything special we need to do to support real time operating systems? It would be nice to be able to reserve one or more CPU cores in a multicore processor for a particular critical task, but that is an operating system issue. The hardware just needs to support fast task switching. We can have extremely fast task switching if the CPU has multiple on-chip memory maps that it can switch between.
   
Is it better to have two stacks?
Author: Hubert Lamontagne Date: 2016-06-17 10:07
Agner wrote:
Hubert Lamontagne wrote:
the catch is that a thread switch goes through the thread scheduler. That's totally fine if you're doing something that doesn't need any tight timing (ex: server software), but it's a problem in very real time apps
You are right. We need to think about real time applications. Is there anything special we need to do to support real time operating systems? It would be nice to be able to reserve one or more CPU cores in a multicore processor for a particular critical task, but that is an operating system issue. The hardware just needs to support fast task switching. We can have extremely fast task switching if the CPU has multiple on-chip memory maps that it can switch between.
It's a bit of an application-specific thing.

For instance, I'm 99% sure MacOS has a special case in its scheduler for the audio thread, so that when the sound hardware sends a "the sound buffer is 50% empty, fill the other half" interrupt, the sound code runs straight off the interrupt and literally outprioritizes EVERYTHING ELSE (including hardware drivers!), completely disables priority inversion, completely bypasses any sort of fair time-slice allocator, and so forth. If the audio thread falls on a locked mutex, it might even elevate the thread holding that mutex to this crazy "screw everything else" priority - this is necessary to use mutexes, or things locked by mutexes like malloc(), in the audio thread; otherwise trying to lock the mutex might result in a task switch to a low-priority GUI thread which then gets preempted, and then you get an audio glitch.

In Windows, the scheduler is generally designed for networking, so you generally get screwed over: at the very least, you need to install a special app to get low latency audio (ASIO4ALL), you need to spend half an hour fine-tuning buffer length, and any badly written driver can make it completely impossible (for instance, there was an Intel network driver that would periodically lock everything for 5 ms). And even when it does work, all other applications get no sound, because you have to bypass the sound mixer, which runs at way too low a priority (and on way too large blocks). In Linux, afaik this sort of scheduling is possible (although it's not really meant for it), but it requires special flags in the thread scheduler priority and in the mutexes (and possibly a kernel patch); there's a video from Android OS engineers about getting a low latency audio path in Android using that (tl;dr version: it took some work and many versions!).

Another case I can think of is game consoles... if you have 8 cores, does it make any sense to run ANYTHING ELSE than the currently played game on 7 of those cores?

   
Is it better to have two stacks?
Author:  Date: 2016-06-22 11:40
With regards to exposing a second stack, it is worth considering a recent proposal from Intel termed Control-flow Enforcement Technology: https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf

The principle is simple: require that the first instruction at the target of an indirect jump or call is an ENDBRANCH instruction (which is a NOP on current systems). There are also a series of extensions to enable the shadow stack to be manipulated by debuggers and the like. It seems like a reasonable model of how someone would go about exposing and managing a second return call stack.

With regards to supporting two operating modes (with and without a second stack), my thoughts here are that it is simply not worth it. If your platform does not support some sort of hardware functionality to prevent these ROP-type attacks, then compilers will kludge around it on your behalf. Case in point are the stack-smashing protection technologies present in many compilers. Obviously, system developers feel that the impact on performance is more than offset by the resulting improvements in security.

   
Now on Github
Author: Agner Date: 2016-06-26 03:01
I have now moved this project to GitHub, as some of you recommended. This will facilitate the collective development of the software toolchain and hardware implementations.

The name CRISC was taken, so I have changed the name to ForwardCom. It stands for forward compatible computer system.

I have converted the manual to LaTeX so that GitHub can show diffs. The pdf manual is built from many tex files. The pdf manual, version 1.02, is here.

New in version 1.02:

  • Security features added.
  • Support for dual stack.
  • Some instructions and formats modified, including more formats for jump and call instructions.
  • System call, system return and trap instructions added.
  • New addressing mode for arrays with bounds checking.
  • Memory management and ABI standards described in more detail.
  • Instruction list in the comma-separated file instruction_list.csv, to be used by assemblers, emulators, debuggers, etc.
  • Object file format defined in the file elf_forwardcom.h.

This thread has become so long that I have started a new thread here.