Agner's CPU blog


Do we need instructions with two outputs?
Author: Hubert Lamontagne Date: 2016-04-02 14:13
Agner wrote:
Thank you for your comments. It is nice to have sparring partners to discuss with. We are really getting somewhere.
Ha yeah. "Sparring partner" is a great way to put it. :3

I think there are many advantages to storing the vector length in the vector register itself. This makes it much cheaper to save callee-save registers: You only have to save the part of the register that is actually used. I plan to make special instructions for save/restore. These instructions will save or restore the length and as much of the data as is actually used. The save instruction can increment a pointer by the actual amount of space used. The restore instruction will have to be followed by an extra instruction to increment or decrement a pointer by the amount of space used. The same instruction can be used to adjust the stack pointer before saving. This will save a lot of data cache. If a very long vector register actually contains only a floating point scalar then we need not save any more than that. If the vector register is unused, which will probably happen more than 50% of the time, then we only need to save the length, which is zero. Also, we don't need a complicated lazy save mechanism for task switches. And we get one less input dependence because we don't need a separate register for vector length.

One catch is that vector registers need special alignment. For instance,
if your vector regs are 512bits and your DCache width is 512bits, you
want your vectors to be 512bit aligned when saved to the stack, so you
need to 512-align your stack pointer. Also, as the vector register size
increases, the stack size allocated to applications has to grow because
an app that was tested and known to work with 128bit vector saves/restores
might cause stack overflows if vectors become 512bits!

In light of this, I'd suggest:
- Scalar floats should get their own 32/64bit registers. In some C++
programs (games, audio, etc) they see lots of use and are often passed
to/from functions, so they need to be easy to save/restore. Since they have
the same 4/8byte alignment as 32/64bit ints, this is very easy to do if
you mandate an 8byte aligned SP in the calling convention.

- In fact, I'd even suggest a second, different stack for vector register
saves/restore. This way, the CPU can keep the VSP (vector stack pointer)
in perfect alignment all the time without having to do dynamic realignment
on the regular SP, and it can have special instructions that adjust the VSP
the right way for each vector, and the OS can grow the vector stack to make
sure that a program that expects 128-bit vector alignment will never generate
stack overflows when alignment grows to 256-bit+ etc...

I see the whole sequence for saving multiple vectors going something like this:

SUB sp, 12                 ; room on the regular stack for 3 saved pad values
; special instruction that gets the number of empty bytes to adjust vsp so that
; the next vector save is aligned and can be done in a single DCache cycle
GETVECTORSTOREPREPAD r0, vsp, v0.size
ST r0, [sp + 0]            ; remember the pad so the restore sequence can undo it
ADD vsp, r0
ST v0, [vsp]
GETVECTORSTOREPREPAD r1, vsp, v1.size
ST r1, [sp + 4]
ADD vsp, r1
ST v1, [vsp]
GETVECTORSTOREPREPAD r2, vsp, v2.size
ST r2, [sp + 8]
ADD vsp, r2
ST v2, [vsp]

The whole sequence can probably be streamlined somewhat, but I
hope you get what I mean here.

The disadvantage is, of course, that the compiler needs to distinguish between saving the data of a vector and saving everything for a later complete restore. Once we have this distinction, however, there is little extra cost to also saving a few carry flags.

There's an extra cost, it's just in a different non-obvious place:
it forces the compiler to figure out if the carry bits are relevant
for each operation in a chain, and if the compiler can't figure it
out, it will output less efficient code. Whereas if carry flags are in their
own register, and only written/read by some operations (on ARM,
when SUB generates flags it is called SUBS and is a different
instruction), then the compiler only ever has to worry about carry
flags for instructions that expressly read/write them (ADDS / ADC /
ADCS / SUBS / SBC / SBCS / CMPS etc), and then it just
becomes one extra register pool in the compiler's register allocator.

The use of separate flag registers that only get written to by a limited
subset of operations is also good for CPUs, because then it needs
a much, much less aggressive unit for handling flag register writes/reads
since it only needs to handle 1 write/read per cycle, instead of
3-4+ per cycle like on x86.
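
To make that concrete, here's a minimal C sketch of the kind of
overflow check that maps onto a single flag-setting add. It uses the
GCC/Clang __builtin_add_overflow builtin; the exact instructions you
get are of course compiler- and target-dependent:

#include <stdbool.h>
#include <stdint.h>

/* Add two 32-bit ints and report overflow explicitly. On ARM this
   typically becomes a flag-setting ADDS plus one instruction that reads
   the overflow flag; the flags are dead immediately afterwards, so the
   register allocator only has to track them across those two
   instructions. */
bool add_checked(int32_t a, int32_t b, int32_t *sum)
{
    return __builtin_add_overflow(a, b, sum);
}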

The idea of having separate registers for system code only sounds good. Would eight 64-bit registers be a suitable number?

8 registers for general-purpose OS usage sounds good. They could probably
all be handled by a single instruction for reading/writing CPU control registers,
like CR0..CR15, DR0..DR15, MXCSR, MSW, GDTR, IDTR, LDTR, TR,
CS, SS, DS, ES, FS, GS, and the floating point control registers on x64.

Some system operations could also be implemented as reads/writes to system
registers: CPUID could be replaced by a bunch of read-only system registers
that give the CPU model for instance.

Having lots of potential system registers available would probably help with
limiting the need to add new system opcodes as new features are added.

Joe Duarte wrote:

Hubert and Agner, I have a related but slightly tangential question, since we're talking about TLBs.

In our applications, within a virtual memory space, why do we still use conventional -- and enormous -- memory addresses?

We use 64-bit addresses (or 48 actual bits, I think, on x64). Since it's a program's exclusive virtual memory space, a universe of our own, why can't we use arbitrary and much, much smaller addresses? Like 1, 2, 3, etc., basically a byte or two. If the TLB is going to have to translate anyway, is it a problem to translate small addresses into conventional physical ones? Processes could have unique IDs - one byte would be ample most of the time - so the TLB would have a unique identifier for all addresses (process ID + the process's internal small memory addresses), and it would often take only two or three bytes total.

You could do a variety of things with pages and their sizes in that scenario. What am I missing here? I usually stick to managed languages.

ARM and x86 handle this by having a 32-bit mode. :3

You could definitely argue for going for a 32-bit
architecture with a standard 64-bit extension like what
RISC-V does. This makes total sense for an architecture
that's going to be used in embedded devices.

Even MIPS falls into this category: ultimately, it was used in way
more PS2s, set-top boxes and so forth than in servers.
Because of this, its 32bit support was a good thing
(and you can even argue that the infamous branch-delay-slot
was actually GOOD for MIPS).

----

Anyhow, the issue here I guess is that if your app has pointers
smaller than 4-bytes, well:

- 3-byte pointers can't be aligned to cache lines. If you have an array
of pointers, eventually one of them has to straddle
2 cache lines, and that's a much larger penalty than some
wasted DCache due to using 4-byte pointers.

- 3 bytes is only a 16 MB address space, and few languages
want to be limited to something that small. You cannot start
with a 3-byte-pointer program and then dynamically upgrade
everything to 4 bytes if you ever run out of space. Might as
well make everything 4-byte from the start.

- 2 bytes is only 64 KB. Sure, you could have 2-byte pointers,
then have all sorts of data zone selectors to grow out of that:
this is exactly how 16-bit x86 works, and programs just totally
outgrew this. Even the late generation of DOS games were way
too large for this and did everything in 32 bits.

Anyhow, this is also why the page tables have 2 levels on 32-bit
x86: a lot of apps are only going to use a few hundred kilobytes
of RAM, so it makes sense to only have page tables for small
sections of RAM. In the top-level page table, all the unused 4 MB
blocks of pages are simply marked as invalid, and extra page
tables are allocated as the program's memory usage grows.
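
As a rough illustration, here is a sketch of that kind of two-level
lookup in C (the 10/10/12 address split is the classic non-PAE x86
layout; the entry format and flags are simplified):

#include <stdint.h>

#define PRESENT 0x1u

/* 10-bit directory index, 10-bit table index, 12-bit page offset.
   Each directory entry covers 4 MB, so a program touching only a few
   hundred KB needs the directory plus one or two page tables; every
   other directory entry just stays marked not-present. */
uint32_t translate(const uint32_t *page_dir, uint32_t vaddr)
{
    uint32_t dir_idx   = vaddr >> 22;            /* which 4 MB block      */
    uint32_t table_idx = (vaddr >> 12) & 0x3FFu; /* which 4 KB page in it */
    uint32_t offset    = vaddr & 0xFFFu;         /* byte within the page  */

    uint32_t pde = page_dir[dir_idx];
    if (!(pde & PRESENT))
        return 0; /* unused 4 MB block -> page fault */

    const uint32_t *page_table = (const uint32_t *)(uintptr_t)(pde & ~0xFFFu);
    uint32_t pte = page_table[table_idx];
    if (!(pte & PRESENT))
        return 0; /* unmapped page -> page fault */

    return (pte & ~0xFFFu) | offset; /* physical address */
}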

Separately, I think it would be fruitful to think about how an ISA could be designed to help JITs, and common parsing workloads like the web. If we designed an ISA to increase the performance of browser JS engines, .NET and Java VMs, etc., what would that look like? Would it be instruction selection that mattered most, or other aspects of the architecture? Intel rolled out some great instructions for string parsing in SSE 4.2, but my impression is that hardly any developers know about them or use them (I couldn't find them in the nginx or Apache source, or the browsers I checked, and they could really benefit). That raises a separate issue of getting developers to pay attention to ISAs and optimization wins...

Unfortunately, this kind of higher-level stuff typically has all
sorts of special cases and exceptions that make it
impossible to implement in fast circuits.

For instance, JS interpreters often have the trick of having
numbers as 64bit floats, and other types as an invalid
64bit float with a pointer in it. Well, now instead of just
adding two floats together like in C++, you first have to decide
whether one of your floats is actually a string and you have
to trigger string to double conversion.
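
With one common NaN-boxing layout (the tag value below is made up for
illustration - real engines differ in the details), every add has to
start with a type check along these lines:

#include <stdint.h>
#include <string.h>

typedef uint64_t jsval; /* either a real double or a tagged pointer */

#define TAG_MASK 0xFFFF000000000000ull
#define PTR_TAG  0xFFFC000000000000ull /* hypothetical pointer tag */

static int is_double(jsval v) { return (v & TAG_MASK) != PTR_TAG; }

static double as_double(jsval v)
{
    double d;
    memcpy(&d, &v, sizeof d); /* reinterpret the 64 bits as a double */
    return d;
}

double add_values(jsval a, jsval b)
{
    if (is_double(a) && is_double(b))
        return as_double(a) + as_double(b); /* fast path: plain float add */
    /* slow path: one operand is a string/object, so call its conversion
       code, which can have arbitrary side effects */
    return 0.0; /* placeholder for the slow path */
}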

Or worse even, in some languages it can be some type with
an overloaded conversion to double, and the conversion code
can call all sorts of functions generating all sorts of side effects
(like printing to the console), which means that the compiler
can't even change the order of two additions to make things
faster because it can introduce new bugs.

String parsing also falls into this kind of case. For instance,
checking the length of a string is not so obvious: often the
interpreter knows the number of bytes in a string, but some of
those bytes are parts of multi-byte UTF-8 characters, and if you
account for special cases (like ending up with Latin-1 / CP-1252
text), then there's often really no alternative to just looping
through the string byte by byte, in which case it's impossible to
be faster than x86 / ARM / MIPS even though those chips have no
special support for this.
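
The byte-per-byte loop in question looks something like this (a
minimal sketch that assumes valid UTF-8 and ignores the Latin-1
fallback entirely):

#include <stddef.h>
#include <stdint.h>

/* Count code points by skipping UTF-8 continuation bytes (bytes of the
   form 10xxxxxx); one load, one mask and one compare per byte. */
size_t utf8_length(const uint8_t *s, size_t nbytes)
{
    size_t count = 0;
    for (size_t i = 0; i < nbytes; i++) {
        if ((s[i] & 0xC0) != 0x80) /* not a continuation byte */
            count++;
    }
    return count;
}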

This is also why so few higher-level languages support multiple
threads / multiprocessing: there are just so many weird states that
the interpreter can be in, and there are just so many weird possible
side-effects, that you can't let a second core go through that data -
it would just create all sorts of nasty race conditions and so forth.

-----

Java and .NET are in a lesser class of badness, in that a Java
float is just a float and you can add it normally, and you can easily
run Java multi-threaded (and it has a nice way of adding mutexes
to objects). But then again, Java benefits way less from special
architectures - if your Java program can't run fast on x86, then it
can't run fast on anything!

The real cost of Java is the Garbage Collector, which makes it
impossible to avoid garbage-collection pauses, which are often
longer than 100ms, or even occasionally longer than 300ms.
This is perfectly fine for server software (and Java is an enterprise
language so it fits this perfectly), but it makes Java unsuitable for
software that can't really have pauses like games and heavy-duty
user-facing GUI software (Photoshop, C++ IDEs, Word, iPhone
apps etc). This cannot be solved by processor design.


Agner wrote:

If we don't need extra carry bits for high precision arithmetic, then the only need for these extra bits is for detecting and propagating integer overflow.

If we don't have the extra carry flags, then we will need special instructions for checking if an addition, multiplication, etc. overflows. After each addition or other arithmetic instruction, issue another instruction with the same inputs just to check if the operation overflows. The outputs of all the overflow checks for a series of calculations should then be OR'ed together. The total number of instructions will be three times the number of instructions needed without overflow check.

So what are the costs and benefits? Without the extra flag bits we need three times as many instructions if we want to check for integer overflow. We can avoid overflow in many cases by using 64-bit integers, but even 64-bit integers can easily overflow if the calculation has many multiplications. Compiler support for checking integer overflow is rare, unfortunately, so any mechanism for detecting integer overflow will perhaps not be used much. On the other hand, if an efficient method were available, we would probably see compilers that support it and programmers that use it. In fact, many programmers are frustrated over how difficult it is to detect signed integer overflow. Traps for integer overflow exceptions are not good for vector code. One solution is to use floating point calculations instead of integer, but I don't see that solution used much. Floating point calculations have longer latencies, of course. No solution seems to be really good here.
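
In portable C, the pattern described above (each add shadowed by an
explicit check, with the check results OR'ed together) comes out
roughly like this - just a sketch of the extra work, with the
proposed ISA doing each check in a single instruction:

#include <stdint.h>

/* Sum three values and report whether any of the signed adds
   overflowed. Each add needs a companion check, and the check results
   are OR'ed together, roughly tripling the instruction count. */
int32_t sum3_checked(int32_t a, int32_t b, int32_t c, int *overflowed)
{
    int32_t t = (int32_t)((uint32_t)a + (uint32_t)b);  /* add #1   */
    int ovf   = ((a ^ t) & (b ^ t)) < 0;               /* check #1 */
    int32_t r = (int32_t)((uint32_t)t + (uint32_t)c);  /* add #2   */
    ovf      |= ((t ^ r) & (c ^ r)) < 0;               /* check #2 */
    *overflowed = ovf;
    return r;
}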

Dunno, I write C++ sound code, and for code with lots
of multiplications that could overflow, I use floating point
code _all the time_. :3

Floating point does have higher latency, but that's true of
any code that uses lots of multiplication. Float multiplication
also has the really nice property that you need a lot fewer
shifts and clamps than fixed-point code, so for algos with tons
of multiplications (say, FFT for instance), floating-point
is really really useful.

If I need to do multiplications in integer code (for instance,
in an FM synthesizer - can't use float because converting
float to int to use as array indexes is slow), then what I
generally do is make sure my multiply operands are
essentially 16-bit (or, say, one operand is 24-bit and the
other is 8-bit), and carefully use lots of >> bitshifts to keep
everything in range. For sound generation, this is generally
fine because the human ear isn't that 'precise' anyway
(in terms of bits).
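
A minimal sketch of that kind of fixed-point multiply (the widths and
the shift amount are just an example, not taken from any particular
synth):

#include <stdint.h>

/* Scale a ~16-bit sample by a Q15 gain (0..32767 covering roughly
   0..1.0). The >> 15 brings the product back into range; the bits it
   discards are below what the ear resolves anyway. */
static inline int32_t scale_sample(int32_t sample, int32_t gain_q15)
{
    return (sample * gain_q15) >> 15;
}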

Another special case was writing ARM NEON code, which had
lackluster floating-point and only a 16x16 multiplier in the hardware
(32x32 multiplies had a significant penalty). So I used a lot of a
particular opcode called VQDMULH.s16 - a 16x16->16 vector
signed doubling multiply that keeps the top part of the result and
clamps the special case of -32768 * -32768 to 32767 - equivalent to this:

int32_t res = (int32_t)(int16_t)a * (int16_t)b; // 16x16 -> 32-bit signed product
res = res >> 15;                                // doubling multiply, keep the high part
if (res == 32768)
    res = 32767; // clamp the only overflowing case: -32768 * -32768
out = (int16_t)res;

----

That being said, I have had to use 32x32->64 wide multiplication
in code 2 or 3 times. But I think it's a special case, compared to
the 32bit float multiplication, which I use all over the place!

 