Agner wrote:
Thank you for your comments. It is nice to have
sparring partners to discuss with. We are really
getting somewhere.
Ha yeah. "Sparring partner" is a great way to put it. :3
I think there are many advantages
to storing the vector length in the vector register
itself. This makes it much cheaper to save callee-save
registers: You only have to save the part of the
register that is actually used. I plan to make special
instructions for save/restore. These instructions will
save or restore the length and as much of the data as
is actually used. The save instruction can increment a
pointer by the actual amount of space used. The
restore instruction will have to be followed by an
extra instruction to increment or decrement a pointer
by the amount of space used. The same instruction can
be used to adjust the stack pointer before saving.
This will save a lot of data cache. If a very long
vector register actually contains only a floating
point scalar then we need not save any more than that.
If the vector register is unused, which will probably
happen more than 50% of the time, then we only need to
save the length, which is zero. Also, we don't need a
complicated lazy save mechanism for task switches. And
we get one less input dependence because we don't need
a separate register for vector length.
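The save/restore scheme Agner describes can be sketched in C. This is a hypothetical software model, not any real ISA: all names (`vreg_t`, `vreg_save`, `vreg_restore`) are invented for illustration, and the proposed instructions would do this in hardware.

```c
#include <stdint.h>
#include <string.h>

/* Model of a vector register that carries its own length, so a save
 * only touches the bytes actually in use (plus the length field). */
typedef struct {
    uint64_t len;       /* number of data bytes actually in use */
    uint8_t  data[64];  /* up to 512 bits of payload */
} vreg_t;

/* Save: write the length and only the used bytes, then return the
 * advanced pointer, mirroring how the proposed save instruction
 * could increment a pointer by the actual space used. */
uint8_t *vreg_save(uint8_t *p, const vreg_t *v) {
    memcpy(p, &v->len, sizeof v->len);
    memcpy(p + sizeof v->len, v->data, v->len);
    return p + sizeof v->len + v->len;
}

/* Restore: read the length first, then only that many data bytes.
 * An unused register (len == 0) costs just the 8-byte length field. */
uint8_t *vreg_restore(uint8_t *p, vreg_t *v) {
    memcpy(&v->len, p, sizeof v->len);
    memcpy(v->data, p + sizeof v->len, v->len);
    return p + sizeof v->len + v->len;
}
```

Note how a 512-bit register holding only a scalar double costs 16 bytes to save instead of 64, which is the data-cache saving the post is getting at.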
One catch is that vector registers need special alignment. For instance,
if your vector regs are 512bits and your DCache width is 512bits, you
want your vectors to be 512bit aligned when saved to the stack, so you
need to 512-align your stack pointer. Also, as the vector register size
increases, the stack size allocated to applications has to grow because
an app that was tested and known to work with 128bit vector saves/restores
might cause stack overflows if vectors become 512 bits!

In light of this, I'd suggest:
- Scalar floats should get their own 32/64bit registers. In some C++
programs (games, audio, etc) they see lots of use and are often passed
to/from functions, so they need to be easy to save/restore. Since they have
the same 4/8byte alignment as 32/64bit ints, this is very easy to do if
you mandate an 8-byte aligned SP in the calling convention.
- In fact, I'd even suggest a second, different stack for vector register
saves/restore. This way, the CPU can keep the VSP (vector stack pointer)
in perfect alignment all the time without having to do dynamic realignment
on the regular SP, and it can have special instructions that adjust the VSP
the right way for each vector, and the OS can grow the vector stack to make
sure that a program that expects 128-bit vector alignment will never generate
stack overflows when alignment grows to 256-bit+ etc...

I see the whole sequence for saving multiple vectors going something like this:

    SUB sp, 12
    ; special instruction that gets the number of empty bytes to adjust vsp so that
    ; the next vector save is aligned and can be done in a single DCache cycle
    GETVECTORSTOREPREPAD r0, vsp, v0.size
    ST r0, [sp + 0]
    ADD vsp, r0
    ST v0, [vsp]
    GETVECTORSTOREPREPAD r1, vsp, v1.size
    ST r1, [sp + 4]
    ADD vsp, r1
    ST v1, [vsp]
    GETVECTORSTOREPREPAD r2, vsp, v2.size
    ST r2, [sp + 8]
    ADD vsp, r2
    ST v2, [vsp]

The whole sequence can probably be streamlined somewhat but I
hope you get what I mean here.
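The padding that a GETVECTORSTOREPREPAD-style instruction would return can be modeled in C. This is a sketch of the poster's idea, assuming power-of-two vector sizes and an assumed 64-byte cache line; the function name and semantics are illustrative only.

```c
#include <stdint.h>

#define DCACHE_LINE 64u  /* assumed cache line size in bytes */

/* Number of padding bytes to add to the vector stack pointer so the
 * next save is aligned to min(vector size, cache line) and can be
 * done in a single DCache access. Sizes are assumed powers of two. */
uint64_t vector_store_prepad(uint64_t vsp, uint64_t vec_bytes) {
    uint64_t align = vec_bytes < DCACHE_LINE ? vec_bytes : DCACHE_LINE;
    return (align - (vsp & (align - 1))) & (align - 1);
}
```

For example, a 16-byte vector saved when vsp is 3 bytes past a 16-byte boundary would need 13 bytes of pre-pad, and a save at an already-aligned vsp needs none.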
The disadvantage is, of course, that the compiler needs to
distinguish between saving the data of a vector and
saving everything for a later complete restore. Once
we have this distinction, however, there is little
extra cost to also saving a few carry flags.
There's an extra cost, it's just in a different, non-obvious place:
it forces the compiler to figure out if the carry bits are relevant
for each operation in a chain, and if the compiler can't figure it
out it will output less efficient code. Whereas if carry flags are in their
own register, and only written/read by some operations (on ARM,
when SUB generates flags it is called SUBS and is a different
instruction), then the compiler only ever has to worry about carry
flags for instructions that expressly read/write them (ADDS / ADC /
ADCS / SUBS / SBC / SBCS / CMPS etc), and then it just
becomes one extra register pool in the compiler's register allocator.

The use of separate flag registers that only get written to by a limited
subset of operations is also good for CPUs, because then it needs
a much, much less aggressive unit for handling flag register writes/reads
since it only needs to handle 1 write/read per cycle, instead of like
3/4+ like on x86.

The idea of having
separate registers for system code only sounds good.
Would eight 64-bit registers be a suitable number?
8 registers for general purpose OS usage sounds good. It can probably
be all combined in an instruction for reading/writing CPU control registers,
like the CR0..CR15, DR0..DR15, MXCSR, MSW, GDTR, IDTR, LDTR, TR,
CS, SS, DS, ES, FS, GS, and floating point control registers on x64.

Some system operations could also be implemented as reads/writes to system
registers: CPUID could be replaced by a bunch of read-only system registers
that give the CPU model, for instance.

Having lots of potential system registers available would probably help with
limiting the need to add new system opcodes as new features are added.

Joe Duarte wrote:
Hubert and Agner, I have a related but slightly
tangential question, since we're talking about TLBs.
In our applications, within a virtual memory space,
why do we still use conventional -- and enormous --
memory addresses? We use 64-bit (or 48 actual bits, I
think, on x64). Since it's a program's exclusive
virtual memory space, a universe of our own, why can't
we use arbitrary and much, much smaller addresses?
Like 1, 2, 3, etc., basically a byte or two. If the
TLB is going to have to translate anyway, is it a
problem to translate small addresses into conventional
physical ones? Processes could have unique IDs - one
byte would be ample most of the time - so the TLB
would have a unique identifier for all addresses
(process ID + the process' internal small memory
addresses), and it would often take only two or three
bytes total. You could do a variety of things with
pages and their sizes in that scenario. What am I
missing here? I usually stick to managed
languages.

ARM and x86 handle this by having a 32-bit mode. :3

You could definitely argue for going for a 32-bit
architecture with a standard 64-bit extension like what
RISC-V does. This makes total sense for an architecture
that's going to be used in embedded devices.

Even MIPS falls in this case: ultimately, it was used in way
more PS2's, set-top boxes and so forth than in servers.
Because of this, its 32bit support was a good thing
(and you can even argue that the infamous branch-delay-slot
was actually GOOD for MIPS).

----

Anyhow, the issue here I guess is that if your app has pointers
smaller than 4 bytes, well:

- 3 bytes can't be aligned to cache lines. If you have an array
of pointers, eventually one of the pointers has to straddle
2 cache lines and that's a much larger penalty than some
wasted DCache due to using 4-byte pointers.

- 3 bytes is only a 16 MB address space, and few languages
want to be limited to something that small. You cannot start
with a 3-byte pointer program and then dynamically upgrade
everything to 4 bytes if you ever run out of space. Might as
well make everything 4 bytes from the start.

- 2 bytes is only 64 KB. Sure, you could have 2-byte pointers,
then have all sorts of data zone selectors to grow out of that:
this is exactly how 16-bit x86 works, and programs just totally
outgrew this. Even the late generation of DOS games are way
too large for this and do everything in 32 bits.

Anyhow, this is also why the page table has 2 levels on 32-bit x86: a lot
of apps are only going to use a few hundred kilobytes of ram,
so it makes sense to only have page tables for small sections
of ram. On the top-level page table, all the unused 4 MB
blocks-of-pages are simply marked as invalid, and extra page
tables are allocated as the program's memory usage grows.

Joe Duarte wrote:
Separately, I think it would be fruitful
to think about how an ISA could be designed to help
JITs, and common parsing workloads like the web. If we
designed an ISA to increase the performance of browser
JS engines, .NET and Java VMs, etc., what would that
look like? Would it be instruction selection that
mattered most, or other aspects of the architecture?
Intel rolled out some great instructions for string
parsing in SSE 4.2, but my impression is that hardly
any developers know about them or use them (I couldn't
find them in nginx or Apache source, or browsers I
checked, and they could really benefit.) That raises a
separate issue in getting developers to pay attention
to ISAs and optimization wins...
Unfortunately, this kind of higher-level stuff typically has all
sorts of special cases and exceptions that make it
impossible to implement in fast circuits.

For instance, JS interpreters often have the trick of having
numbers as 64bit floats, and other types as an invalid
64bit float with a pointer in it. Well, now instead of just
adding two floats together like in C++, you first have to decide
whether one of your floats is actually a string and you have
to trigger a string-to-double conversion.

Or worse even, in some languages it can be some type with
an overloaded conversion to double, and the conversion code
can call all sorts of functions generating all sorts of side effects
(like printing to the console), which means that the compiler
can't even change the order of two additions to make things
faster because it can introduce new bugs.

String parsing also falls in this kind of case. For instance,
checking the length of a string is not so obvious: often the
interpreter knows the number of bytes of a string, but some of
those bytes are UTF-8 characters, and if you account for special
cases (like ending up with Latin-1 aka CP-1252 text), then there's
often really no alternative to just looping through the string byte
per byte, in which case it's impossible to be faster than x86 / ARM /
MIPS even though those chips have no special support for this.

This is also why so few higher-level languages support multiple
threads / multi-processor : there's just so many weird states that
the interpreter can be in, and there are just so many weird possible
side-effects, that you can't let a second core go through that data -
it would just create all sorts of nasty race conditions and so forth.

-----

Java and .NET are in a lesser class of badness, in that a Java
float is just a float and you can add it normally, and you can easily
run Java multi-threaded (and it has a nice way of adding mutexes
to objects). But then again, Java benefits way less from special
architectures - if your Java program can't run fast on x86, then it
can't run fast on anything!

The real cost of Java is the Garbage Collector, which makes it
impossible to avoid garbage-collection pauses, which are often
longer than 100ms, or even occasionally longer than 300ms.
This is perfectly fine for server software (and Java is an enterprise
language so it fits this perfectly), but it makes Java unsuitable for
software that can't really have pauses like games and heavy-duty
user-facing GUI software (Photoshop, C++ IDEs, Word, iPhone
apps etc). This cannot be solved by processor design.
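The "numbers as 64-bit floats, other types as an invalid 64-bit float with a pointer in it" trick mentioned earlier is usually called NaN-boxing. A minimal sketch in C, with invented tag values (real engines such as SpiderMonkey or JavaScriptCore differ in detail, and this assumes pointers fit in 48 bits):

```c
#include <stdint.h>
#include <string.h>

/* High 16 bits of a quiet NaN are used as a type tag; any value whose
 * top bits match TAG_PTR is treated as a boxed pointer, not a number. */
#define TAG_MASK 0xFFFF000000000000ull
#define TAG_PTR  0xFFFC000000000000ull

typedef uint64_t jsval;

jsval  box_double(double d)  { jsval v; memcpy(&v, &d, 8); return v; }
double unbox_double(jsval v) { double d; memcpy(&d, &v, 8); return d; }

int is_double(jsval v) { return (v & TAG_MASK) != TAG_PTR; }

jsval box_ptr(void *p)   { return TAG_PTR | (uint64_t)(uintptr_t)p; }
void *unbox_ptr(jsval v) { return (void *)(uintptr_t)(v & ~TAG_MASK); }

/* JS '+': even the fast path pays for tag checks on every operation,
 * which is exactly the overhead the text describes. */
double js_add(jsval a, jsval b) {
    if (is_double(a) && is_double(b))
        return unbox_double(a) + unbox_double(b);
    return 0.0 / 0.0; /* placeholder: real engines take a slow path here
                         (string concat, valueOf() calls, side effects) */
}
```

The branch in `js_add` is the hardware-unfriendly part: an adder can't know whether it is adding two doubles or should be calling arbitrary conversion code.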
Agner wrote:
If we don't need extra carry bits for high precision
arithmetics, then the only need for these extra bits
is for detecting and propagating integer overflow.
If we don't have the extra carry flags, then we
will need special instructions for checking if an
addition, multiplication, etc. overflows. After each
addition or other arithmetic instruction, issue
another instruction with the same inputs just to check
if the operation overflows. The outputs of all the
overflow checks for a series of calculations should
then be OR'ed together. The total number of
instructions will be three times the number of
instructions needed without overflow check.
So what are the costs and benefits? Without the extra
flag bits we need three times as many instructions if
we want to check for integer overflow. We can avoid
overflow in many cases by using 64-bit integers, but
even 64-bit integers can easily overflow if the
calculation has many multiplications. Compiler support
for checking integer overflow is rare, unfortunately,
so any mechanism for detecting integer overflow will
perhaps not be used much. On the other hand, if an
efficient method was available, we would probably see
compilers that support it and programmers that use it.
In fact, many programmers are frustrated over how
difficult it is to detect signed integer overflow.
Traps for integer overflow exceptions are not good for
vector code. One solution is to use floating point
calculations instead of integer, but I don't see that
solution used much. Floating point calculations have
longer latencies, of course. No solution seems to be
really good here.
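The check-per-operation pattern Agner describes, with the results of the checks OR'ed together, can be written today with compiler intrinsics. A sketch using GCC/Clang's `__builtin_add_overflow` (the function name is illustrative; the intrinsic is real but compiler-specific):

```c
#include <stdint.h>

/* Checked a + b + c: each addition gets a companion overflow check,
 * and the checks are OR'ed together, roughly the "three times as
 * many instructions" pattern from the text. */
int sum3_checked(int64_t a, int64_t b, int64_t c, int64_t *out) {
    int64_t t;
    int ovf = 0;
    ovf |= __builtin_add_overflow(a, b, &t);   /* add + check */
    ovf |= __builtin_add_overflow(t, c, out);  /* add + check */
    return ovf;  /* nonzero if any step overflowed */
}
```

With dedicated sticky overflow flags in the ISA, the two checks and the OR would disappear into the additions themselves.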
Dunno, I write C++ sound code, and for code with lots
of multiplications that could overflow, I use floating point
code _all the time_. :3

Floating point does have higher latency, but that's true of
any code that uses lots of multiplication. Float multiplication
also has the really nice property that you need a lot less
shifts and clamps than fixed point code, so for algos with tons
of multiplications (say, FFT for instance), floating-point
is really really useful. If I need to do multiplications in integer code (for instance,
in a FM synthesizer - can't use float because converting
float to int to use as array indexes is slow), then what I
generally do is that I just make sure my * operands are
essentially 16-bits (or, say, one operand is 24-bits and the
other is 8-bits), and carefully use lots of >> bitshifts to keep
everything in range. For sound generation, this is generally
fine because the human ear isn't that 'precise' anyways
(in terms of bits).

Another special case was writing ARM NEON code, which had
lackluster floating-point and only a 16x16 multiplier in the hardware
(32x32 multiplies had a significant penalty). So I used a lot of a
particular opcode called VQDMULH.s16 - 16x16->16 vector
signed doubling multiply keeping the top part of the result and
clamping the special case of -32768 * -32768 to 32767 - equivalent to this:

    int32_t res = (int32_t)a * (int32_t)b;  // a, b are int16_t
    res = res >> 15;
    if (res == 32768)
        res = 32767;  // clamp the particular case of -32768 x -32768
    out = (int16_t)res;

----

That being said, I have had to use 32x32->64 wide multiplication
in code 2 or 3 times. But I think it's a special case, compared to
the 32bit float multiplication, which I use all over the place! |
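The fixed-point recipe described above (keep multiplier operands around 16 bits and use right shifts to stay in range) boils down to something like a Q15 multiply. A minimal sketch, with an invented function name; real audio code picks the shift amounts per algorithm:

```c
#include <stdint.h>

/* Q15 fixed-point multiply: both operands are fractions in [-1, 1)
 * scaled by 32768, so the 32-bit product is shifted right by 15 to
 * get back to Q15. Note the -32768 * -32768 corner case wraps here;
 * VQDMULH is exactly this operation plus saturation of that case. */
static inline int16_t q15_mul(int16_t a, int16_t b) {
    return (int16_t)(((int32_t)a * b) >> 15);
}
```

For example, 0.5 * 0.5 in Q15 is `q15_mul(16384, 16384)`, which yields 8192, i.e. 0.25.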