Author: Hubert Lamontagne
Date: 2016-04-04 21:01
Joe Duarte wrote:
Question: Does an ISA really need to specify the number of architectural registers? What would the implications be of not doing so, and having an infinite number of architectural registers like LLVM's IR? It seems like the number of registers is a fiction anyway (I was stunned to discover that x86-64 processors from Intel and AMD have nearly 200 physical registers.) This would make the number of registers implementation-dependent, rather than part of the ISA specification. See Vikram Adve's recent talk at Microsoft: research.microsoft.com/apps/video/default.aspx?id=249344 (The Microsoft Research people must have been having a bad day or something – their questions and comments reveal that they thoroughly misunderstood his ideas.)
This has a cost:
- You can make instructions variable-sized to accommodate different numbers of registers, but this increases the branch mispredict penalty and makes it hard to run multiple instructions at the same time.
- What if your CPU has 32 registers and the program uses the 33rd? You can spill values to memory, but then the CPU has to figure out that the spill doesn't conflict with any other memory reads/writes.
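- More registers = instructions become larger. With 1024 registers instead of 32, each register field grows from 5 to 10 bits, so a MIPS instruction with two register fields would take 42 bits instead of 32 bits (47 bits with three register fields).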
- Larger register files are slower, which reduces clock rate or causes more stalls due to more latency cycles required to get register values.
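To make the instruction-size point concrete, here is a rough sketch of the arithmetic. It assumes the classic 32-bit MIPS encoding with 5-bit register fields and assumes every other field keeps its width; the function names are made up for illustration.

```python
def reg_field_bits(num_regs):
    """Bits needed to name one of num_regs registers."""
    return (num_regs - 1).bit_length()


def widened_instruction_bits(base_bits, reg_fields, old_regs, new_regs):
    """Instruction width after growing every register field.

    Assumes all non-register fields (opcode, immediate, ...) stay the same.
    """
    extra_per_field = reg_field_bits(new_regs) - reg_field_bits(old_regs)
    return base_bits + reg_fields * extra_per_field


# Two register fields (e.g. an immediate-style format): 32 + 2 * 5 bits.
two_field = widened_instruction_bits(32, 2, 32, 1024)
# Three register fields (e.g. a register-register format): 32 + 3 * 5 bits.
three_field = widened_instruction_bits(32, 3, 32, 1024)
print(two_field, three_field)
```

Either way the instruction no longer fits in a 32-bit word, which is the real cost here.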
Question 2: Let's assume that we have registers R0 - R31. Might it be useful to also have an unspecified register – call it Rx – that basically tells the CPU "give me whatever register you have – I don't care which one". I can imagine some scenarios where this might be useful for a compiler. And it seems to fit with the reality of register renaming anyway.
Okay, you can write a result to "Rx", but how do you find which register that "Rx" is once you want to read back your result and do something with it? What if you write to multiple "Rx"es, how do you keep track of what went where?
------------------------------------
Agner wrote:
Yes, this is a complication. I want to
handle it in software without any complex
instructions. If the size of the saved register image
is not guaranteed to be a multiple of the stack word
size then I would first calculate the amount of space
needed for all the vector registers I want to save,
then save the stack pointer to another register, then
subtract the necessary size from the stack pointer,
then align the stack by 8 ( AND SP,-8 ), then use a
temporary register as pointer to the save area, and
then save the registers, incrementing the pointer each
time. The restore process is easier.
Oh, I see. You'd use some kind of massive "vector store/load (including size prefix byte)" instruction that's basically never aligned to save all the vectors, then reestablish stack alignment. And for the C++ ABI, you'd force all caller functions to save all SIMD vectors, floats and doubles, and use the caller's knowledge of what's in the registers to do a simpler aligned non-variable-sized save (instead of a massive unaligned variable-sized save)... On paper it works, but for some reason I find that rather scary... :3
For instance, if you have a function working on a bunch of scalar floats, and then it calls some sub-function (say, "sqrt()" or something like that), won't it have to spill every single float register that it's working on onto the stack (potentially on every iteration of a loop)?
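Going back to the save sequence in the quote above, here is how I read the steps, written as plain arithmetic on a made-up stack pointer. This is only a sketch: the register image sizes (13, 32 and 7 bytes) and the starting address are invented, and the function name is hypothetical.

```python
def plan_vector_save(sp, image_sizes, align=8):
    """Follow the quoted steps and return (old_sp, new_sp, slots).

    slots is a list of (address, size) pairs, one per saved register image.
    """
    total = sum(image_sizes)      # 1. space needed for all saved images
    old_sp = sp                   # 2. keep the old stack pointer in another register
    sp -= total                   # 3. subtract the size from the stack pointer
    sp &= -align                  # 4. realign, i.e. AND SP,-8
    ptr = sp                      # 5. temporary register points at the save area
    slots = []
    for size in image_sizes:      # 6. save each register, bumping the pointer
        slots.append((ptr, size))
        ptr += size
    return old_sp, sp, slots


old_sp, new_sp, slots = plan_vector_save(0x7FFFFFF0, [13, 32, 7])
print(hex(new_sp), [(hex(addr), size) for addr, size in slots])
```

Only the first image is guaranteed to start on an aligned address; the ones after it land wherever the previous size leaves them, which is the unaligned part I was worried about.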
Generally, 8-bit and 16-bit vector multiplications are
provided in SIMD instruction sets to do stuff like
movie decoding and software rendering.
x86 does not have a complete set of
vector multiply instructions. NEON has 8, 16, and 32
bits. I don't see what you would need 8-bit vector
multiply for?
I'm thinking of the case of 32bpp RGBA bilinear interpolation/texture mapping/rotozoom, along with alpha blending, for the cases where you can't use OpenGL (for instance in Macromedia Flash). That's a notorious CPU pipeline buster, because the algo eats up a ton of small multiplications.
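To show where the small multiplications come from, here is a scalar sketch of one bilinear sample plus an alpha blend. It assumes 8-bit channels, 8-bit fractional weights and a cheap shift-by-8 approximation instead of an exact divide by 255; the function names and the choice to keep the destination alpha are mine, not from any particular renderer.

```python
def lerp8(a, b, f):
    """Blend two 8-bit channel values with fraction f in 0..255.

    One small multiply: a signed 9-bit difference times an 8-bit weight.
    """
    return a + (((b - a) * f) >> 8)


def bilinear_rgba(p00, p10, p01, p11, fx, fy):
    """Sample between four RGBA pixels: 3 multiplies per channel, 12 per sample."""
    out = []
    for c in range(4):
        top = lerp8(p00[c], p10[c], fx)
        bottom = lerp8(p01[c], p11[c], fx)
        out.append(lerp8(top, bottom, fy))
    return tuple(out)


def alpha_blend(src, dst):
    """Blend src over dst using src alpha: 3 more multiplies (RGB only)."""
    a = src[3]
    return tuple(lerp8(dst[c], src[c], a) for c in range(3)) + (dst[3],)


texel = bilinear_rgba((0, 0, 0, 255), (200, 100, 50, 255),
                      (0, 0, 0, 255), (200, 100, 50, 255), 128, 64)
print(alpha_blend(texel, (10, 20, 30, 255)))
```

That is roughly 15 narrow multiplies per output pixel in this sketch, all on values that fit in 8 or 9 bits, so packed 8-bit or 16-bit vector multiplies map onto it directly.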