Author: Hubert Lamontagne
Date: 2016-02-23 14:47
Add with carry:
SIMD operations can run with higher latency than integer instructions (which have to complete in 1 cycle or else they tend to bottleneck everything else). For instance, VADD has a 3~4 cycle latency on ARM. Bignum processing tends to involve other long-latency operations such as large 64x64->128 multiplications, so you could live with a 3~4+ cycle vector ADC as well. ADC could be a 4-input instruction: operand_a, operand_b, operand_a_of_previous_computation, output_of_previous_computation (instructions with lots of inputs are relatively common in SIMD instruction sets). This can even be chained, and the CPU's register renaming engine can take care of adc-to-adc dependencies.
Among 'modern' architectures (which I'd define as 'architectures that have at least 1 fast out-of-order implementation'): MIPS doesn't have flags at all; DEC Alpha doesn't have flags at all; PA-RISC has a couple of bits in the processor status word, but conditional branches don't use them (a carry flag strictly for adc/sbc/add*/sub*, a multi-step division flag, and a nullify flag that skips over the next instruction); ARM has flags, but only some instructions set them (CMP, SUBS, arm32 instructions with the 'S' bit) and it doesn't have partial flag updates; x86 infamously has partial flag updates on every ALU instruction (which means it needs multiple aggressive rename units); POWER has an 8-field condition register, with ALU ops optionally updating field 0 and CMP updating a selected field (in addition to a count register); Itanium has 64 single-bit predicate registers (supposedly the one thing that prevented an Intel team from making an out-of-order Itanium!). So I guess it's a bit of a wash, but I don't think flag registers make CPUs faster (Alpha didn't need flags to be fast!).
Regarding 16-bit instruction size:
Agreed, multiple instruction sizes are bad unless you have no choice. I'd still argue for a single 4-byte instruction format: lots of fast architectures use it (Alpha, MIPS, PA-RISC, Power, ARM64), and instructions with large immediates are rare and generally easy to split into multiple instructions. Adding 8-byte and 12-byte instructions doesn't sound like a large increase in complexity, but it is. It means instructions can span more than one cache line (= you need a prefetch buffer = your pipeline becomes at least 1 or 2 cycles longer). The second instruction of an issue group can be located in multiple different positions, which means you need more multiplexers (+0, +4, +8 bytes), and this problem grows for every successive instruction (the 4th instruction can be at +0, +4, +8, +12, +16, +20, +24). It also adds pipeline stall checks for cases where there are simply too many large instructions and the icache can't keep up.
Separate Register Files:
Then it's probably best to have an integer register file, a floating point register file, and a vector register file, yes.
Exceptions:
I still think that's spending an awful lot of silicon in the parts of the CPU that are most sensitive to timing, for something that I don't think is going to see any use: it isn't even in C++ aside from intrinsics, it prevents the compiler from reordering SSA (it makes + non-associative!), and it can be simulated with a couple of extra MIPS-style ops.
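To make the "couple of extra MIPS-style ops" concrete, here is a rough sketch in C, assuming the exceptions in question are signed integer overflow traps. The function name add_overflows and its signature are made up for illustration; the comments show roughly which flagless instructions each line would map to, which is my guess and not the output of any particular compiler.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: signed 32-bit add that reports overflow
   without a flags register and without a trapping add.
   The sum is done in unsigned arithmetic so wraparound is well defined;
   the wrapped two's-complement bit pattern is returned via wrapped_sum. */
static bool add_overflows(int32_t a, int32_t b, uint32_t *wrapped_sum)
{
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;
    uint32_t usum = ua + ub;                    /* addu */

    /* Overflow happened iff the sum's sign bit differs from the sign
       bit of BOTH operands (same-sign inputs, opposite-sign result). */
    uint32_t ovf = (ua ^ usum) & (ub ^ usum);   /* xor, xor, and */

    *wrapped_sum = usum;
    return (ovf >> 31) != 0;                    /* srl, or branch on sign */
}
```

That is about four extra simple ALU ops on the path where you actually care about overflow, and zero extra cost on all the other adds, which the compiler stays free to reorder.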