A function may have an unlimited number of input
parameters and an unlimited number of outputs (through
pointers or references or class members or whatever),
all of which could in principle have different masks.
That would add a lot of complexity to the ABI for
something that is rarely needed. I prefer to specify a
mask as an explicit function parameter in the rare
cases that it is needed.
I do not see in which situation a vectorized program would need to pass multiple masks to a function. If we vectorize a loop with a function call inside, for each iteration i, either the function is called (bit mask 1) or not called (bit mask 0). Masks originate from the control flow of the original program. There is only one control flow regardless of the number of function parameters. So a single mask register is sufficient.
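To make the argument concrete, here is a minimal Python sketch of a loop with a conditional call and its vectorized counterpart. The names (`scalar_loop`, `masked_call`, `vectorized_loop`) are hypothetical, and letting inactive lanes keep the value of the first argument is only an assumption made for the illustration. The point is that one mask, derived from the single branch condition, covers every parameter of the call.

```python
def scalar_loop(a, b, cond, f):
    """Original loop: f is either called or not called in each iteration."""
    out = list(a)
    for i in range(len(a)):
        if cond(a[i], b[i]):
            out[i] = f(a[i], b[i])
    return out


def masked_call(f, mask, a, b):
    """Hypothetical vectorized call: a single mask applies to all parameters.

    Assumption for this sketch: lanes with mask bit 0 keep the value of a.
    """
    return [f(x, y) if m else x for m, x, y in zip(mask, a, b)]


def vectorized_loop(a, b, cond, f):
    # The mask comes from the control flow (the branch condition),
    # not from the individual function parameters.
    mask = [cond(x, y) for x, y in zip(a, b)]
    return masked_call(f, mask, a, b)


a = [1, 2, 3, 4]
b = [10, 0, 30, 0]
nonzero = lambda x, y: y != 0
add = lambda x, y: x + y
print(scalar_loop(a, b, nonzero, add))
print(vectorized_loop(a, b, nonzero, add))
```

Both versions give the same result here, whether the called function takes two parameters or twenty.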
So, the total maximum number of input dependencies is
five, but the maximum number of input dependencies at
one stage in the pipeline can be limited to three if
necessary.
The pipeline stage does not make a difference. If you want to sustain an execution rate of three 5-input instructions per clock cycle, you need the bandwidth to read 15 inputs per cycle on average from the register file. It does not matter in which order and at which pipeline stage they are read.
I agree that limiting the number of outputs is even more important. (Register file area scales with the square of the number of ports and power scales with its cube. But for read ports, you can always follow the brute-force approach of duplicating the register file and get a linear scaling.)
Yes, x86 instructions also output flags, but x86 CPUs use tricks to turn them into single-output instructions by having slightly wider physical registers, storing the flags in the physical destination register, and renaming the architectural flags register to the physical destination of the last instruction issued.
But input ports are also a bottleneck. Specializing registers makes it possible to partition this complexity. Most ISAs use different registers for integer and FP, for instance.
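As a rough back-of-the-envelope illustration of the numbers above, the following Python sketch computes the required read bandwidth and compares a single register file with duplicated copies. It takes the quadratic area scaling quoted above at face value, ignores write ports entirely (a strong simplification), and uses hypothetical function names.

```python
def reads_per_cycle(issue_width, inputs_per_instruction):
    """Average register reads needed per cycle to sustain the issue rate."""
    return issue_width * inputs_per_instruction


def monolithic_area(read_ports):
    """Relative area of one register file, assuming area ~ ports squared."""
    return read_ports ** 2


def duplicated_area(read_ports, ports_per_copy):
    """Relative area when the file is duplicated into identical copies,
    each serving only ports_per_copy read ports (write ports ignored)."""
    copies = -(-read_ports // ports_per_copy)  # ceiling division
    return copies * ports_per_copy ** 2


ports = reads_per_cycle(3, 5)
print(ports, monolithic_area(ports), duplicated_area(ports, 3))
```

Under these assumptions, doubling the number of read ports quadruples the area of the single file but only doubles the area of the duplicated version.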
making the operation dependent
on the former value of the architectural destination
register defeats the purpose of register renaming:
write-after-read dependencies are still there.
All superscalar processors handle this by renaming the
register so that the input and output use two
different physical registers.
Let's take an example:
Non-masked code:
    mul_add.d v0, v1, v3, v4
    mul_add.d v0, v5, v6, v7
Renaming maps the two v0 destinations to different physical registers. Assuming all inputs are ready and we have 2 FMA units with a latency of 5 cycles, both instructions run in parallel and the code takes 5 cycles. So far so good.
Now take the same two instructions with a mask and merging. Renaming still maps v0 to different physical registers p0 and p1, but the second FMA now has a data dependency on the older value of v0, which is p0. Assuming a by-the-book OoO scheduler, the instructions will be treated as dependent and the result will only be available after 10 cycles.
Now with the 2-µop option, expansion and renaming turn the code into two FMA µops, each followed by a merge µop.
Both FMAs can start immediately and produce their results after 5 cycles. Then the first merge executes in, say, 1 cycle, followed by the second, dependent merge. Total latency is 7 cycles.
In any case, renaming was unable to completely reorder the two instructions.
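The three latencies can be reproduced with a toy dependency model in Python. This is only a sketch under the assumptions stated above (all inputs ready at cycle 0, enough execution units, 5-cycle FMA, 1-cycle merge); the µop names and the `finish_times` helper are hypothetical.

```python
def finish_times(uops):
    """uops: list of (name, latency, dependencies) in program order.

    Each µop starts as soon as all of its dependencies have finished;
    execution units are assumed to be always available.
    """
    done = {}
    for name, latency, deps in uops:
        start = max((done[d] for d in deps), default=0)
        done[name] = start + latency
    return done


FMA, MERGE = 5, 1  # latencies in cycles, as assumed in the text

# Non-masked: the two FMAs are independent after renaming.
unmasked = [("fma1", FMA, []), ("fma2", FMA, [])]

# Masked with merging, one µop each: the second FMA reads the old v0.
merging_single = [("fma1", FMA, []), ("fma2", FMA, ["fma1"])]

# Masked with merging, split into FMA + merge µops.
merging_split = [
    ("fma1", FMA, []),
    ("fma2", FMA, []),
    ("merge1", MERGE, ["fma1"]),
    ("merge2", MERGE, ["fma2", "merge1"]),
]

for label, code in [("unmasked", unmasked),
                    ("merging, 1 µop", merging_single),
                    ("merging, 2 µops", merging_split)]:
    print(label, max(finish_times(code).values()))
```

The model gives 5, 10 and 7 cycles respectively, matching the figures in the discussion.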
I agree you could have a very smart scheduler that schedules instructions just 5 cycles before the mask becomes available so it is ready just in time. But it would add a lot of complexity in a critical part of the pipeline. It is better to break the work down into enough simple µops and let the out-of-order engine do the heavy lifting. Then you can spend the resources you just saved on a wider issue width and higher clock frequency.
I have not been able to get my hands on any AVX-512
processor yet, so I am not able to test it. If you
have any information on the microarchitecture of these
processors then please give me a link.
At some point Intel had the Knights Landing software optimization guide on their website, but it seems they took it down. :(
About masks, they say:
AVX-512 permits instructions to use masking. The best performance usually happens for instructions that do not use masking. Instructions that mask with zeroing usually have similar performance. Instructions that mask with merging can occasionally perform much slower, since there is an additional dependency introduced to the instruction. Keep in mind performance considerations when selecting the masking mode.
Their formulation suggests they follow the single-instruction approach with naive scheduling. They do not appear to split into two µops after all. But if they write "much slower" they probably mean it...
The takeaway is that the masking mode (zeroing or merging) should be available as early as possible in the pipeline, ideally at decode just by looking at the instruction. And the programmer/compiler should use zeroing whenever possible.
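The difference between the two masking modes can be sketched in a few lines of Python. The function names are hypothetical and vectors are modeled as plain lists. What matters is the signature: the merging variant takes the old destination as an extra input, which is exactly the additional dependency discussed above.

```python
def masked_zeroing(op, a, b, mask):
    """Inactive lanes are set to zero: no dependency on the old destination."""
    return [op(x, y) if m else 0 for x, y, m in zip(a, b, mask)]


def masked_merging(op, a, b, mask, old_dest):
    """Inactive lanes keep the old destination value, so old_dest is an
    additional input that must be ready before the result is complete."""
    return [op(x, y) if m else d
            for x, y, m, d in zip(a, b, mask, old_dest)]


add = lambda x, y: x + y
a, b, mask = [1, 2, 3], [10, 20, 30], [1, 0, 1]
print(masked_zeroing(add, a, b, mask))
print(masked_merging(add, a, b, mask, old_dest=[7, 8, 9]))
```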