I have now had the opportunity to test a Tiger Lake.
I can confirm that it has more execution units to allow a maximum throughput of five instructions per clock cycle. This includes a maximum of two memory reads and two memory writes per clock cycle.
The decoder is still limited to 16 bytes of code or four instructions per clock cycle, as it has been for many years. This makes instruction fetching and decoding a serious bottleneck. Fortunately, the micro-op cache has been increased by 50% to a capacity of more than 2000 instructions. You can reach the maximum throughput of five instructions per clock if the critical part of your program fits into the 2000-entry micro-op cache, if you have a mixture of different instructions, and if you have no long dependency chains.
The micro-op cache becomes more and more important as the decoder is limited in throughput while everything else gets higher and higher throughput. It is important to economize the use of the micro-op cache by avoiding loop unrolling and by keeping critical parts of the code together in a block of less than 2000 instructions.
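As an illustration of this principle, a hot loop can be kept rolled so that its micro-ops fit in the micro-op cache. The following C sketch assumes GCC or a compatible compiler (the unroll pragma is a GCC extension, available since GCC 8); compilers may unroll loops on their own, so this is an illustration of the idea rather than a guaranteed optimization:

```c
/* A minimal sketch: a hot loop kept rolled so that its few micro-ops
   stay resident in the ~2000-entry micro-op cache, instead of being
   unrolled into a long sequence that competes for cache capacity.
   The function name and pragma usage are illustrative assumptions. */
#include <stddef.h>

float dot(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
#pragma GCC unroll 1              /* ask the compiler not to unroll */
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```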
I have not had the chance to test an Ice Lake yet, but it seems to be quite similar to the Tiger Lake.
I have discovered a new and interesting feature that is not mentioned in any document from Intel or elsewhere that I can find: it can forward the value of a memory write to a subsequent memory read with zero latency in some cases. This appears to be somewhat similar to the behavior of AMD Zen2, as I have described here.
But there are important differences between the Intel Tiger Lake and the AMD Zen2. The Zen2 tries to predict whether memory operands have the same address and forwards data accordingly; it pays a penalty if this prediction turns out to be wrong. I have not observed any such misprediction penalties on the Tiger Lake, so it is likely that the fast forwarding on Tiger Lake is based on calculated addresses rather than on prediction. Write operations on the Tiger Lake are split into two micro-ops: the first calculates the address, and the second stores the data to the calculated address. This may allow the processor to detect matching addresses before the data to write are available. The fast forwarding works with 8-bit, 32-bit, and 64-bit operands, but strangely not with 16-bit operands. It does not work with vector registers. The address must be divisible by 4. It works with addressing modes that use a base pointer and offset, but not if the address has an index register or is rip-relative.
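For illustration, the pattern that appears to be forwarded with zero latency is simply a store followed by a dependent load from the same address. The following C sketch uses GNU extended inline assembly (an assumption about the toolchain: GCC or Clang on x86-64) to keep the store/load pair explicit, with a 64-bit operand, a base register, and an aligned address, matching the conditions described above:

```c
/* A minimal sketch of the store-to-load pattern that appears to be
   forwarded with zero latency on Tiger Lake: a 64-bit store through
   a base register to a 4-byte-divisible address, followed by a
   dependent load from the same address. Assumes GCC/Clang extended
   asm on x86-64; the function name is my own for illustration. */
#include <stdint.h>
#include <stdio.h>

int64_t store_load_roundtrip(int64_t value) {
    int64_t slot __attribute__((aligned(8)));  /* address divisible by 4 */
    int64_t result;
    __asm__ volatile(
        "movq %2, (%1)\n\t"   /* store: base register, zero offset  */
        "movq (%1), %0\n\t"   /* dependent load from the same address */
        : "=r"(result)
        : "r"(&slot), "r"(value)
        : "memory");
    return result;
}

int main(void) {
    printf("%lld\n", (long long)store_load_roundtrip(42));
    return 0;
}
```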
The processor has full support for 512-bit vector registers with the AVX-512 instruction set, as well as a lot of additional instruction set extensions.
Intel's optimization manual says:
"There are still some cases where coding to the Intel AVX-512 instruction set yields lower performance than when coding to the Intel AVX2 instruction set. Sometimes it is due to microarchitecture artifacts of longer vectors, in other cases the natural vectors are just not long enough."
It is good that they are honest, but I would not be so reluctant to use 512-bit vector instructions. The 512-bit instructions have a latency of 1 clock cycle for integer instructions and 4 clock cycles for floating point instructions, and a throughput of one or two vector instructions per clock cycle. The clock frequency may be reduced a little by the higher power consumption of 512-bit instructions, but not enough to outweigh the advantage of processing 512 bits at a time.
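For example, processing data in 512-bit chunks with AVX-512 intrinsics looks like the following sketch (assumptions: a compiler with AVX-512 support, e.g. built with -mavx512f, and a CPU that supports it; the function and variable names are mine for illustration):

```c
/* A sketch of 512-bit vector processing with AVX-512 intrinsics.
   Each iteration adds sixteen single-precision floats at a time,
   which is where the throughput advantage over AVX2 comes from. */
#include <immintrin.h>
#include <stddef.h>

void add_arrays(float *a, const float *b, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);     /* load 16 floats  */
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(a + i, _mm512_add_ps(va, vb));
    }
    for (; i < n; i++)                          /* scalar tail     */
        a[i] += b[i];
}
```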
My general recommendation has always been not to optimize for one particular CPU, because that CPU may be obsolete by the time your software reaches the end user. Instead, you should make optimizations that are likely to work well on future processors.
I have updated my microarchitecture manual and my table of instruction timings with the new information on Tiger Lake.