I have now had the opportunity to test a Tiger Lake.
I can confirm that it has more execution units to allow a maximum throughput of five instructions per clock cycle. This includes a maximum of two memory reads and two memory writes per clock cycle.
The decoder is still limited to 16 bytes of code or four instructions per clock cycle, as it has been for many years. This makes instruction fetching and decoding a serious bottleneck. Fortunately, the micro-op cache has been increased by 50% to a capacity of more than 2000 instructions. You can reach the maximum throughput of five instructions per clock if the critical part of your program fits into the 2000-entry micro-op cache, if you have a mixture of different instructions, and if you have no long dependency chains.
The micro-op cache becomes more and more important as the decoder is limited in throughput while everything else gets higher and higher throughput. It is important to economize the use of the micro-op cache by avoiding loop unrolling and by keeping critical parts of the code together in a block of less than 2000 instructions.
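As an illustration of this principle, a hot loop can be kept rolled so that its micro-ops fit in the micro-op cache. The following C sketch assumes GCC or a compatible compiler (the unroll pragma is a GCC extension, available since GCC 8); compilers may unroll loops on their own, so this is an illustration of the idea rather than a guaranteed optimization:

```c
/* A minimal sketch: a hot loop kept rolled so that its few micro-ops
   stay resident in the ~2000-entry micro-op cache, instead of being
   unrolled into a long sequence that competes for cache capacity.
   The function name and pragma usage are illustrative assumptions. */
#include <stddef.h>

float dot(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
#pragma GCC unroll 1              /* ask the compiler not to unroll */
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```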
I have not had the chance to test an Ice Lake yet, but it seems to be quite similar to the Tiger Lake.
I have discovered a new and interesting feature that is not mentioned in any document from Intel or elsewhere that I can find: it can forward the value of a memory write to a subsequent memory read with zero latency in some cases. This appears to be somewhat similar to the behavior of AMD Zen2, as I have described here.
But there are important differences between the Intel Tiger Lake and the AMD Zen2. The Zen2 tries to predict whether memory operands have the same address and forwards data accordingly; it pays a penalty if this prediction turns out to be wrong. I have not observed any such misprediction penalties on the Tiger Lake, so it is likely that the fast forwarding on Tiger Lake is based on calculated addresses rather than on prediction. Write operations on the Tiger Lake are split into two micro-ops: the first calculates the address, and the second stores the data to the calculated address. This may allow the processor to detect matching addresses before the data to write are available. The fast forwarding works with 8-bit, 32-bit, and 64-bit operands, but strangely not with 16-bit operands. It does not work with vector registers. The address must be divisible by 4. It works with addressing modes that use a base pointer and offset, but not if the address has an index register or is rip-relative.
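For illustration, the pattern that appears to be forwarded with zero latency is simply a store followed by a dependent load from the same address. The following C sketch uses GNU extended inline assembly (an assumption about the toolchain: GCC or Clang on x86-64) to keep the store/load pair explicit, with a 64-bit operand, a base register, and an aligned address, matching the conditions described above:

```c
/* A minimal sketch of the store-to-load pattern that appears to be
   forwarded with zero latency on Tiger Lake: a 64-bit store through
   a base register to a 4-byte-divisible address, followed by a
   dependent load from the same address. Assumes GCC/Clang extended
   asm on x86-64; the function name is my own for illustration. */
#include <stdint.h>
#include <stdio.h>

int64_t store_load_roundtrip(int64_t value) {
    int64_t slot __attribute__((aligned(8)));  /* address divisible by 4 */
    int64_t result;
    __asm__ volatile(
        "movq %2, (%1)\n\t"   /* store: base register, zero offset  */
        "movq (%1), %0\n\t"   /* dependent load from the same address */
        : "=r"(result)
        : "r"(&slot), "r"(value)
        : "memory");
    return result;
}

int main(void) {
    printf("%lld\n", (long long)store_load_roundtrip(42));
    return 0;
}
```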
The processor has full support for 512-bit vector registers with the AVX-512 instruction set, as well as a lot of additional instruction set extensions.
Intel's optimization manual says:
"There are still some cases where coding to the Intel AVX-512 instruction set yields lower performance than when coding to the Intel AVX2 instruction set. Sometimes it is due to microarchitecture artifacts of longer vectors, in other cases the natural vectors are just not long enough."
It is good that they are honest, but I would not be so reluctant to use 512-bit vector instructions. The 512-bit instructions have a latency of 1 clock cycle for integer instructions and 4 clock cycles for floating point instructions, and a throughput of one or two vector instructions per clock cycle. The clock frequency may be reduced a little by the higher power consumption of 512-bit instructions, but not enough to outweigh the advantage of processing 512 bits at a time.
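For example, processing data in 512-bit chunks with AVX-512 intrinsics looks like the following sketch (assumptions: a compiler with AVX-512 support, e.g. built with -mavx512f, and a CPU that supports it; the function and variable names are mine for illustration):

```c
/* A sketch of 512-bit vector processing with AVX-512 intrinsics.
   Each iteration adds sixteen single-precision floats at a time,
   which is where the throughput advantage over AVX2 comes from. */
#include <immintrin.h>
#include <stddef.h>

void add_arrays(float *a, const float *b, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);     /* load 16 floats  */
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(a + i, _mm512_add_ps(va, vb));
    }
    for (; i < n; i++)                          /* scalar tail     */
        a[i] += b[i];
}
```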
My general recommendation has always been not to optimize for one particular CPU, because that CPU may be obsolete by the time your software reaches the end user. Instead, you should make optimizations that are likely to work well on future processors.
I have updated my microarchitecture manual and my table of instruction timings with the new information on Tiger Lake.