Agner's CPU blog


 
Test results for AMD Ryzen - Agner - 2017-05-02
Ryzen analyze - Daniel - 2017-05-02
Ryzen analyze - Agner - 2017-05-02
Test results for AMD Ryzen - Peter Cordes - 2017-05-02
Test results for AMD Ryzen - Agner - 2017-05-03
Test results for AMD Ryzen - Phenominal - 2017-05-06
Test results for AMD Ryzen - Agner - 2017-05-06
Test results for AMD Ryzen - Phenominal - 2017-05-06
Test results for AMD Ryzen - Agner - 2017-05-06
Test results for AMD Ryzen - Tacit Murky - 2017-05-05
Test results for AMD Ryzen - Tacit Murky - 2017-07-08
Test results for AMD Ryzen--POPCNT - Xing Liu - 2017-05-08
Test results for AMD Ryzen--POPCNT - Agner - 2017-05-11
Test results for AMD Ryzen - Justin - 2017-07-11
EPYC - Agner - 2017-07-11
Test results for AMD Ryzen - Lefty - 2017-07-12
Test results for AMD Ryzen - Agner - 2017-07-12
Test results for AMD Ryzen - cvax - 2017-07-13
Test results for AMD Ryzen - Agner - 2017-07-13
Test results for AMD Ryzen - Lefty - 2017-07-13
Test results for AMD Ryzen - Agner - 2017-07-13
Test results for AMD Ryzen - Travis - 2017-07-13
Test results for AMD Ryzen - Johannes - 2017-07-25
 
Test results for AMD Ryzen
Author: Agner Date: 2017-05-02 04:22
The new Ryzen processor from AMD represents a complete redesign of the CPU microarchitecture. It is the first in a series of "Zen" architecture processors. I must say that this redesign is quite successful and puts AMD back in the game after several years of lagging behind Intel in performance.

The Ryzen has a micro-operation cache which can hold 2048 micro-operations or instructions. This is sufficient to hold the critical innermost loop in most programs. There have been discussions of whether the Ryzen would be able to run four instructions per clock cycle or six, because the documents published by AMD were unclear on this point. Well, my testing shows that it was not four, and not six, but five. As long as the code is running from the micro-operation cache, it can execute five instructions per clock, where Intel has only four. Code that doesn't fit into the micro-operation cache runs from the traditional code cache at a maximum rate of four instructions per clock. However, the rate of fetching code from the code cache is not 32 bytes per clock, as some documents seem to indicate, but mostly around 16 bytes per clock. The maximum I have seen is 17.3 bytes per clock. This is a likely bottleneck since most instructions in vector code are more than four bytes long.

The combination of a compare instruction and a conditional jump can be fused together into a single micro-op. This makes it possible to execute a tiny loop with up to six instructions in one clock cycle per iteration. Except for tiny loops, the throughput for jumps is one jump per two clock cycles if the jump is taken, or two not-taken jumps per clock cycle.
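As an illustration (a hypothetical example, not from the test suite), a simple reduction loop typically compiles to an inner loop of a handful of instructions ending in cmp + jne, which is exactly the pair that can be fused:

```c
#include <stddef.h>
#include <stdint.h>

/* A tiny reduction loop. With optimization, compilers typically emit an
   inner loop of just a few instructions ending in cmp + jne -- a pair
   that Ryzen can fuse into a single micro-op, so a loop of up to six
   instructions can retire in one clock cycle per iteration. */
uint64_t sum_u64(const uint64_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];   /* body: load + add; loop overhead: inc, cmp + jne */
    return s;
}
```

The exact instruction count depends on the compiler and options; the point is only that the loop-closing compare and branch can count as one micro-op.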

256-bit vector instructions (AVX instructions) are split into two micro-ops handling 128 bits each. Such instructions take only one entry in the micro-operation cache. A few other instructions also generate two micro-ops. The maximum throughput of the micro-op queue after the decoders is six micro-ops per clock. The stream of micro-ops from this queue is distributed among ten pipelines: four pipes for integer operations on general-purpose registers, four pipes for floating-point and vector operations, and two for address calculation. This means that a throughput of six micro-ops per clock cycle can be obtained if there is a mixture of integer and vector instructions.

Let us compare the execution units of AMD's Ryzen with current Intel processors. AMD has four 128-bit units for floating-point and vector operations. Two of these can do addition and two can do multiplication. Intel has two 256-bit units, both of which can do addition as well as multiplication. This means that floating-point code with scalars or vectors of up to 128 bits will execute on the AMD processor at a maximum rate of four instructions per clock (two additions and two multiplications), while the Intel processor can do only two. With 256-bit vectors, AMD and Intel can both do two instructions per clock. Intel beats AMD on 256-bit fused multiply-and-add instructions, where AMD can do one per clock while Intel can do two. Intel is also better than AMD on 256-bit memory writes: Intel has one 256-bit write port while the AMD processor has one 128-bit write port. We will soon see Intel processors with 512-bit vector support, while it might take a few more years before AMD supports 512-bit vectors.

However, most of the software on the market lags several years behind the hardware. As long as the software uses only 128-bit vectors, the performance of the Ryzen processor is quite competitive. The AMD core can execute six micro-ops per clock while Intel can do only four. But there is a problem with doing so many operations per clock cycle. It is of course not possible to execute two instructions simultaneously if the second instruction depends on the result of the first. The high throughput of the processor puts an increased burden on the programmer and the compiler to avoid long dependency chains. The maximum throughput can only be obtained if there are many independent instructions that can be executed simultaneously.
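To make the dependency-chain point concrete, here is a minimal sketch (hypothetical function names): a serial sum forms one long chain where each addition waits for the previous one, while splitting the sum into four independent accumulators gives the out-of-order core four chains to overlap across its execution units.

```c
#include <stddef.h>

/* One long dependency chain: each add depends on the previous result,
   so throughput is limited by the add latency, not by the number of
   available execution units. */
double sum_serial(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent partial sums: the scheduler can keep four additions
   in flight at once, approaching the throughput limit of the core. */
double sum_split(const double *a, size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)       /* remaining elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note that the two functions may round differently for general floating-point input, since the order of additions changes; for performance purposes they do the same work.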

This is where simultaneous multithreading comes in. You can run two threads in the same CPU core (this is what Intel calls hyperthreading). Each thread will then get half of the resources. If the CPU core has a higher capacity than a single thread can utilize then it makes sense to run two threads in the same core. The gain in total performance that you get from running two threads per core is much higher in the Ryzen than in Intel processors because of the higher throughput of the AMD core (except for 256-bit vector code).

The Ryzen saves power quite aggressively. Unused units are clock gated, and the clock frequency varies quite dramatically with the workload and the temperature. In my tests, I often saw a clock frequency as low as 8% of the nominal frequency in cases where disk access was the limiting factor, while the clock frequency could be as high as 114% of the nominal frequency after a very long sequence of CPU-intensive code. Such a high frequency cannot be obtained when all eight cores are active, because of the increase in temperature.

The varying clock frequency was a big problem for my performance tests because it was impossible to get precise and reproducible measurements of computation times. It helps to warm up the processor with a long sequence of dummy calculations, but the clock counts were still somewhat inaccurate. The Time Stamp Counter (TSC), which is used for measuring the execution time of small pieces of code, counts at the nominal frequency. The Ryzen processor has another counter called the Actual Performance Frequency Clock Counter (APERF), which is similar to the Core Clock Counter in Intel processors. Unfortunately, the APERF counter can only be read in kernel mode, unlike the TSC, which is accessible to the test program running in user mode. I had to calculate the actual clock counts in the following way: the TSC and APERF counters are both read in a device driver immediately before and after a run of the test sequence. The ratio between the TSC count and the APERF count obtained in this way is then used as a correction factor, which is applied to all TSC counts obtained during the run of the test sequence. This method is awkward, but the results appear to be quite precise, except in the cases where the frequency varies considerably during the test sequence. My test program is available at www.agner.org/optimize/#testp.
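The correction step amounts to a simple rescaling. Here is a minimal sketch with made-up counter values (the real counters must of course be read in a device driver, as described above):

```c
#include <stdint.h>

/* The TSC ticks at the nominal frequency; the APERF counter ticks at
   the actual core frequency. Reading both immediately before and after
   the test run gives a ratio that rescales raw TSC deltas into actual
   core clock cycles. Counter values here are hypothetical. */
double clock_correction(uint64_t tsc_before, uint64_t tsc_after,
                        uint64_t aperf_before, uint64_t aperf_after) {
    return (double)(aperf_after - aperf_before) /
           (double)(tsc_after - tsc_before);
}

/* Usage: corrected_cycles = raw_tsc_delta * clock_correction(...);
   e.g. a factor of 1.14 means the core ran at 114% of nominal. */
```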

AMD has a different way of dealing with instruction set extensions than Intel. AMD keeps adding new instructions and removes them again if they fail to gain popularity, while Intel keeps supporting even the most obscure and useless undocumented instructions dating back to the first 8086. AMD introduced the FMA4 and XOP instruction set extensions with Bulldozer, and some not very useful extensions called TBM with Piledriver. Now they are dropping all of these again: XOP and TBM are no longer supported in Ryzen. FMA4 is not officially supported on Ryzen, but I found that the FMA4 instructions actually work correctly on it, even though the CPUID instruction says that FMA4 is not supported.

Detailed results and list of instruction timings are in my manuals: www.agner.org/optimize/#manuals.

   
Ryzen analyze
Author: Daniel Date: 2017-05-02 04:23
Hello Agner, I have recently read your analysis and test of the new Ryzen microprocessor. Good work! I would like to ask if you would be so kind as to do your testing with an Excavator CPU (Bristol Ridge on some AM4 motherboard with DDR4), so that the analysis of the BD family will be complete.

regards Daniel

   
Ryzen analyze
Author: Agner Date: 2017-05-02 07:46
Daniel wrote:
I would like to ask if you would be so kind as to do your testing with an Excavator CPU.
I have not been able to get my hands on an Excavator and find a motherboard that fits it. If you have one, and you give me remote access, I can test it.
   
Test results for AMD Ryzen
Author: Peter Cordes Date: 2017-05-02 14:16
On Ryzen, Bulldozer, and/or Jaguar, does vxorps-zeroing of a ymm register still take only 1 micro-op, unlike the general case, where vxorps ymm1,ymm2,ymm3 is split into two?

I'm worried that the special-case of xor-zeroing might not be identified until after the decoder has already split it in two. Or that if it still needs an execution port, ymm zeroing might still use two instead of taking advantage of AVX implicit zero-extension to high lanes. (Previously posted at stackoverflow.com/questions/43713273/is-vxorps-zeroing-on-amd-jaguar-bulldozer-zen-faster-with-xmm-registers-than-ymm, but this is probably a better place to ask.)

If ymm-zeroing is slower on any CPUs, then compilers should use vxorps xmm0,xmm0,xmm0 even for _mm256_setzero_ps.

---

For _mm512_setzero_ps, using a VEX-encoded instruction saves a byte vs. EVEX. (reported to clang as bug 32862).
No existing AVX512 hardware has a problem with mixing VEX and EVEX vector instructions, or vector widths, AFAIK. And there's no reason to expect problems on future CPUs because of AVX's zero-extending to VLMAX.

----

On Intel CPUs, the choice affects whether it warms up the 256b execution units (and throttles the max-turbo on Xeon CPUs). So calling a noinline function that returns _mm256_setzero_ps wouldn't be a reliable way to warm up the execution units. But it already wasn't portably reliable anyway, because MSVC already always uses 128b vxorps for zeroing ymm/zmm regs. Returning 256b all-ones would work, but only clang and icc avoid loading a constant when AVX2 isn't available. See all 4 compilers on godbolt.

   
Test results for AMD Ryzen
Author: Agner Date: 2017-05-03 00:12
Peter Cordes wrote:
does vxorps-zeroing of a ymm register still take only 1 micro-op, unlike the general case, where vxorps ymm1,ymm2,ymm3 is split into two?

I'm worried that the special-case of xor-zeroing might not be identified until after the decoder has already split it in two. Or that if it still needs an execution port, ymm zeroing might still use two instead of taking advantage of AVX implicit zero-extension to high lanes.

xor'ing a ymm register with itself generates two micro-ops, while xor'ing an xmm register with itself generates only one micro-op. So you have a point: zeroing a register as 128 bits and relying on implicit zero extension to 256 bits is faster. The throughput of these is four micro-ops per clock, i.e. four 128-bit xors or two 256-bit xors per clock. These micro-ops can go to any of the four floating-point units in the Ryzen. There is no dependence on the previous value of the register.
   
Test results for AMD Ryzen
Author: Phenominal Date: 2017-05-06 00:29
Agner wrote:
The maximum I have seen is 17.3 bytes per clock. This is a likely bottleneck since most instructions in vector code are more than four bytes long.
Does SMT have any effect on this? If SMT is disabled in the BIOS, it may fetch 32 bytes. Also, in general, how is IPC affected in single-threaded tests by the SMT BIOS setting?

Thanks for the great documents.

   
Test results for AMD Ryzen
Author: Agner Date: 2017-05-06 03:08
Phenominal wrote:
Does SMT have any effect on this?
These results are for a single thread. The maximum throughput is only slightly more than half of this if two threads are running in the same core.
   
Test results for AMD Ryzen
Author: Phenominal Date: 2017-05-06 04:33
Thanks for the reply. What I was wondering is, since some parts of the core are statically partitioned, whether enabling/disabling SMT in the BIOS makes a difference to a single-threaded test running on a single core. Some reviewers have mentioned that disabling SMT in the BIOS improved some benchmarks. This could also impact instruction fetch. Is there any truth to these statements?
   
Test results for AMD Ryzen
Author: Agner Date: 2017-05-06 09:41
Phenominal wrote:
Some reviewers have mentioned that disabling SMT in the BIOS improved some benchmarks. This could also impact instruction fetch. Is there any truth to these statements?
The benchmarks are probably using as many threads as possible. The cost of SMT is that all the threads run at half speed, and you get additional cache evictions and branch mispredictions. The processor core has no problem allocating all resources to a single thread when it is only running one thread.

I don't have physical access to the machine I have tested so I haven't been able to experiment with the BIOS setting, but I haven't had any reason to do so either.

   
Test results for AMD Ryzen
Author: Tacit Murky Date: 2017-05-05 12:24
Hello, Agner.
Here are the latest results from the AIDA64 HW-bench for Zen: users.atw.hu/instlatx64/AuthenticAMD0800F11_K17_Zen_InstLatX64.txt . We can see it takes 0.23 clocks to execute «2299 LNOP :LNOP8» (an 8-byte long NOP), which makes it ~35 B/cl. However, it's not clear whether this runs from L1I or from L0m (the mop cache). It's also not clear about generating 2 mops in 1 decoder lane: is it possible for all 2-mop instructions or just AVX-256? So, can 2 mops fit in 1 mop entry of the L0m cache, even if it's not an AVX-256 instruction? We only know that microcoded instructions are not cacheable and are read directly from the mROM. Where do „fused“ 2-mop instructions dissolve into 2 distinct mops?

It is important to know the topology and restrictions of L0m. A code portion from L1I (32 B, probably) generates a certain amount of mops to be cached. An Intel CPU requires a cached 32 B portion to have 1-18 mops to fit into 1-3 L0m lines (6 mops each), all located in a common set. And it's not possible to break a multi-mop instruction between lines. A cached portion must have only 1 entry point (at its start); jumping into the middle will cause a miss and refill, so there can be copies of the same portion with different entry points. And there is a maximum of 2 jumps per line. Zen must have similar rules, which must be tested. Remember: some 4 years ago you did a test of Sandy Bridge's L0m and sent me an xls file with some interesting results. I hope you still have that code.

Eviction policy is also important. Is L0m inclusive with L1I? Will L0m flush on a context switch? We do know that L0m is statically divided between 2 threads. Also, L0m decreases the branch mispredict penalty (if the target address is cached) — by a yet unknown amount. Is it possible to read a line for one thread and write one for another in the same cycle?

It would be good to know the details of the 6 OoO queues (14 mops each) for the 6 GPR execution ports. How are mops distributed among them at allocation? Then, knowing that FMAs use 3 inputs, borrowing 1 read port „aside“, how many vector reads can be made per clock with both FMAs loaded?

   
Test results for AMD Ryzen
Author: Tacit Murky Date: 2017-07-08 13:23
Finally, the optimization guide has arrived: support.amd.com/TechDocs/55723_SOG_Fam_17h_Processors_3.00.pdf . There are many curious details; here are some notes:

1. There are many statements about 32 B I-fetch (p. 19, 31). But page 29 says: «Processor can read an aligned 64-byte fetch block every cycle, [so] aligning the end of the loop to the last byte of a 64-byte cache line is the best thing to do.» Perhaps they mean „L1I can read“? Later, on p. 31: «[There is] 20-entry Instruction Byte Queue (IBQ); each entry holding 16 instruction bytes… Decode unit scans 2 of these windows in a given cycle… The pick window is 32 byte aligned on a 16-byte boundary. Having 16 byte aligned branch targets gets maximum picker throughput… Only the first pick slot (of 4) can pick instructions greater than 8 bytes in length. Avoid having more than 1 instruction in a sequence of 4 that is greater than 8 bytes in length.» So, 32 B/cl. is possible if all instructions are 8 B long and aligned. This restriction should not apply to op-cache fetches.

2. According to pp. 19 and 31, 8 macro-ops/cl. are fetched from the op cache; but only 6 are allocated in the scheduler(s), so there is no way to verify the former number.

3. Some details for way prediction for L1D (p.24).

4. Nothing is said about the famed «neuro-predictor» (a perceptron, actually). However, this is strange (p. 28): «The conditional branch predictor uses a global history scheme that keeps track of the previously executed branches. Global history is not updated for not-taken branches. For this reason, dynamic branches which are biased towards not-taken are preferred.» So, how does this history register work if no zeroes are written in for not-taken jumps? Clearly, they don't mean never-taken branches.

5. More (p. 29): «Fetch windows are tracked in a 64-entry (32 entries in SMT mode) FIFO [queue] from fetch until retirement. Each entry holds branch and cacheline information for up to a full 64-byte cacheline. If a single BTB entry is not sufficient to allow prediction to the end of the cache line, additional entries are used. If no branches are identified in a cacheline, the fetch window tracking structure will use a single entry to track the entire cacheline.» So, are these „additional entries“ used in the fetch-window tracking queue (not in the BTB)? I think this is the equivalent of the branch buffer in Intel CPUs. Only this one limits not only the number of in-flight jumps in the core (per thread), but also the number of cache lines of code (to 64).

6. P. 32 gives a few details about the op cache. Nothing is said about how many op-cache „lines“ (8 Mops each) can hold a cached 64 B code portion; however, «OC entry terminates at the end of a 64-byte aligned memory region». If that means it's not possible to hold more than 8 decoded instructions in a 64 B portion — that's too stupid to be true. Intel's mop cache can hold 18 mops for a 32 B portion.

7. P. 35 says about FPU port reuse: «If data for Pipe3 or the 3rd operand can be bypassed from a result generated that same cycle, then Pipe3 can execute an operation even when either pipe0 or pipe1 require a 3rd source.» This means it's possible to execute 2x(FMA+FADD), i.e. 6 operations per clock, if no more than 8 new source registers are read and 2 more are reused.

8. P.38 wrongly says there is a 44-entry load buffer in the LSU. It's 72 reads.

9. It's a pity that the referenced «Family 17h Instruction Latencies version_1-00.xlsx» file cannot be found anywhere (yet).

   
Test results for AMD Ryzen--POPCNT
Author: Xing Liu Date: 2017-05-08 11:06
Hi Agner,

Nice work! I see you mentioned that the throughput of the POPCNT instruction on Ryzen is 0.25 cycles. Does that mean a single Ryzen core is able to execute 4 POPCNTs (with 64-bit register operands) at the same time? Thank you.

   
Test results for AMD Ryzen--POPCNT
Author: Agner Date: 2017-05-11 10:15
Xing Liu wrote:
Does that mean a single Ryzen core is able to execute 4 POPCNT (64bit register operands) at the same time?
Yes
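A sketch of exploiting that throughput (assuming GCC or Clang, whose __builtin_popcountll compiles to the POPCNT instruction when the target supports it): four independent accumulators let up to four POPCNTs be in flight per cycle.

```c
#include <stddef.h>
#include <stdint.h>

/* Population count over an array (hypothetical example). The four
   accumulators form independent dependency chains, matching the four
   POPCNTs per clock (0.25 cycle reciprocal throughput) measured on
   Ryzen. Assumes GCC/Clang builtins. */
uint64_t popcount_array(const uint64_t *a, size_t n) {
    uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        c0 += (uint64_t)__builtin_popcountll(a[i]);
        c1 += (uint64_t)__builtin_popcountll(a[i + 1]);
        c2 += (uint64_t)__builtin_popcountll(a[i + 2]);
        c3 += (uint64_t)__builtin_popcountll(a[i + 3]);
    }
    for (; i < n; i++)       /* remaining elements */
        c0 += (uint64_t)__builtin_popcountll(a[i]);
    return c0 + c1 + c2 + c3;
}
```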
   
Test results for AMD Ryzen
Author: Justin Date: 2017-07-11 17:03
Thank you for your detailed analysis on Ryzen.

Do you have any thoughts on AMD's new EPYC server platform and how it compares to its competitor, Intel's Skylake-SP?

   
EPYC
Author: Agner Date: 2017-07-11 22:57
I have not tested EPYC, but it uses the Zen microarchitecture so the core performance would be the same as for Ryzen.
   
Test results for AMD Ryzen
Author: Lefty Date: 2017-07-12 09:53
Hi Agner,
Thanks for very interesting post.
I have a quick question: would executing a 128-bit vector instruction on a Skylake CPU use twice as much power as on Zen, because it has to power up all 256 bits of its AVX unit, while Zen only powers up a single 128-bit unit?
   
Test results for AMD Ryzen
Author: Agner Date: 2017-07-12 11:55
No, it will probably power down the upper part when it has not been used for some time.
   
Test results for AMD Ryzen
Author: cvax Date: 2017-07-13 02:38
Does Ryzen still perform poorly on MKL due to the "cripple AMD" function? Would there be any workaround to this?
   
Test results for AMD Ryzen
Author: Agner Date: 2017-07-13 09:40
Yes. The workaround is the same as I have described earlier.
   
Test results for AMD Ryzen
Author: Lefty Date: 2017-07-13 02:41
Thanks for the answer.
I have another question. I am wondering why AVX-256 / AVX-512 is considered superior to AVX-128.
You can pack 2 AVX-128 instructions into one AVX-256 instruction (provided that the instructions are independent), but it will not necessarily execute faster. A CPU with one 256-bit SIMD unit can execute the AVX-256 instruction in one cycle, but a CPU with 2 128-bit SIMD units would just schedule the 2 AVX-128 instructions to execute simultaneously - also in one cycle. I don't see where the advantage is.
   
Test results for AMD Ryzen
Author: Agner Date: 2017-07-13 09:45
There is an advantage only if instruction fetch and decoding is the bottleneck, which is often the case. Future processors will probably have higher throughput for 512 bit instructions. This happens every time they increase the vector size. The first processor to support a new vector size always has inferior performance. This makes sense because there is very little software on the market to support the new vector size.
   
Test results for AMD Ryzen
Author: Travis Date: 2017-07-13 13:48
Lefty wrote:
Thanks for the answer.
I have another question. I am wondering why AVX-256 / AVX-512 is considered superior to AVX-128.
You can pack 2 AVX-128 instructions into one AVX-256 instruction (provided that the instructions are independent), but it will not necessarily execute faster. A CPU with one 256-bit SIMD unit can execute the AVX-256 instruction in one cycle, but a CPU with 2 128-bit SIMD units would just schedule the 2 AVX-128 instructions to execute simultaneously - also in one cycle. I don't see where the advantage is.
In general there might not be an advantage if you are comparing a CPU that offers 2N execution units of width W versus one which offers N execution units of width 2W (e.g., 4 x 128-bit units versus 2 x 256-bit units) - but that's not usually the comparison you would see in actual hardware. In general it is much easier to extend the length of the vector units by 2x than it is to sustainably execute at double the IPC. Indeed, Intel chips have been "stuck" at 4-wide for nearly a decade despite increasing from 128-bits to 512-bits on the vector size.

To double sustained IPC (in code that can provide the necessary ILP in the first place) you'd have to approximately double fetch, decode, rename, and retire throughput, and increase the size of many structures such as the ROB and PRF. Even then you might run out of registers in the ISA, since you effectively need twice as many registers to keep the same amount of data "in flight". Many of these changes aren't just linear increases in hardware complexity, but quadratic or worse - and at some point they aren't even possible without reducing the clock frequency.

Increasing the width of the SIMD units, on the other hand, is generally a straightforward linear increase in complexity (with the exception of some lane-crossing operations, which is why those often have a longer latency and are generally discouraged).

   
Test results for AMD Ryzen
Author: Johannes Date: 2017-07-25 08:54
I have a question about attainable L1 cache bandwidth.

The Software Optimization Guide for AMD Family 17h Processors has been published recently (http://support.amd.com/TechDocs/55723_SOG_Fam_17h_Processors_3.00.pdf). It states that the L1 cache can handle two 16B loads and one 16B store per cycle. However www.agner.org/optimize/microarchitecture.pdf under 19.17 states "The data cache has two 128-bit ports which can be used for either read or write. It can do two reads or one read and one write in the same clock cycle." On what data is this latter statement based?

In contrast to AMD's statements, I haven't been able to get more than 32 B/c from the L1 cache, which gives credibility to the statement found in www.agner.org/optimize/microarchitecture.pdf. But I'm not sure what exactly the problem is. Either each AVX load or store is split into two separate SSE uops, each of which requires a dedicated AGU access; in that case the number of AGUs (two) limits the achievable L1 bandwidth. Or the L1 cache does in fact have only two 128-bit ports, as stated in www.agner.org/optimize/microarchitecture.pdf.
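For what it's worth, the load pattern in question can be sketched in portable C (a hypothetical probe, not the actual test code): two independent 16-byte reads per iteration, which optimizing compilers commonly turn into two 128-bit loads. Timing many repetitions of such a loop with RDTSC is one way to distinguish the two hypotheses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Two independent 16-byte read streams per iteration. Compilers often
   compile a fixed 16-byte memcpy into a single 128-bit load, so each
   iteration issues two loads. The xor/add keeps the loads live so the
   optimizer cannot discard them. Time the call externally to estimate
   sustained bytes per clock. */
uint64_t read_streams(const uint8_t *buf, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i + 32 <= n; i += 32) {
        uint64_t a[2], b[2];
        memcpy(a, buf + i, 16);        /* 16-byte load, stream 1 */
        memcpy(b, buf + i + 16, 16);   /* 16-byte load, stream 2 */
        acc += a[0] ^ a[1] ^ b[0] ^ b[1];
    }
    return acc;
}
```

Whether the generated code really uses 128-bit loads should be verified in the disassembly before drawing conclusions from the timings.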