Agner's CPU blog


Test results for AMD Ryzen
Author: Tacit Murky Date: 2017-07-08 13:23
Finally, the optimization guide has arrived: support.amd.com/TechDocs/55723_SOG_Fam_17h_Processors_3.00.pdf . There are many curious details; here are some notes:

1. There are many statements about a 32 B instruction fetch (p.19, 31). But page 29 says: «Processor can read an aligned 64-byte fetch block every cycle, [so] aligning the end of the loop to the last byte of a 64-byte cache line is the best thing to do.» Perhaps they mean „L1I can read“? Later, on p.31: «[There is] 20-entry Instruction Byte Queue (IBQ); each entry holding 16 instruction bytes… Decode unit scans 2 of these windows in a given cycle… The pick window is 32 byte aligned on a 16-byte boundary. Having 16 byte aligned branch targets gets maximum picker throughput… Only the first pick slot (of 4) can pick instructions greater than 8 bytes in length. Avoid having more than 1 instruction in a sequence of 4 that is greater than 8 bytes in length.» So 32 B/clock is possible only if all 4 picked instructions are 8 B long and aligned. This restriction should not apply to op-cache fetches.

2. According to p.19 and 31, 8 macro-ops/clock are fetched from the op-cache, but only 6 are allocated in the scheduler(s), so there is no way to verify the former number.

3. There are some details on way prediction for the L1D cache (p.24).

4. Nothing is said about the famed «neuro-predictor» (a perceptron, actually). However, this is strange (p.28): «The conditional branch predictor uses a global history scheme that keeps track of the previously executed branches. Global history is not updated for not-taken branches. For this reason, dynamic branches which are biased towards not-taken are preferred.» So how does this history register work if no zeroes are written in for not-taken jumps? Clearly, they don't mean never-taken branches.

5. More (p.29): «Fetch windows are tracked in a 64-entry (32 entries in SMT mode) FIFO [queue] from fetch until retirement. Each entry holds branch and cacheline information for up to a full 64-byte cacheline. If a single BTB entry is not sufficient to allow prediction to the end of the cache line, additional entries are used. If no branches are identified in a cacheline, the fetch window tracking structure will use a single entry to track the entire cacheline.» So, are these „additional entries“ used in the fetch window tracking queue (not in the BTB)? I think this is the equivalent of the branch buffer in Intel CPUs, except that this one limits not only the number of in-flight jumps in the core (per thread), but also the number of cache lines of code in flight (to 64).

6. P.32 gives a few details about the op-cache. Nothing is said about how many op-cache „lines“ (8 macro-ops each) can hold a cached 64 B portion of code; however, «OC entry terminates at the end of a 64-byte aligned memory region». If that means it's not possible to hold more than 8 decoded instructions per 64 B portion, that's too stupid to be true. Intel's µop cache can hold 18 µops for a 32 B portion.

7. P.35 says about FPU port reuse: «If data for Pipe3 or the 3rd operand can be bypassed from a result generated that same cycle, then Pipe3 can execute an operation even when either pipe0 or pipe1 require a 3rd source.» This means it's possible to execute 2×(FMA+FADD), i.e. 6 operations per clock, if no more than 8 new source registers are read and 2 more operands are reused via bypass.

8. P.38 wrongly says there is a 44-entry load buffer in the LSU. The correct figure is 72 loads.

9. It's a pity that the referenced «Family 17h Instruction Latencies version_1-00.xlsx» file cannot be found anywhere (yet).
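
On note 1: the quoted rule about long instructions can be checked mechanically. Here is a minimal Python sketch that scans a list of instruction lengths and reports every run of 4 consecutive instructions containing more than one instruction longer than 8 bytes. The function name and the sliding-window reading of "a sequence of 4" are my assumptions; the guide does not say whether the groups slide or are aligned to pick slots.

```python
from typing import List

LONG_THRESHOLD = 8   # bytes; per the guide, only the first pick slot takes longer instructions
GROUP_SIZE = 4       # pick slots per cycle, per the quoted text


def long_instruction_violations(lengths: List[int]) -> List[int]:
    """Return the start index of every run of GROUP_SIZE consecutive
    instructions that holds more than one instruction longer than
    LONG_THRESHOLD bytes (sliding-window interpretation; an assumption)."""
    violations = []
    for start in range(len(lengths) - GROUP_SIZE + 1):
        window = lengths[start:start + GROUP_SIZE]
        long_count = sum(1 for n in window if n > LONG_THRESHOLD)
        if long_count > 1:
            violations.append(start)
    return violations


if __name__ == "__main__":
    # Hypothetical instruction lengths in bytes, for illustration only.
    stream = [3, 9, 4, 10, 2, 2, 2, 11]
    print(long_instruction_violations(stream))
```

The lengths would have to come from a disassembler listing; the sketch only applies the counting rule and says nothing about actual picker throughput.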
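
On note 5: the capacity limit of the fetch window tracking queue is easy to put into numbers. This is a rough sketch, assuming one entry per branch-free cacheline as quoted, plus a hypothetical parameter `branches_per_entry` for how many branches a single entry can cover. That figure is not in the text quoted above, so treat it as a placeholder.

```python
import math
from typing import Iterable

FIFO_ENTRIES = 64        # per the guide; 32 in SMT mode
CACHELINE_BYTES = 64


def entries_needed(branches_per_line: Iterable[int], branches_per_entry: int = 1) -> int:
    """Estimate tracking entries consumed by a run of cachelines.
    branches_per_entry is a hypothetical parameter, not an AMD figure."""
    total = 0
    for branches in branches_per_line:
        # A cacheline without branches still occupies one entry.
        total += max(1, math.ceil(branches / branches_per_entry))
    return total


def max_code_bytes_in_flight(entries: int = FIFO_ENTRIES) -> int:
    """Upper bound on tracked code bytes: every entry covers a full cacheline."""
    return entries * CACHELINE_BYTES
```

Under these assumptions, straight-line code is capped at 64 × 64 B = 4096 B in flight (2048 B in SMT mode), and branch-dense code reaches the cap sooner because it burns several entries per cacheline.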
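
On note 7: the operand accounting behind the 2×(FMA+FADD) claim can be spelled out. The sketch below assumes an FMA takes 3 source operands and counts as 2 floating-point operations, while an FADD takes 2 sources and counts as 1. The names are illustrative only.

```python
# Source operands and floating-point operations per instruction type.
SOURCES = {"FMA": 3, "FADD": 2}
FLOPS = {"FMA": 2, "FADD": 1}


def per_clock_budget(mix):
    """Return (total source operands, total FP operations) for one clock,
    where mix maps an instruction type to the count issued in that clock."""
    sources = sum(SOURCES[op] * n for op, n in mix.items())
    flops = sum(FLOPS[op] * n for op, n in mix.items())
    return sources, flops


mix = {"FMA": 2, "FADD": 2}
sources, flops = per_clock_budget(mix)
bypassed = 2                          # operands taken from same-cycle results
register_reads = sources - bypassed   # operands that must come from the register file
print(sources, flops, register_reads)  # 10 6 8
```

So the four instructions need 10 source operands in total; with 2 of them bypassed, 8 register reads remain, which matches the figures in note 7.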

 