Agner's CPU blog

Test results for AMD Ryzen

Author: Tacit Murky

Date: 2017-07-08 13:23

Finally, the optimization guide has arrived: support.amd.com/TechDocs/55723_SOG_Fam_17h_Processors_3.00.pdf . There are many curious details; here are some notes:

1. Many statements mention a 32 B I-fetch (p.19, 31). But page 29 says: «Processor can read an aligned 64-byte fetch block every cycle, [so] aligning the end of the loop to the last byte of a 64-byte cache line is the best thing to do.» Perhaps they mean „L1I can read“? Later on p.31: «[There is] 20-entry Instruction Byte Queue (IBQ); each entry holding 16 instruction bytes… Decode unit scans 2 of these windows in a given cycle… The pick window is 32 byte aligned on a 16-byte boundary. Having 16 byte aligned branch targets gets maximum picker throughput… Only the first pick slot (of 4) can pick instructions greater than 8 bytes in length. Avoid having more than 1 instruction in a sequence of 4 that is greater than 8 bytes in length.» So 32 B/cl. is possible only if all instructions are 8 B long and aligned. This restriction should not apply to op-cache fetches.

2. According to p.19 and 31, 8 macro-ops/cl. are fetched from the op-cache, but only 6 are allocated in the scheduler(s), so there is no way to verify the former number.

3. Some details on way prediction for L1D (p.24).

4. Nothing is said about the famed «neuro-predictor» (a perceptron, actually). However, this is strange (p.28): «The conditional branch predictor uses a global history scheme that keeps track of the previously executed branches. Global history is not updated for not-taken branches. For this reason, dynamic branches which are biased towards not-taken are preferred.» So how does this history register work if no zeroes are written in for not-taken jumps? Clearly, they don't mean never-taken branches.

5. More (p.29): «Fetch windows are tracked in a 64-entry (32 entries in SMT mode) FIFO [queue] from fetch until retirement. Each entry holds branch and cacheline information for up to a full 64-byte cacheline. If a single BTB entry is not sufficient to allow prediction to the end of the cache line, additional entries are used. If no branches are identified in a cacheline, the fetch window tracking structure will use a single entry to track the entire cacheline.» So, is it in the fetch window tracking queue (not in the BTB) that these „additional entries are used“? I think this is the equivalent of the branch buffer in Intel CPUs, except that this one limits not only the number of in-flight jumps in the core (per thread), but also the number of cache lines of code (to 64).

6. P.32 gives a few details about the op-cache. Nothing is said about how many op-cache „lines“ (8 mops each) can hold a cached 64 B code portion; however, «OC entry terminates at the end of a 64-byte aligned memory region». If that means it's not possible to hold more than 8 decoded instructions per 64 B portion, that's too stupid to be true. Intel's mop-cache can hold 18 mops for a 32 B portion.

7. P.35 says about FPU port reuse: «If data for Pipe3 or the 3rd operand can be bypassed from a result generated that same cycle, then Pipe3 can execute an operation even when either pipe0 or pipe1 require a 3rd source.» This means it's possible to execute 2x(FMA+FADD), i.e. 6 operations per clock, if no more than 8 new source registers are read and 2 more are reused via bypass.

8. P.38 wrongly says there is a 44-entry load buffer in the LSU. It's 72 loads.

9. A pity that the referenced «Family 17h Instruction Latencies version_1-00.xlsx» file cannot be found anywhere (yet).
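
To make the arithmetic in note 1 concrete, here is a toy Python model of the picker rules quoted from p.31. It is only a sketch: the function name pick_cycles, the greedy in-order picking, and the way the window base is derived from the current byte position are my own assumptions, not something the guide spells out.

```python
def pick_cycles(lengths, start=0, window=32, align=16, slots=4, long_limit=8):
    """Count cycles a toy picker needs for a straight run of instructions.

    lengths: instruction lengths in bytes, in program order.
    start:   byte address of the first instruction.
    Assumed rules (a simplified reading of the quoted text):
      * the pick window is `window` bytes, based on an `align`-byte boundary;
      * at most `slots` instructions are picked per cycle;
      * only the first slot may take an instruction longer than `long_limit`.
    """
    for n in lengths:
        if not 1 <= n <= 15:
            raise ValueError("x86 instruction lengths must be 1..15 bytes")

    pos, i, cycles = start, 0, 0
    while i < len(lengths):
        window_end = pos - (pos % align) + window
        picked = 0
        while i < len(lengths) and picked < slots:
            n = lengths[i]
            if pos + n > window_end:
                break  # instruction crosses the end of this pick window
            if n > long_limit and picked > 0:
                break  # long instruction must wait for the first slot
            pos += n
            i += 1
            picked += 1
        cycles += 1
    return cycles


print(pick_cycles([8] * 8))            # aligned 8-byte instructions
print(pick_cycles([8] * 8, start=8))   # same run, misaligned start
print(pick_cycles([10] * 4))           # every instruction needs slot 0
```

Under this model, eight aligned 8-byte instructions are picked in 2 cycles (32 B/cl.), the same run starting at offset 8 takes 3 cycles, and four 10-byte instructions take 4 cycles because each one has to wait for the first slot.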
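
For note 5, a rough capacity estimate can be sketched in Python. It assumes my reading is right, i.e. that the „additional entries“ are taken from the fetch window tracking queue; the helper names and the branches_per_btb_entry parameter are hypothetical knobs, not numbers quoted above.

```python
import math

CACHELINE_BYTES = 64  # from the quoted text: one entry covers up to 64 bytes


def entries_for_line(branches, branches_per_btb_entry=1):
    """Tracking entries one cacheline consumes (at least one, even with no branches)."""
    return max(1, math.ceil(branches / branches_per_btb_entry))


def max_inflight_code_bytes(queue_entries, branches_per_line=0, branches_per_btb_entry=1):
    """Upper bound on code bytes in flight for a given tracking queue size."""
    per_line = entries_for_line(branches_per_line, branches_per_btb_entry)
    return (queue_entries // per_line) * CACHELINE_BYTES


print(max_inflight_code_bytes(64))  # single thread, branch-free code
print(max_inflight_code_bytes(32))  # SMT mode, branch-free code
print(max_inflight_code_bytes(64, branches_per_line=4, branches_per_btb_entry=2))
```

With branch-free code this gives 4096 B per thread (2048 B in SMT mode); as soon as each line needs two entries, the reachable code window halves.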
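
The operand count behind note 7 can be checked with a few lines of Python. This is just bookkeeping under my assumptions: FMA reads 3 sources and counts as 2 operations, FADD reads 2 sources and counts as 1, and the default of 8 register reads per clock is the figure from my note, not a value quoted from the guide.

```python
SOURCES = {"FMA": 3, "FADD": 2}  # source operands read per instruction
FLOPS = {"FMA": 2, "FADD": 1}    # arithmetic operations per instruction


def cycle_budget(ops, register_reads=8):
    """Return (total sources, total operations, sources that must come from bypass)."""
    sources = sum(SOURCES[op] for op in ops)
    flops = sum(FLOPS[op] for op in ops)
    bypassed = max(0, sources - register_reads)
    return sources, flops, bypassed


print(cycle_budget(["FMA", "FMA", "FADD", "FADD"]))
```

For 2x(FMA+FADD) this yields 10 source operands and 6 operations, so 2 operands have to arrive through the bypass network rather than from the register file.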
