AMD 'heavy equipment' CPUs

News and research about CPU microarchitecture and software optimization
DamageX
Posts: 2
Joined: 2022-12-05, 19:14:41

Post by DamageX » 2022-12-05, 20:01:05

Recently I've been reading your 'microarchitecture' guide with great interest. Thanks for making this available!

Regarding the AMD family 15h CPUs, on page 220 you write:

"It saves power quite aggressively by slowing down the clock speed most of the time. Some versions also lower the voltage to the CPU when the clock speed is reduced. The maximum clock speed is only obtained after a long sequence of CPU-intensive code."

I believe that the power-saving states (the ones with lower frequency/voltage) are entered in response to a software command (specifically, a write to the P-state control register, MSRC001_0062). Generally this is controlled by the OS through ACPI, and there should be BIOS settings to alter this behavior. Core Performance Boost (a.k.a. 'turbo'), on the other hand, does change frequency/voltage without software intervention, although there should also be a BIOS setting to enable or disable it. I wonder whether Core Performance Boost could have affected your experiments on this CPU. In particular, on page 228 you write:

"The measured throughput is two reads or one read and one write per clock cycle when only one thread is active. We would not expect the throughput to be less when multiple threads are active because each core has separate load/store units and level-1 data cache. But my measurements indicate that level-1 cache throughput is several times lower when multiple threads are running, even if the threads are running in different units that do not share any level-1 or level-2 cache."

I wonder whether, with CPB enabled, running a test with multiple threads caused the CPU to switch intermittently between the various boosted P-states and the base frequency, leading to strange results.

According to AMD's manual, the L1 data cache is write-through and there is a queue which holds the data until it can be written to L2. So there is at least an explanation for reduced write performance when two threads run on the same module, since both cores then write through to the same shared L2. Threads on different modules are more difficult to explain...

Re: AMD 'heavy equipment' CPUs

Post by DamageX » 2022-12-27, 20:41:33

FWIW, I ran a test of my own on a Steamroller CPU, with the clock speed locked at 3.9 GHz. My test routines read or wrote 2048 32-bit words (8 KB) using a small loop, and this was done 1,000,000 times. I used one thread, two threads on the same module ("together" below), or two threads on separate modules ("separately" below). In all cases, running two threads on separate modules took the same time as running one thread (within the margin of error).

using MOV EAX,[ESI]
one thread = 0.2728 seconds, two threads separately = 0.2720 seconds, two threads together = 0.2724 seconds
about 2 instructions per clock

using MOV [EDI],EAX
one thread = 0.5335 seconds, two threads separately = 0.5320 seconds, two threads together = 0.5324 seconds
about 1 instruction per clock

using REP LODSD
one thread = 1.3604 seconds, two threads separately = 1.3571 seconds, two threads together = 2.1478 seconds
about 0.4 words per clock, dropping to about 0.25 words per clock with two threads on the same module

using REP STOSD
one thread = 0.2703 seconds, two threads separately = 0.2696 seconds, two threads together = 0.3387 seconds
about 2 words per clock, dropping to roughly 1.55 words per clock with two threads on the same module
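
For anyone who wants to check the per-clock figures, here is a minimal sketch that converts the timings above into 32-bit words transferred per clock cycle. It assumes that the clock really stayed at 3.9 GHz and that each thread ran the full 2048 x 1,000,000 workload, so the result is a per-thread rate. The names (CLOCK_HZ, TIMINGS, words_per_clock, summarize) are made up for this illustration and are not part of the original test program.

```python
# Figures copied from the post; everything else is an illustrative assumption.
CLOCK_HZ = 3.9e9          # locked clock speed stated in the post
WORDS_PER_PASS = 2048     # 32-bit words touched per pass (8 KB)
PASSES = 1_000_000        # number of passes over the buffer

# Seconds for: (one thread, two threads on separate modules, two threads on the same module)
TIMINGS = {
    "MOV EAX,[ESI]": (0.2728, 0.2720, 0.2724),
    "MOV [EDI],EAX": (0.5335, 0.5320, 0.5324),
    "REP LODSD":     (1.3604, 1.3571, 2.1478),
    "REP STOSD":     (0.2703, 0.2696, 0.3387),
}


def words_per_clock(seconds):
    """Per-thread rate, assuming each thread does the whole workload."""
    cycles = seconds * CLOCK_HZ
    return WORDS_PER_PASS * PASSES / cycles


def summarize(timings=TIMINGS):
    """Return per-clock rates and the same-module slowdown for each test."""
    rows = {}
    for name, (one, separate, together) in timings.items():
        rows[name] = {
            "one": words_per_clock(one),
            "separate": words_per_clock(separate),
            "together": words_per_clock(together),
            # Ratio > 1 means two threads on one module ran slower than one thread.
            "slowdown_together": together / one,
        }
    return rows


if __name__ == "__main__":
    for name, r in summarize().items():
        print(f"{name:14s} one={r['one']:.2f} separate={r['separate']:.2f} "
              f"together={r['together']:.2f} slowdown={r['slowdown_together']:.2f}x")
```

The slowdown column makes it easy to see that only the two REP tests are noticeably slower when both threads share a module.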

I believe this confirms your stated L2 write throughput limit of 64 bytes per 6 cycles, as well as your instruction cycle counts listed in the tables. It is interesting that although the Steamroller is supposed to have separate instruction decoding for each thread, the REP LODSD loops still appear to suffer some contention.
