Agner's CPU blog

Haswell upper128 power gating
Author: Peter Cordes  Date: 2015-08-28 22:58
John D. McCalpin wrote:
What I observed was
(1) an initial "emulation" period of 4-7 iterations that took ~2200 cycles each,
(2) a "transition" iteration that took over 31,000 cycles -- about 25,500 halted, and about 5500 active,
(3) "normal" behavior of 512 or 516 cycles for the rest of the iterations (after subtracting the approximate overhead).

The huge number of halted cycles in (2) makes it likely that this isn't just a timer interrupt hitting one of your iterations or something. Probably not migrating from one core to another, either.
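To make the three phases concrete, here is a minimal sketch that labels per-iteration cycle counts and computes the halted share of the transition iteration. The function names, the thresholds, and the example trace are my own assumptions for illustration; the trace is only shaped like the numbers quoted above and is not measured data.

```python
def classify_iterations(cycles, normal=512, tolerance=0.25, transition_min=10_000):
    """Label each per-iteration cycle count as one of three phases.

    Thresholds are illustrative assumptions, not values from the original test.
    """
    labels = []
    for c in cycles:
        if c >= transition_min:
            labels.append("transition")   # very long iteration, mostly halted
        elif c > normal * (1 + tolerance):
            labels.append("emulation")    # slow, but far below the transition spike
        else:
            labels.append("normal")       # full-speed 256b execution
    return labels


def halted_fraction(halted_cycles, active_cycles):
    """Share of an iteration's cycles during which the core was halted."""
    return halted_cycles / (halted_cycles + active_cycles)


# Hypothetical trace built from the quoted figures (not a real measurement).
trace = [2200] * 5 + [31000] + [512, 516, 512, 516]
labels = classify_iterations(trace)
print(labels)
print(f"halted share of transition: {halted_fraction(25_500, 5_500):.0%}")
```

With the quoted 25,500 halted and 5,500 active cycles, roughly four fifths of the transition iteration is spent halted, which is why an interrupt or a core migration looks like a poor explanation.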

I agree with your speculation that this is probably the core halting for voltages to settle after powering up the high half of the execution units. I would have thought it might be possible to keep doing (1) "emulation" while the upper half settled, but this shows that Haswell doesn't work that way.

I wouldn't be so quick to assume that powering down the upper 128 doesn't halt for a lot of cycles, too. There will be some capacitance, so the supply voltage won't go to zero instantly. Garbage signals coming out of the upper128 vector units as the charge dissipates could well be a problem. Clearly there isn't gating to protect the rest of the execution unit from this, or emulation mode could continue while the upper128 powered up, and you'd go from (1) to (3) without a slow transition iteration. (not *that* slow, anyway.)

If we're lucky, powering down the upper128 of the vector units won't slow down integer code that uses different execution units, even though the integer execution units are on the same ports as the vector execution units. So it would be useful to alternate xmm and ymm vector loops, and ymm with non-vector loops, to look for a difference in the number of halted cycles when the CPU decides to power down the upper128.

Maybe the CPU's internal power management won't power down the upper128 unless the core is halted for another reason? Your 1ms of spin-loop threshold seems to rule that out, though.

I assume the whole core halts, affecting both hyperthreads, because it's due to a physical process. I guess you could look for this effect by timing a loop repeatedly on the other hardware thread, and recording timestamps for anomalies. If the timestamp for an extra-slow iteration in one thread was close to the timestamp for the transition iteration in the 256b-vector loop, then you could conclude that the whole core halted.
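The timestamp comparison suggested above could be sketched roughly as follows. This assumes both hardware threads record timestamps from a common time base (for example the TSC), and that each thread has already produced a list of timestamps for its anomalous iterations. The function name, the window size, and the sample values are hypothetical.

```python
from bisect import bisect_left


def matched_transitions(transition_ts, sibling_anomaly_ts, window):
    """Return the transition timestamps that have a sibling-thread anomaly
    within +/- `window` ticks.

    Both inputs are assumed to use the same clock.
    """
    anomalies = sorted(sibling_anomaly_ts)
    matched = []
    for t in transition_ts:
        # First anomaly at or after the start of the window around t.
        i = bisect_left(anomalies, t - window)
        if i < len(anomalies) and anomalies[i] <= t + window:
            matched.append(t)
    return matched


# Made-up timestamps, purely to show the shape of the comparison.
transitions = [1_000_000, 5_000_000]        # 256b-loop thread
sibling_anomalies = [1_000_400, 3_200_000]  # other hardware thread
print(matched_transitions(transitions, sibling_anomalies, window=1_000))
```

If most transition timestamps find a match and unrelated anomalies on the sibling thread are rare, that would support the idea that the whole core halts. A wide window makes coincidental matches more likely, so it should be kept close to the length of the transition itself.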

John D. McCalpin wrote:
I added an outer loop with a (non-256-bit) "spinner" to see how long it takes for the processor to revert to initial behavior. If the spinner between outer loop iterations was less than 1 millisecond, the subsequent inner iterations ran at full speed. If the spinner between outer loop iterations was more than 1 millisecond, the subsequent inner iterations showed the behavior above.
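A sweep like the one described here reduces to bracketing a threshold. The sketch below assumes each run has been summarized as a pair (spin duration in microseconds, whether the warm-up behavior reappeared). The function name and the sample sweep are hypothetical and not taken from the original measurements.

```python
def threshold_bracket(results):
    """Bracket the idle time after which warm-up behavior returns.

    `results` is an iterable of (spin_duration_us, warmup_seen) pairs.
    Returns (longest spin with no warm-up, shortest spin with warm-up);
    either element is None if that outcome never occurred.
    """
    results = list(results)
    fast = [d for d, warm in results if not warm]
    slow = [d for d, warm in results if warm]
    longest_fast = max(fast) if fast else None
    shortest_slow = min(slow) if slow else None
    return longest_fast, shortest_slow


# Made-up sweep, shaped to agree with the ~1 ms figure in the post.
sweep = [(100, False), (500, False), (900, False), (1100, True), (2000, True)]
print(threshold_bracket(sweep))
```

If the first value comes out larger than the second, the outcomes are not cleanly separated by spin duration, and the sweep probably needs repeating with more samples near the boundary.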

John D. McCalpin wrote:
This behavior occurs even if the core frequency is bound to any of the available frequencies (except perhaps the lowest frequency -- I need to go back and double-check those results). Performance counters showed that the core was running at the requested frequency in each case (comparing actual and reference cycles gave the expected ratio).
There are no kernel cycles, even during the transition.
The performance counters for micro-ops dispatched to the various ports show only very minor differences between the "warm-up" and "normal" cycles.
I could not find *any* performance counters (other than cycles) that could distinguish between the 1/4-speed "warm-up" and "normal" operations (but I have not tried all of them).

I'm not surprised that this is unrelated to frequency. Power-gating the upper128 of the vector units is a win at any frequency. Saving power at max frequency allows you to stay at max turbo longer. (Not to mention battery life.)
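For reference, the frequency check in the quoted text (comparing actual and reference cycles) amounts to the arithmetic below. This is a sketch under assumptions: that the two counts are unhalted core cycles and unhalted reference cycles over the same interval, and that the reference clock frequency is known. The function names, the tolerance, and the sample numbers are made up.

```python
def effective_frequency_ghz(actual_cycles, reference_cycles, reference_ghz):
    """Average core frequency implied by the actual/reference cycle ratio."""
    return reference_ghz * actual_cycles / reference_cycles


def matches_requested(actual_cycles, reference_cycles, reference_ghz,
                      requested_ghz, rel_tol=0.02):
    """True if the implied frequency is within rel_tol of the requested one."""
    measured = effective_frequency_ghz(actual_cycles, reference_cycles, reference_ghz)
    return abs(measured - requested_ghz) <= rel_tol * requested_ghz


# Hypothetical counts: a 2.5 GHz reference clock and a core pinned at 1.2 GHz.
print(effective_frequency_ghz(1_200_000, 2_500_000, 2.5))
print(matches_requested(1_200_000, 2_500_000, 2.5, requested_ghz=1.2))
```

Because both counts stop while the core is halted, the ratio reflects the clock speed only while the core is running, so a long halt during the transition iteration would not by itself distort it.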

I think one compelling reason for doing it at a low level inside the execution units, rather than with special uops, is that Intel CPUs that support AVX also have a uop cache. You don't want to have to mark lines in the uop cache as "decoded for 128b-emulation" vs. "decoded for 256b vector units", and then potentially re-decode after powering up / down the upper128.

OTOH, extra uops could be generated on the fly in the scheduler that follows the ROB (re-order buffer), when uops are converted from fused-domain to unfused domain. If that's how it works, these uops could be flagged as "internally generated" so the perf counters don't count them. They may need to be flagged this way anyway, for things to work correctly. I doubt Intel would add extra complexity just for perf-counter bookkeeping to hide the internals. You did look at all the different uop issue / execute / retire counters, some of which count in the fused domain, and some of which count unfused uops, right? You said you looked at uops dispatched to ports, so I guess that should cover the unfused domain.

As you point out, it's a bit surprising that perf is worse than half. Pentium M had 64b execution units, and took longer for 128b vector ops, but only about twice as long. In that case, though, 128b vector ops decoded to 2 uops, instead of having shuffling within the execution unit. Maybe this emulation mode isn't fully pipelined, or the unusual latency creates write-back conflicts?
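To put a number on "worse than half", here is the arithmetic using the approximate figures quoted earlier (about 2200 cycles per iteration in emulation mode versus 512 in normal mode), taken at face value. The two-pass model is only my simplifying assumption of what a straightforward split of each 256b op into two 128b halves would cost.

```python
def slowdown(slow_cycles, fast_cycles):
    """How many times longer the slow mode takes than the fast mode."""
    return slow_cycles / fast_cycles


emulation_cycles = 2200  # approximate, from the quoted measurements
normal_cycles = 512      # from the quoted measurements

ratio = slowdown(emulation_cycles, normal_cycles)
print(f"emulation is {ratio:.1f}x slower than normal mode")
print(f"that is {ratio / 2:.1f}x worse than an ideal two-pass split")
```

The measured ratio is a little over 4, so the emulation mode costs roughly twice what simple double-pumping through 128b units would predict.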

If emulation mode was fairly efficient, the upper128 might never need to power on for 256b code that was limited by memory bandwidth, frontend (ROB not filling up), or insn latency rather than throughput. Even 1/4 perf might still be efficient enough for some cases.

Maybe it would have taken more transistors to make emulation mode faster, and they decided it wasn't worth it to speed up the slow mode and be less aggressive in powering up the upper128.

 