Agner's CPU blog

Optimization manuals updated
Author: Agner Date: 2013-09-04 11:10

The optimization manuals at www.agner.org/optimize/#manuals have now been updated. The most important additions are:

  • AMD Piledriver and Jaguar processors are now described in the microarchitecture manual and the instruction tables.
  • Intel Ivy Bridge and Haswell processors are now described in the microarchitecture manual and the instruction tables.
  • The micro-op cache of Intel processors is analyzed in more detail.
  • The assembly manual has more information on the AVX2 instruction set.
  • The C++ manual describes the use of my vector classes for writing parallel code.

Some interesting test results for the newly tested processors:

AMD Piledriver

  • Similar microarchitecture to Bulldozer
  • Supports fused multiply-and-add instructions in both the FMA3 and FMA4 form. FMA3 is compatible with Intel processors. See Wikipedia for a discussion of the incompatibility between these instruction sets.
  • The throughput of FMA3 instructions is only half that of FMA4 instructions, even though they do exactly the same calculations.
  • Memory writes with the 256-bit AVX registers are exceptionally slow. The measured throughput is 5-6 times slower than on the previous model (Bulldozer), and 8-9 times slower than two 128-bit writes. No explanation for this has been found. This design flaw is likely to negate any advantage of using the AVX instruction set.
  • The problems with cache performance on the Bulldozer seem to have been fixed in the Piledriver
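
The slow 256-bit writes suggest one possible workaround: store a 256-bit vector as two 128-bit halves. The following is only a minimal sketch, not something taken from the manuals. The helper name store8_as_two_halves is hypothetical, and the code falls back to a plain copy when the compiler is not targeting AVX.

```cpp
#include <cstring>        // std::memcpy for the portable fallback
#if defined(__AVX__)
#include <immintrin.h>    // AVX intrinsics
#endif

// Hypothetical helper: copy 8 floats (256 bits) from src to dst,
// writing the result as two 128-bit stores instead of one 256-bit store.
void store8_as_two_halves(float* dst, const float* src) {
#if defined(__AVX__)
    __m256 v  = _mm256_loadu_ps(src);          // one 256-bit read
    __m128 lo = _mm256_castps256_ps128(v);     // lower half, no instruction generated
    __m128 hi = _mm256_extractf128_ps(v, 1);   // upper half
    _mm_storeu_ps(dst,     lo);                // first 128-bit write
    _mm_storeu_ps(dst + 4, hi);                // second 128-bit write
#else
    std::memcpy(dst, src, 8 * sizeof(float));  // AVX not enabled at compile time
#endif
}
```

Whether this actually helps has to be measured on the processor in question. A compiler may split or merge stores on its own, so it is worth checking the generated assembly.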

AMD Jaguar

  • Similar microarchitecture to Bobcat
  • Supports the AVX instruction set
  • Does not support AMD's 3DNow and XOP instruction sets. This is OK with me since few programmers would care to make a special version of their code specifically for AMD processors.
  • The vector execution units are doubled in size from 64 bits in Bobcat to 128 bits in Jaguar. The throughput of vector instructions is doubled. Floating point scalar (non-vector) performance was quite good already on the Bobcat and is unchanged on the Jaguar.
  • Load and store units are also doubled from 64 bits to 128 bits.
  • Store-to-load forwarding is much faster than on Bobcat
  • The prefetch instruction is particularly slow on Jaguar. The throughput is much lower than on other AMD processors.
  • Integer division is improved
  • Register moves with vector registers are eliminated if the register is known by the processor to be zero. Register moves are not eliminated if the value of the register is unknown. This seems to indicate that registers are not allocated if they are known to be zero.
  • The VMASKMOVPS instruction with a memory source operand takes more than 300 clock cycles on the Jaguar when the mask is zero, in which case the instruction should do nothing. This appears to be a design flaw. This instruction is not very common, though.
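
Because support for AVX, FMA3, FMA4 and XOP differs between the processors described here, code that wants to use them needs a run-time check. The sketch below decodes the relevant CPUID bits. The struct and function names are hypothetical, and the <cpuid.h> helper is specific to GCC and Clang. It deliberately leaves out the check that the operating system saves the YMM registers (OSXSAVE/XGETBV), which real dispatch code also needs before using AVX.

```cpp
#include <cstdint>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   // GCC/Clang wrapper for the CPUID instruction
#endif

// Hypothetical container for the features discussed in this post.
struct CpuFeatures {
    bool avx;
    bool fma3;
    bool fma4;
    bool xop;
};

// Decode feature flags from raw CPUID register values.
// leaf1_ecx: ECX returned by CPUID leaf 1
// ext1_ecx:  ECX returned by CPUID leaf 0x80000001 (AMD extended features)
inline CpuFeatures decode_features(std::uint32_t leaf1_ecx, std::uint32_t ext1_ecx) {
    CpuFeatures f;
    f.avx  = ((leaf1_ecx >> 28) & 1u) != 0;  // AVX:  leaf 1, ECX bit 28
    f.fma3 = ((leaf1_ecx >> 12) & 1u) != 0;  // FMA3: leaf 1, ECX bit 12
    f.fma4 = ((ext1_ecx  >> 16) & 1u) != 0;  // FMA4: leaf 0x80000001, ECX bit 16
    f.xop  = ((ext1_ecx  >> 11) & 1u) != 0;  // XOP:  leaf 0x80000001, ECX bit 11
    return f;
}

// Query the processor the program is running on.
// Reports no features when not compiled for x86.
inline CpuFeatures query_features() {
    std::uint32_t leaf1_ecx = 0, ext1_ecx = 0;
#if defined(__x86_64__) || defined(__i386__)
    unsigned int a = 0, b = 0, c = 0, d = 0;
    if (__get_cpuid(1, &a, &b, &c, &d)) leaf1_ecx = c;
    if (__get_cpuid(0x80000001u, &a, &b, &c, &d)) ext1_ecx = c;
#endif
    return decode_features(leaf1_ecx, ext1_ecx);
}
```

A dispatcher could call query_features() once at startup and pick a code path from the result, for example using the FMA4 version only when fma4 is set.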

Intel Ivy Bridge

  • Similar microarchitecture to Sandy Bridge
  • Can eliminate register-to-register moves by renaming the target register
  • Problem with decoding long NOPs in Sandy Bridge has been fixed
  • Some execution units have been moved to a different port
  • Handling of partial registers is improved
  • The prefetch instructions are particularly slow on Ivy Bridge. The throughput is much lower than on other Intel processors.
  • Store-to-load forwarding is generally good, but in some unfortunate cases of an unaligned 256-bit read after a smaller write, there is an unusually large delay of more than 200 clock cycles.

Intel Haswell

  • Supports the new AVX2 instruction set, which allows integer vectors of 256 bits and gather instructions
  • Supports fused multiply-and-add instructions of the FMA3 type
  • The cache bandwidth is doubled to 256 bits. It can do two reads and one write per clock cycle.
  • Cache bank conflicts have been removed
  • The read and write buffers, register files, reorder buffer and reservation station are all bigger than in previous processors
  • There are more execution units and one more execution port than on previous processors. This makes a throughput of four instructions per clock cycle quite realistic in many cases.
  • The throughput for not-taken branches is doubled to two not-taken branches per clock cycle, including fused branch instructions. The throughput for taken branches is largely unchanged.
  • There are two execution units for floating point multiplication and for fused multiply-and-add, but only one execution unit for floating point addition. This design appears to be suboptimal since floating point code typically contains more additions than multiplications. But at least it enables Intel to boast a peak of 32 floating point operations per clock cycle.
  • The fused multiply-and-add operation is the first case in the history of Intel processors of micro-ops having more than two input dependencies. Other instructions with more than two input dependencies are still split into two micro-ops, though. AMD processors don't have this limitation.
  • The delays for moving data between different execution units are smaller than on previous Intel processors in many cases.
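
The figure of 32 floating point operations per clock cycle can be reproduced with simple arithmetic. This assumes it refers to single precision and that each fused multiply-and-add counts as two operations; the post does not spell this out. The function name below is hypothetical.

```cpp
#include <cassert>

// Peak floating point operations per clock cycle for one core.
// units:           execution units that can start the instruction each cycle
// vector_bits:     vector register width in bits
// element_bits:    32 for single precision, 64 for double precision
// ops_per_element: 2 for fused multiply-and-add (multiply plus add), otherwise 1
constexpr int peak_flops_per_cycle(int units, int vector_bits,
                                   int element_bits, int ops_per_element) {
    return units * (vector_bits / element_bits) * ops_per_element;
}

// Two FMA units, 256-bit vectors, single precision: 2 * 8 * 2 = 32
static_assert(peak_flops_per_cycle(2, 256, 32, 2) == 32, "single precision FMA");
// Same units in double precision: 2 * 4 * 2 = 16
static_assert(peak_flops_per_cycle(2, 256, 64, 2) == 16, "double precision FMA");
// One addition unit, single precision: 1 * 8 * 1 = 8
static_assert(peak_flops_per_cycle(1, 256, 32, 1) == 8, "single precision add");
```

Under the same assumptions, code that consists only of additions reaches a quarter of the peak, which is the imbalance the list above points out.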
 