Agner's CPU blog


Optimization manuals updated - Silvermont test
Author: Agner Date: 2014-08-08 06:02
My manuals have now finally been updated with a test of Intel's Silvermont (Bay Trail) processor.

Intel's old low-power processor named Atom has now finally got a major update after several years in service. The new design called Silvermont is a small low-power processor intended mainly for mobile devices and as a competitor for ARM machines.

The Silvermont still contains traces of the old Atom design, but almost everything has been improved or redesigned. The chip has one or more units with two cores each. The two cores in a unit share the same level-2 cache but they have separate execution resources. Thus, there is no competition for execution resources between threads.

The Silvermont supports the SSE4.2 instruction set, but not AVX or AVX2. It has a throughput of two instructions per clock cycle. There are two execution pipes for integer instructions, two for floating point and vector instructions, and one for memory read and write. Internal buses and execution units are 128 bits wide. Most execution units are pipelined, but some operations stay in the same pipeline stage for two (rarely four) clock cycles in cases of large data sizes or high precision.

The high-end processors from Intel and AMD have powerful capabilities for out-of-order execution, while the old Atom executes all instructions in program order. The Silvermont is a compromise between these two. It has some out-of-order execution, but not much. Integer instructions in general purpose registers can execute out of order with a depth of at most eight instructions. Floating point and vector instructions cannot execute out of order with respect to other instructions in the same floating point pipeline, of which there are two. There is full register renaming.

The cache size is reasonable: 32 kB level-1 code, 24 kB level-1 data, 1 MB level-2. Cache latencies were 3 and 19 clock cycles for level-1 data and level-2, respectively, in my measurements, and the cache performance is generally good.
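To get a feeling for what these two latencies mean for code whose data mostly, but not entirely, fits in the level-1 cache, the numbers can be put into a simple weighted average. The following Python sketch is only an illustration. It takes 3 cycles as the level-1 data latency and 19 cycles as the level-2 latency, and it assumes that every level-1 miss hits in the level-2 cache, which ignores main memory entirely. The function name and the hit rates are made up for the example and are not measured values.

```python
# Cache latencies quoted above, in clock cycles.
L1_LATENCY = 3
L2_LATENCY = 19


def average_load_latency(l1_hit_rate):
    """Weighted average latency, assuming every L1 miss hits in L2.

    l1_hit_rate is a hypothetical input between 0 and 1; it is not a
    measured value.
    """
    if not 0.0 <= l1_hit_rate <= 1.0:
        raise ValueError("l1_hit_rate must be between 0 and 1")
    return l1_hit_rate * L1_LATENCY + (1.0 - l1_hit_rate) * L2_LATENCY


if __name__ == "__main__":
    # Hypothetical hit rates, chosen only to show the trend.
    for rate in (1.0, 0.95, 0.80):
        cycles = average_load_latency(rate)
        print(f"L1 hit rate {rate:.2f}: about {cycles:.1f} cycles per load")
```

Under these assumptions even a modest miss rate moves the average noticeably, because the level-2 latency is more than six times the level-1 latency.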

The whole design seems well proportioned, with reasonable capacities for a low-power chip in all stages of the pipeline, except for one very big bottleneck: the decoders. Simple instructions can decode at a rate of two instructions per clock cycle, but there are quite a lot of instructions that the decoders cannot handle so smoothly. Instructions that generate more than one micro-operation, as well as instructions with certain combinations of prefixes and escape codes, take four, six or even more clock cycles to decode. In many of my test cases I was unable to determine the latency and throughput of the execution units for certain instructions because the decoders lagged far behind the execution units. The designers have already removed the common bottleneck of instruction-length decoding by marking instruction boundaries in the code cache (a technique that Intel hasn't used since the Pentium MMX seventeen years earlier). It should be possible to remove the unfortunate bottleneck in the decoders without increasing power consumption too much. Let's hope that Intel will have solved this problem in the next version of the Silvermont, as well as in the forthcoming Knights Landing coprocessor, which is rumored to be based on the Silvermont architecture.
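The effect of slow decoding on overall throughput can be estimated with a back-of-the-envelope calculation. The Python sketch below is a deliberately crude model, not a description of the actual decoder. It assumes that decoding is the only limit, that simple instructions always pair at two per clock, and that each slow instruction occupies the decoders alone for a fixed number of clocks. The function names and the example instruction mix are hypothetical.

```python
SIMPLE_PER_CLOCK = 2  # decode rate for simple instructions, from the text


def decode_cycles(n_simple, n_slow, slow_cost=4):
    """Estimated clocks to decode a block of instructions.

    slow_cost is the assumed decode time of one slow instruction
    (the text mentions four, six or even more clocks).
    """
    if n_simple < 0 or n_slow < 0 or slow_cost <= 0:
        raise ValueError("counts must be non-negative and slow_cost positive")
    return n_simple / SIMPLE_PER_CLOCK + n_slow * slow_cost


def decode_limited_ipc(n_simple, n_slow, slow_cost=4):
    """Instructions per clock if decoding were the only bottleneck."""
    total = n_simple + n_slow
    if total == 0:
        raise ValueError("need at least one instruction")
    return total / decode_cycles(n_simple, n_slow, slow_cost)


if __name__ == "__main__":
    # Hypothetical mix: 100 instructions, 10 of them slow to decode.
    for cost in (4, 6):
        ipc = decode_limited_ipc(90, 10, slow_cost=cost)
        print(f"slow_cost={cost}: about {ipc:.2f} instructions per clock")
```

In this model a mix with only 10% slow instructions at four clocks each already drops from 2 to about 1.18 instructions per clock, and at six clocks each it falls below one instruction per clock. Real code will behave differently, but the calculation shows why a small share of hard-to-decode instructions can dominate.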

Other news in my manuals includes calling conventions for the forthcoming AVX512 instruction set, and an update on how to circumvent Intel's CPU dispatcher for Intel compiler version 14.

 