Agner's CPU blog


 
thread Moore's law hits the roof - Agner - 2015-12-26
reply Moore's law hits the roof - bvlad - 2015-12-28
last reply Moore's law hits the roof - anon2718 - 2016-03-15
 
Moore's law hits the roof
Author: Agner Date: 2015-12-26 09:02
Over the last 40 years we have seen the speed of computers growing exponentially. Today's computers have a clock frequency a thousand times higher than the first personal computers in the early 1980's. The amount of RAM on a computer has increased by a factor of ten thousand, and hard disk capacity has increased more than a hundred thousand times. We have become so used to this continued growth that we almost consider it a law of nature, known as Moore's law. But there are limits to growth, as Gordon Moore himself also points out [1]. We are now approaching the physical limit where computing speed is limited by the size of an atom and the speed of light.

Intel's iconic Tick-Tock clock has begun to skip a beat now and then. Every Tick is a shrinking of the transistor size, and every Tock is an improvement of the microarchitecture. The current processor generation, called Skylake, is a Tock with a 14 nanometer process. The next in sequence would logically be a Tick with a 10 nanometer process, but Intel is now putting "refresh cycles" after the Tocks. The next processor, announced for 2016, will be a refresh of Skylake, still with a 14 nanometer process [2]. This slowdown of the Tick-Tock clock is a physical necessity, because we are approaching the limit where a transistor is only a few atoms wide (a silicon atom is 0.2 nanometers).

Another physical limit is the speed of data transmission which cannot exceed the speed of light. It takes several clock cycles for data to flow from one end of a CPU chip to the other. As the chips get bigger with more and more transistors, we are seeing the speed being limited by the data transmission across the chip.

Technological limitations are not the only thing that slows down the evolution of CPU chips. Another factor is reduced market competition. Intel's major competitor, AMD, is now focusing more on what they call APUs (Accelerated Processing Units), i.e. smaller CPU chips with integrated graphics processors for mini-PCs, tablets and other ultra-mobile devices. Intel now has the all-dominating market share for high-end desktop and server CPUs. The fierce competition between Intel and AMD that drove the development of x86 CPUs through several decades is now losing much of its force.

The improvements in computing power in recent years have come less from increased processing speed and more from increased parallelism. There are three kinds of parallelism being employed in modern microprocessors:

  1. Out-of-order execution of multiple instructions simultaneously.
  2. Single Instruction, Multiple Data (SIMD) operations in vector registers.
  3. Multiple CPU cores on the same chip.

These kinds of parallelism have no theoretical limits, but they do have real practical limits. Out-of-order execution is limited by the number of independent instructions in the software code: you cannot execute two instructions simultaneously if the second instruction depends on the output of the first. Current CPUs can typically execute four instructions simultaneously. There would be little advantage in increasing this number further, because the CPU could rarely find more independent instructions that it could execute simultaneously.

Current processors with the AVX2 instruction set have 16 vector registers of 256 bits each. The forthcoming AVX-512 instruction set gives us 32 vector registers of 512 bits each, and we can expect future extensions to 1024-bit or 2048-bit vectors. But these increases in vector size are subject to diminishing returns. Few calculation tasks have enough inherent parallelism to take full advantage of the bigger vector registers. The 512-bit vector registers are connected with a set of mask registers, which have a size limit of 64 bits. A 2048-bit vector register can hold 64 single-precision floating point numbers of 32 bits each, one per mask bit. We can assume that Intel has no plans of making vector registers bigger than 2048 bits, because they would not fit the 64-bit mask registers.

Multiple CPU cores are advantageous only if there are multiple speed-critical programs running simultaneously, or if a time-consuming task can be divided into multiple independent threads. There is always a limit to how many threads a task can profitably be divided into.

The industry will no doubt keep trying to make more and more powerful computers, but what are the possibilities for obtaining still more computing power?

There is a fourth possibility for parallelism which has so far not been implemented. Software is typically full of if-else branches, and the CPU goes to great lengths to predict which of the two branches will be taken, so that it can feed the predicted branch into the pipeline. It would be possible to execute both branches speculatively at the same time, in order to avoid losing time when a branch prediction is mistaken. This, of course, would come at the cost of higher power consumption.

Another possible improvement is to include a programmable logic device on the CPU chip. The combination of a CPU and programmable logic is already common in FPGAs with embedded CPU cores, used in advanced equipment. Such programmable logic devices in personal computers could be used for implementing application-specific functions for tasks like image processing, encryption, data compression and neural networks.

The semiconductor industry is experimenting with alternatives to silicon. Some III-V semiconductor materials can run at lower voltages and higher frequencies than silicon [3], but these materials are not making atoms smaller or the speed of light faster. The physical limits are still there.

Some day we may see 3-dimensional chips with many layers of transistors. This will make the circuits more compact with smaller distances and thus smaller delays. But how can you cool such a chip effectively when power is dissipated everywhere inside the chip? New cooling technology will be needed. The chip cannot supply power to all of its circuits at the same time without overheating. It will have to turn off most of its parts most of the time and supply power to each part only when it is in use.

The speed of CPUs has increased more than the speed of RAM memory in recent years. RAM speed is now often a serious bottleneck. We will no doubt see many attempts to increase RAM speed in the future. A likely development will be to put the RAM memory on the same chip as the CPU (or at least in the same housing) in order to decrease the distances for data transmission. This would be a useful application for 3-dimensional chips. The RAM would probably have to be a static type so that each RAM cell uses power only when it is accessed.

Intel is also catering to the market for supercomputers for scientific use. Intel's Knights Corner processor has up to 61 CPU cores on a single chip. Knights Corner has a poor performance/price ratio, but its successor, Knights Landing, is expected to be better, with up to 72 cores and out-of-order processing capabilities. This is a small niche market, but it may give Intel some extra prestige.

The biggest potential for improved performance is now, as I see it, on the software side. The software industry has been quick to utilize the exponentially growing computing power provided by Moore's law, relying on more and more advanced development tools and software frameworks. These high-level tools and frameworks have made it possible to develop new software products faster, but at the cost of consuming more processing power in the end product. Many of today's software products are quite wasteful in their excessive consumption of hardware computing power.

For many years, we have seen a symbiosis between the hardware industry and the software industry, where the software industry has produced ever-more advanced and demanding software that prompts consumers to buy ever-more powerful hardware. As the rate of growth in hardware technology slows down, and consumers turn to small portable devices where battery life is more important than number-crunching power, the software industry has to change its course. It has to cut down on resource-hungry development tools and multilayer software, and develop less feature-bloated products that take longer to develop but use fewer hardware resources and run faster on small low-power portable devices. If the commercial software industry fails to change its course now, it will quite likely lose market share to slimmer open source products.

References:

1. Rachel Courtland: Gordon Moore: The Man Whose Name Means Progress. IEEE Spectrum 30 Mar 2015.

2. Joel Hruska: Intel confirms 10nm delayed to 2017, will introduce ‘Kaby Lake’ at 14nm to fill gap. Extreme Tech, July 16, 2015.

3. Joel Hruska: Analyst: Intel will adopt quantum wells, III-V semiconductors at 10nm node. Extreme Tech, April 23, 2015.

   
Moore's law hits the roof
Author: bvlad Date: 2015-12-28 08:11
>There is a fourth possibility for parallelism which has so far not been implemented. Software is typically full of if-else branches, and the CPUs are going to great lengths to predict which of the two branches it will take so that it can feed the predicted branch into the pipeline. It would be possible to execute multiple branches of code speculatively at the same time in order to avoid losing time when a branch prediction is mistaken. This, of course, will be at the cost of higher power consumption.

This has been implemented by Intel Itanium.

   
Moore's law hits the roof
Author: anon2718 Date: 2016-03-15 19:23
Agner wrote:
These kinds of parallelism have no theoretical limits
There are very real theoretical limits to multiprocessing. In particular, you can only access O(time^3) (abuse of notation, I know) "computational resources" (be they cores, units of memory, etc.) - and that's assuming 3D construction, since a signal travelling for a given time can only reach resources within a sphere whose radius grows linearly with time. With 2D construction, you can only access O(time^2) "computational resources".

It's even worse when you consider that heat / power / information transfer / structural strength scale with area, not volume. At large enough scales, it actually bottlenecks at O(time^2) due to these effects.

Note that this may require reconsiderations as to which algorithms to use for various things. Anything that requires a "constant-time" lookup from anything larger than o(n^2) elements is no longer actually constant-time, for instance.