Vector Class Discussion

New C++ vector class library
Author: Matthias  Date: 2012-06-24 16:23
Nick wrote:
Agner wrote:
The GPU is still much faster than a CPU with AVX2, so OpenCL will still be useful for some purposes with massively parallel data.
I don't think that's generally true. A quad-core Haswell GT2 chip will have close to 500 GFLOPS on the CPU side, but only about 400 GFLOPS on the GPU side. Also, homogeneous computing is inherently more efficient due to requiring less data movement. And a GPU can easily stall due to a lack of out-of-order execution, insufficient parallelism, register space limits, and/or (shared) data bandwidth bottlenecks. The GTX 680 (3 TFLOPS) even fails to outperform an i7-3820 (230 GFLOPS): LuxMark OpenCL rendering!

So the raw computing power is a bad indicator for effective performance. It will be much easier to achieve good performance out of AVX2.
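For reference, the quoted CPU figure roughly follows from peak-throughput arithmetic. A back-of-the-envelope estimate, assuming 4 Haswell cores with 2 FMA ports each, 8 single-precision lanes per 256-bit register and a clock around 3.5 GHz:

    4 cores x 2 FMA issues/cycle x 8 lanes x 2 FLOPs (mul+add) x 3.5 GHz ≈ 450 GFLOPS

Real code rarely sustains anything close to that peak, on either kind of hardware.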

In the linked benchmark results the same GTX 680 is far behind the GCN-based 7970 (283 vs. 1010). Even the older 5870 and 580 are significantly faster. Either the LuxMark code is hitting some specific Kepler bottleneck, or it simply needs targeted optimizations for each new architecture. Neither of these explanations would invalidate an advantage of GPGPU in this particular benchmark, especially since the 680's GPU is targeted at gaming, while the GK110 is optimized for computing.

The main point of using different kinds of architectures in combination is not a competition in raw flops, but simply having the right computation hardware in place for different types of problems. Highly tuned CPUs have a lot of overhead per processed instruction; they do fine with serial or branchy code and at resolving dependencies. SIMD was a way to reduce that per-instruction overhead (which also means power consumption) by spreading it over a bunch of similar operations, so the longer the SIMD vectors, the more power-efficient the computations. But GPUs, which are more and more optimized for general computation, can do a lot of operations in parallel and keep the per-operation overhead really low.
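To make the per-instruction-overhead point concrete with the library this thread is about: a minimal sketch, assuming AVX is enabled, the Vec8f type with its load()/store() members from vectorclass.h, and an array length that is a multiple of 8 (the function names are just for illustration).

    #include "vectorclass.h"   // Agner's vector class library

    // Scalar version: the full fetch/decode/retire overhead is paid
    // for every single element.
    void saxpy_scalar(float a, const float* x, float* y, int n) {
        for (int i = 0; i < n; i++) {
            y[i] = a * x[i] + y[i];
        }
    }

    // Vectorized version: the same front-end overhead is now spread
    // over 8 elements per instruction (n assumed to be a multiple of 8).
    void saxpy_vec8f(float a, const float* x, float* y, int n) {
        Vec8f av(a);                       // broadcast the scalar once
        for (int i = 0; i < n; i += 8) {
            Vec8f xv, yv;
            xv.load(x + i);
            yv.load(y + i);
            yv = av * xv + yv;             // one multiply + add per 8 lanes
            yv.store(y + i);
        }
    }

With AVX2/FMA the multiply and add can even fuse into a single instruction, which is where the doubled peak throughput in the numbers above comes from.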

There will be FP code where a CPU reaches high efficiency (e.g. Linpack, DGEMM), but there will also be code with much lower throughput. And since CPUs only process a few threads at once, and only 2 per core, they have difficulty keeping the throughput high with such code. A GPU might drop in efficiency when it has to execute branches and mask off computations, but in general it provides a lot of raw memory and cache bandwidth, lots of registers and ALUs, and it keeps very many threads in flight so that it can switch to another thread when one stalls.
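That masking is not GPU-specific; the same trick appears when vectorizing branchy code on the CPU. A rough sketch, assuming VCL's per-lane comparison operators, its Vec8fb mask type and its select() blend function (the function name is just for illustration):

    #include "vectorclass.h"

    // Vectorized form of: y = (x > 0) ? sqrt(x) : 0
    // Both sides are evaluated for all 8 lanes and the mask discards the
    // unwanted results - the same kind of wasted work a GPU pays for a
    // divergent branch.
    Vec8f sqrt_or_zero(Vec8f x) {
        Vec8fb positive = x > Vec8f(0.0f);          // per-lane boolean mask
        return select(positive, sqrt(x), Vec8f(0.0f));
    }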

I added some known numbers and estimated capabilities of a Haswell core to the table in the linked RWT article.

And here are some presentations by Intel which show how they use OpenCL on their CPUs:
Presentation on SIGGRAPH 2010
Intel OpenCL SDK Vectorizer
 