Vector Class Discussion

 
New C++ vector class library - Agner - 2012-05-30
  New C++ vector class library - AVK - 2012-06-04
    New C++ vector class library - Agner - 2012-06-05
    New C++ vector class library - Nick - 2012-06-13
      New C++ vector class library - Agner - 2012-06-13
        New C++ vector class library - Nick - 2012-06-13
          New C++ vector class library - Matthias - 2012-06-24
  New C++ vector class library - Stefan - 2012-06-08
 
New C++ vector class library
Author: Agner Date: 2012-05-30 09:08
Great news. I have made a new vector class library that makes it easier to use the vector instruction sets from SSE2 to AVX and AVX2. It's a C++ library that defines a lot of vector classes, functions and operators. Adding two vectors is as simple as writing a + sign instead of using assembly code or intrinsic functions. This is useful where the compiler doesn't vectorize your code automatically. The resulting code has no extra overhead when compiled with an optimizing compiler.
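For illustration, here is a minimal sketch of adding two float arrays with the library, using the vectorclass.h header and the Vec4f class with its load() and store() members; the add_arrays function itself and the assumption that n is a multiple of 4 are just for the example:

// Element-wise addition of two float arrays using 128-bit vectors.
#include "vectorclass.h"

void add_arrays(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        Vec4f va, vb;
        va.load(a + i);        // read 4 floats from a
        vb.load(b + i);        // read 4 floats from b
        Vec4f vc = va + vb;    // plain + operator instead of intrinsics
        vc.store(c + i);       // write 4 results to c
    }
}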

This library has many more features than Intel's vector classes:

Features

  • vectors of 8, 16, 32 and 64-bit integers, signed and unsigned
  • vectors of single and double precision floating point numbers
  • total vector size 128 or 256 bits
  • defines almost all common operators
  • boolean operations and branches on vector elements (see the sketch after this list)
  • defines many arithmetic functions
  • permute, blend and table-lookup functions
  • fast integer division
  • many mathematical functions (requires external library)
  • can build code for different instruction sets from the same source code
  • CPU dispatching to utilize higher instruction sets when available
  • uses metaprogramming (including preprocessing directives and templates) to find the best implementation for the selected instruction set and parameter values of a given operator or function
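
As an example of the "boolean operations and branches on vector elements" item above, a scalar branch such as if (x < 0) x = 0; can be written branch-free for a whole vector. A minimal sketch, using the Vec4f class and the select() function from the library (the clamp_negative_to_zero function is just for the example):

// Per-element branch without actual branching: negative elements become zero.
#include "vectorclass.h"

Vec4f clamp_negative_to_zero(Vec4f x) {
    // The comparison yields a boolean vector with one true/false per element;
    // select() picks, per element, the second argument where the condition
    // is true and the third argument where it is false.
    return select(x < Vec4f(0.0f), Vec4f(0.0f), x);
}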

Take a look at www.agner.org/optimize/#vectorclass and have fun!

   
New C++ vector class library
Author: AVK Date: 2012-06-04 04:32
Nice work, but, frankly, this library should have been released sometime in 1998-99, when AMD and Intel released their first CPUs with 3DNow! and SSE, respectively. Nowadays, OpenCL is much more relevant.
   
New C++ vector class library
Author: Agner Date: 2012-06-05 00:52
AVK wrote:
Nice work, but, frankly, this library should have been released sometime in 1998-99, when AMD and Intel released their first CPUs with 3DNow! and SSE, respectively. Nowadays, OpenCL is much more relevant.
I agree that this should have been done long ago. In fact, Intel did publish their vector class header files several years ago, but they didn't have enough features to really be useful.

OpenCL is a cross-platform language while my vector class library is carefully tweaked to the x86 and x86-64 instruction set extensions SSE2 through AVX2. It is intended as a tool to make it easier to use these specific instruction set extensions and get the maximum performance out of them without adding any extra runtime overhead. It works with some of the best optimizing C++ compilers available and can be combined with existing C++ code.

The advantage of OpenCL is that it can utilize the GPU. The main disadvantage is that it is a separate programming language and it doesn't have C++ constructs. A cross-platform language always has to make compromises for the sake of generality and portability.

OpenCL allows you to use types and vector sizes that don't fit the hardware registers. It even allows you to use types that work poorly on a specific platform, such as double precision on platforms that support only single precision.

In conclusion, OpenCL and my vector class library serve very different purposes.

   
New C++ vector class library
Author: Nick Date: 2012-06-13 00:15
AVK wrote:
Nice work, but, frankly, this library should have been released sometime in 1998-99, when AMD and Intel released their first CPUs with 3DNow! and SSE, respectively. Nowadays, OpenCL is much more relevant.
OpenCL might actually be short-lived once AVX2 processors become available. The thing is, AVX2 can be used from any programming language. All you need are loops with independent iterations, which can then be auto-vectorized in an SPMD fashion. So why would developers use OpenCL when they can keep using their favorite programming language and still achieve up to an eightfold increase in performance?

It won't die out overnight, but OpenCL can only evolve into a fully unrestricted generic programming language. The ironic part is that we already have that: it's called C! So why not skip that awkward "evolution to the past" and just use vectorizing compilers? With AVX2 they will be more effective than ever before, since the gather instruction enables the same kind of parallelization as is done on the GPU. And this approach can support many more languages than just C.
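For illustration, the kind of loop meant here might look like the following; every iteration is independent, and the indexed read b[idx[i]] is the access pattern that the AVX2 gather instruction supports in hardware (function and variable names are made up for the example):

// A loop with independent iterations that a vectorizing compiler can
// process in SPMD fashion; b[idx[i]] is a gather-style access.
void scale_gather(float *out, const float *b, const int *idx,
                  float scale, int n) {
    for (int i = 0; i < n; ++i) {
        out[i] = scale * b[idx[i]];   // no dependence between iterations
    }
}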

So I'm afraid OpenCL is a new standard where none was needed. We just needed instruction set extensions for homogeneous, high-throughput computing, like AVX2.

   
New C++ vector class library
Author: Agner Date: 2012-06-13 02:00
Nick wrote:
OpenCL might actually be short-lived once AVX2 processors become available.
The GPU is still much faster than a CPU with AVX2, so OpenCL will still be useful for some purposes with massively parallel data.

BTW. My vector class library supports AVX2.

   
New C++ vector class library
Author: Nick Date: 2012-06-13 08:00
Agner wrote:
The GPU is still much faster than a CPU with AVX2, so OpenCL will still be useful for some purposes with massively parallel data.
I don't think that's generally true. A quad-core Haswell GT2 chip will have close to 500 GFLOPS on the CPU side, but only about 400 GFLOPS on the GPU side. Also, homogeneous computing is inherently more efficient because it requires less data movement. And a GPU can easily stall due to a lack of out-of-order execution, insufficient parallelism, register space limits, and/or (shared) data bandwidth bottlenecks. The GTX 680 (3 TFLOPS) even fails to outperform an i7-3820 (230 GFLOPS) in the LuxMark OpenCL rendering benchmark!
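A back-of-envelope check of that CPU figure, assuming two 256-bit FMA units per core and a 3.5 GHz clock (the clock speed is an assumption, not a published spec):

  4 cores × 2 FMA units × 8 floats × 2 FLOPs × 3.5 GHz ≈ 448 GFLOPS,

which is indeed close to 500.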

So raw computing power is a bad indicator of effective performance. It will be much easier to get good performance out of AVX2.

BTW. My vector class library supports AVX2.
Yes, thanks for that, it should come in quite handy!
   
New C++ vector class library
Author: Matthias Date: 2012-06-24 16:23
Nick wrote:
Agner wrote:
The GPU is still much faster than a CPU with AVX2, so OpenCL will still be useful for some purposes with massively parallel data.
I don't think that's generally true. A quad-core Haswell GT2 chip will have close to 500 GFLOPS on the CPU side, but only about 400 GFLOPS on the GPU side. Also, homogeneous computing is inherently more efficient because it requires less data movement. And a GPU can easily stall due to a lack of out-of-order execution, insufficient parallelism, register space limits, and/or (shared) data bandwidth bottlenecks. The GTX 680 (3 TFLOPS) even fails to outperform an i7-3820 (230 GFLOPS) in the LuxMark OpenCL rendering benchmark!

So raw computing power is a bad indicator of effective performance. It will be much easier to get good performance out of AVX2.

In the linked benchmark results, the same GTX 680 is far behind the GCN-based Radeon HD 7970 (283 vs. 1010). Even the older HD 5870 and GTX 580 are significantly faster. Either the LuxMark code is hitting some specific Kepler bottleneck, or it simply needs optimizations targeted at each new architecture. Neither of these points would invalidate the advantage of GPGPU in this particular benchmark, especially since the GTX 680 is targeted at gaming, while the GK110 is optimized for computing.

The main point of using different kinds of architectures in combination is not a competition in raw FLOPS, but simply having the right computation hardware in place for different types of problems. Highly tuned CPUs have a lot of overhead per processed instruction; they do fine with serial or branchy code and at resolving dependencies. SIMD was a way to reduce that per-instruction overhead (which also means power consumption) by spreading it over a bunch of similar operations, so the longer the SIMD vectors, the more power efficient the computations. But GPUs, which are more and more optimized for computation, are able to do a lot of operations in parallel while keeping the per-operation overhead really low.

There will be FP code where a CPU reaches high efficiency (e.g. Linpack, DGEMM), but there will also be code with much lower throughput. And since CPUs process only a few threads at once, and only two per core, they have difficulty keeping throughput high with such code. A GPU might drop in efficiency when it executes branches and has to mask computations, but in general it provides a lot of raw memory and cache bandwidth, lots of registers and ALUs, and it runs very many threads in parallel so that it can execute another thread whenever one stalls.

I added some known numbers and estimated capabilities of a Haswell core to the table in the linked RWT article.

And here are some presentations by Intel which show how they use OpenCL on their CPUs:
Presentation on SIGGRAPH 2010
Intel OpenCL SDK Vectorizer
   
New C++ vector class library
Author: Stefan Date: 2012-06-08 04:44
This is just ... sweet. I have tried a few times to write something similar, but never got to this level.