Vector Class Discussion

 
thread FMA and non temporal stores - chad - 2014-09-22
last reply FMA and non temporal stores - Agner - 2014-09-24
 
FMA and non temporal stores
Author: chad Date: 2014-09-22 06:44
Hi Agner,

I noticed that you added inline implementations of the math functions to your VCL. Before you did this, the alternatives were Intel's expensive, closed-source SVML; AMD's free but closed-source LIBM, neither of which works well on the competitor's hardware; or various open-source libraries built on intrinsics, which supported only single or only double precision, or only SSE and not AVX, and so forth. Having one SIMD library that works well on both Intel and AMD, for single and double precision, across SSE, AVX, FMA3, FMA4, and AVX512, is one of the best features of your VCL.

Additionally, I noticed that you added the to_bits function, which calls the movemask instructions, to the VCL. I appreciate that.

I have one suggestion for your VCL. Would you consider adding a function which calls the non-temporal store instructions (_mm_stream_ps, _mm256_stream_ps, _mm_stream_si128, ...) to your VCL? For example, Vec8f().stream(x) would use _mm256_stream_ps(x).
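As a rough illustration of the kind of helper being requested, here is a minimal sketch written against the raw 128-bit intrinsic rather than the VCL itself. The name stream_copy is hypothetical and is not part of the VCL. It assumes the destination is 16-byte aligned and the element count is a multiple of 4, and it falls back to ordinary stores on targets without SSE. A Vec8f version would use _mm256_stream_ps in the same way, with 32-byte alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>   // _mm_loadu_ps, _mm_stream_ps, _mm_sfence
#define SKETCH_HAVE_SSE_STREAM 1
#endif

// Hypothetical helper (not a VCL function): copy n floats from src to dst
// using non-temporal stores, so the written data bypasses the cache.
// Assumptions: dst is 16-byte aligned and n is a multiple of 4.
inline void stream_copy(float* dst, const float* src, std::size_t n) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(n % 4 == 0);
#ifdef SKETCH_HAVE_SSE_STREAM
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);  // ordinary (cached) load
        _mm_stream_ps(dst + i, v);         // non-temporal store
    }
    _mm_sfence();  // make the streamed data visible before later stores
#else
    // Fallback for targets without SSE: plain stores.
    for (std::size_t i = 0; i < n; ++i) dst[i] = src[i];
#endif
}
```

The result in memory is the same as with ordinary stores; only the cache behaviour differs.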

Among my favourite additions to the VCL are the mul_add (and mul_add_x) functions. The rest of my comments are about these functions. In your manual you write:

The FMA3 and FMA4 instruction sets are not handled directly by the code in the vector class library, but by the compiler.
The compiler will automatically combine a floating point multiplication and a subsequent addition or subtraction into a
single instruction, unless you have specified a strict floating point model.

But this seems to contradict the existence of the mul_add functions. More importantly, I have never observed this in GCC with the VCL. I have observed it with GCC's own vector extensions, but the VCL is built from intrinsics, and GCC (usually) treats intrinsics like inline assembly. So the VCL never generates FMA3 instructions in GCC except through functions such as mul_add, which explicitly use the FMA intrinsics. So I don't understand this paragraph in your manual.
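To make the distinction concrete, here is a small sketch in standard C++ (scalar, not VCL code) contrasting a multiply followed by a separate add with an explicitly fused operation. The function names are hypothetical. The volatile intermediate is an assumption used only to stop the compiler from contracting the two operations into one, so the sketch behaves the same whether or not the compiler fuses on its own.

```cpp
#include <cassert>
#include <cmath>

// Multiply, round to double, then add: two roundings.
// The volatile temporary prevents the compiler from fusing the operations.
inline double mul_add_unfused(double a, double b, double c) {
    volatile double product = a * b;
    return product + c;
}

// Fused multiply-add: a*b + c is computed exactly and rounded once.
// std::fma maps to an FMA instruction when the target supports it,
// and to a software routine otherwise.
inline double mul_add_fused(double a, double b, double c) {
    return std::fma(a, b, c);
}
```

With a = 1 + 2^-27 and c = -(1 + 2^-26), the unfused version returns 0 because the term 2^-54 in a*a is lost when the product is rounded, while the fused version returns 2^-54. An explicit function such as mul_add gives the fused behaviour regardless of what the compiler decides to contract.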

It makes more sense to me to have these mul_add functions (and variants such as mul_sub and mul_add_x) as part of the core of the VCL instead of in a separate header (vectormath_common.h) which needs to be included.

Kind Regards,
Chad

   
FMA and non temporal stores
Author: Agner Date: 2014-09-24 01:10
chad wrote:
Would you consider adding a function which calls the non-temporal store instructions
That would be possible, but I don't know how useful it would be. Non-temporal stores are rarely optimal, and I wouldn't expect the average programmer to know when they are. The programmer would have to check the cache size and use non-temporal stores only when writing memory blocks bigger than half the size of the last-level cache. Writing directly to video RAM may be another application, but I don't think it is safe to use vector classes in a device driver. You are free to make your own extensions, of course, or to use the intrinsic functions directly.
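The size rule mentioned above can be written down as a one-line check. This is only a sketch of that rule of thumb: the name prefer_nontemporal is hypothetical, and the caller is assumed to obtain the last-level cache size by some platform-specific means (for example CPUID or an operating system query), which is not shown here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: apply the rule of thumb that non-temporal stores
// pay off only when the block being written is bigger than half the
// last-level cache. llc_bytes must be supplied by the caller.
inline bool prefer_nontemporal(std::size_t block_bytes, std::size_t llc_bytes) {
    return block_bytes > llc_bytes / 2;
}
```

For an 8 MB last-level cache, this selects non-temporal stores for a 5 MB block but not for a 1 MB block. The threshold is a heuristic, so measuring both variants on the target machine remains advisable.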

The compiler will automatically combine a floating point multiplication and a subsequent addition or subtraction into a single instruction ... More importantly, I have never observed this in GCC with the VCL.
You are right. GCC is not as good as I thought. The GCC developers are actually struggling to find a solution to this problem; see gcc.gnu.org/bugzilla/show_bug.cgi?id=56253

It makes more sense to me to have these mul_add functions (and variants such as mul_sub and mul_add_x) as part of the core of the VCL instead of in a separate header
Good point. It depends on how long we have to wait for GCC and other compilers to implement this optimization. I will think about it.