Vector Class Discussion

FMA and non temporal stores
Author: Chad  Date: 2014-09-22 06:44
Hi Agner,

I noticed that you added inline implementations of the math functions to your VCL. Before you did this, the alternatives were Intel's expensive, closed-source SVML; AMD's free but closed-source LIBM, neither of which works well on the competitor's hardware; or various open-source libraries using intrinsics, which worked only for single or only for double precision, or only for SSE and not AVX, and so forth. Having one SIMD library that works well on both Intel and AMD, for single and double precision, and for SSE, AVX, FMA3, FMA4, and AVX512, is one of the best features of your VCL.

Additionally, I noticed that you added the to_bits function, which calls the movemask instructions, to the VCL. I appreciate that.

I have one suggestion for your VCL. Would you consider adding functions that call the non-temporal store instructions (_mm_stream_ps, _mm256_stream_ps, _mm_stream_si128, ...)? For example, Vec8f().stream(x) would use _mm256_stream_ps(x).
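
As a rough sketch of what such a function might wrap, here is a non-temporal copy written directly with intrinsics. The helper name stream_copy is hypothetical and is not part of the VCL. The sketch uses the 128-bit _mm_stream_ps so that it builds without AVX flags; a Vec8f version would use _mm256_stream_ps and need 32-byte alignment instead. On targets without SSE it falls back to memcpy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>   // _mm_loadu_ps, _mm_stream_ps, _mm_sfence
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

// Hypothetical helper, not a VCL function: copy n floats to dst using
// non-temporal stores. n must be a multiple of 4 and dst 16-byte aligned.
inline void stream_copy(float* dst, const float* src, std::size_t n) {
    assert(n % 4 == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
#if SKETCH_HAVE_SSE
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);  // ordinary load, alignment not required
        _mm_stream_ps(dst + i, v);         // store with a non-temporal hint
    }
    _mm_sfence();  // order the streamed stores before any later stores
#else
    std::memcpy(dst, src, n * sizeof(float));  // portable fallback, no streaming
#endif
}
```

Non-temporal stores mainly pay off for large buffers that will not be read again soon, since they avoid filling the cache with the written data.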

Among my favourite additions to the VCL that I have discovered are the mul_add (and mul_add_x) functions. The rest of my comments are about these functions. In your manual you write:

The FMA3 and FMA4 instruction sets are not handled directly by the code in the vector class library, but by the compiler.
The compiler will automatically combine a floating point multiplication and a subsequent addition or subtraction into a
single instruction, unless you have specified a strict floating point model.

This seems to contradict the existence of the mul_add functions. More importantly, I have never observed this behaviour in GCC with the VCL. I have observed it with GCC's own vector extensions, but the VCL is built from intrinsics, and GCC (usually) treats intrinsics like inline assembly. So the VCL never generates FMA3 instructions in GCC except through functions such as mul_add, which explicitly use the FMA intrinsics. So I don't understand this paragraph in your manual.
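
To make the distinction concrete, here is a minimal sketch of the two forms being compared. The helper names are hypothetical and this is not the VCL's own code. The first function writes the multiply and the add as separate intrinsics, so whether it becomes an FMA instruction is left to the compiler and may depend on its version and flags (for example -mfma and -ffp-contract in GCC). The second asks for a fused operation explicitly, using _mm_fmadd_ps when the build enables FMA3 and std::fma otherwise.

```cpp
#include <cmath>   // std::fma
#if defined(__SSE__) || defined(_M_X64)
#include <immintrin.h>
#define SKETCH_SSE 1
#else
#define SKETCH_SSE 0
#endif

// Hypothetical helper: r = a*b + c on four floats, as a separate multiply
// and add. Any fusing into one instruction is up to the compiler.
inline void mul_then_add4(const float* a, const float* b, const float* c, float* r) {
#if SKETCH_SSE
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    _mm_storeu_ps(r, _mm_add_ps(prod, _mm_loadu_ps(c)));
#else
    for (int i = 0; i < 4; ++i) r[i] = a[i] * b[i] + c[i];
#endif
}

// Hypothetical helper: the same computation with the fused operation
// requested explicitly, so the result is rounded only once.
inline void fused_mul_add4(const float* a, const float* b, const float* c, float* r) {
#if SKETCH_SSE && defined(__FMA__)
    _mm_storeu_ps(r, _mm_fmadd_ps(_mm_loadu_ps(a), _mm_loadu_ps(b), _mm_loadu_ps(c)));
#else
    for (int i = 0; i < 4; ++i) r[i] = std::fma(a[i], b[i], c[i]);
#endif
}
```

Inspecting the generated assembly (for example with -S) for each function is the reliable way to see which instructions a given compiler actually emits.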

It makes more sense to me to have these mul_add functions (and variants such as mul_sub and mul_add_x) as part of the core of the VCL instead of in a separate header (vectormath_common.h) which needs to be included.

Kind Regards,
Chad

 
thread FMA and non temporal stores - chad - 2014-09-22
last reply FMA and non temporal stores new - Agner - 2014-09-24