chad wrote:
Would you consider adding a
function which calls the non-temporal store
instructions?
That would be possible, but I don't know how useful it would be. Nontemporal stores are rarely optimal and I wouldn't expect the average programmer to know when they are. The programmer would have to check the cache size and use nontemporal stores only when writing memory blocks bigger than half the size of the last-level cache. Writing directly to video RAM may be another application, but I don't think it is safe to use vector classes in a device driver. You are free to make your own extensions, of course, or use the intrinsic functions directly.
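For anyone who wants to try this with the intrinsic functions directly, here is a minimal sketch of the size rule described above. It assumes an x86 target with SSE. The function name `fill_floats` and the parameter `last_level_cache_bytes` are made up for illustration and are not part of the vector class library; the caller has to find the cache size by some other means.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE: _mm_set1_ps, _mm_store_ps, _mm_stream_ps, _mm_sfence

// Hypothetical helper, not part of any library.
// Fills n floats at dst with value. dst must be 16-byte aligned and n a multiple of 4.
// last_level_cache_bytes is supplied by the caller (assumption: the caller knows it).
// Returns true if the nontemporal path was taken.
inline bool fill_floats(float* dst, std::size_t n, float value,
                        std::size_t last_level_cache_bytes) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(n % 4 == 0);

    const __m128 v = _mm_set1_ps(value);

    // Rule of thumb from the post: bypass the cache only when the block
    // is bigger than half the last-level cache.
    const bool nontemporal = n * sizeof(float) > last_level_cache_bytes / 2;

    if (nontemporal) {
        for (std::size_t i = 0; i < n; i += 4) {
            _mm_stream_ps(dst + i, v);   // nontemporal store, bypasses the cache
        }
        _mm_sfence();                    // make the streamed data visible before returning
    } else {
        for (std::size_t i = 0; i < n; i += 4) {
            _mm_store_ps(dst + i, v);    // ordinary aligned store
        }
    }
    return nontemporal;
}
```

Whether the nontemporal path is actually faster has to be measured on the target machine; the threshold here is only the rule of thumb from the reply above.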
The compiler will automatically combine a floating
point multiplication and a subsequent addition or
subtraction into a single instruction ...
But more importantly,
I have never observed this in GCC with the
VCL.
You are right. GCC is not as good as I thought. The GCC developers are actually struggling to find a solution to this problem, see gcc.gnu.org/bugzilla/show_bug.cgi?id=56253
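To make the distinction concrete, here is a small scalar sketch in standard C++ without the vector classes. The names `mul_add_plain`, `mul_add_fused` and `mul_add_array` are made up for this example. The plain version leaves the decision to the compiler, which may or may not contract it into one instruction depending on the compiler, the target and the options used. The explicit version calls `std::fma`, which always rounds only once.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical names, for illustration only.

// Plain expression: the compiler is free to emit a separate multiply and add,
// or to fuse them, depending on its settings and the target instruction set.
inline double mul_add_plain(double a, double b, double c) {
    return a * b + c;
}

// Explicit fused multiply-add: a*b+c is computed with a single rounding.
// Without hardware FMA support this may fall back to a slower library routine.
inline double mul_add_fused(double a, double b, double c) {
    return std::fma(a, b, c);
}

// y[i] = a[i] * b[i] + c[i], using the explicit form in a simple loop.
inline void mul_add_array(const double* a, const double* b, const double* c,
                          double* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = std::fma(a[i], b[i], c[i]);
    }
}
```

The two scalar functions can give slightly different results because the fused form skips the intermediate rounding of the product. Looking at the generated assembly is the only reliable way to see what the compiler did with the plain form.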
It makes more sense to me
to have these mul_add (and variants such as mul_sub
and mul_add_x) functions as part of the core of the
VCL instead of in a separate header.
Good point. It depends on how long we have to wait for GCC and other compilers to implement this optimization. I will think about it.