Vector Class Discussion

 
thread Proposal for vector gather() method - epsilon - 2013-04-02
last replythread Proposal for vector gather() method - Agner - 2013-04-03
last replythread Proposal for vector gather() method - epsilon - 2013-04-03
last reply Proposal for vector gather() method - Agner - 2013-04-03
 
Proposal for vector gather() method
Author: epsilon Date: 2013-04-02 16:26
I would like to propose a gather() method as an extension to the vectorclass library that would allow to load a vector elements from
arbitrary memory locations specified by an index which is itself a vector register.
I have found such a gather() instruction to be an invaluable tool for 'real-world' applications where the data items are not usually distributed contigously in memory.
Unfortunately the gather() instruction is missing in all instruction sets prior to AVX2 so it would have to be emulated in software something like this:

inline __m128 vgather4(const float *base, const __m128i &indices)
{
__m128 res = _mm_load_ss(base+_mm_cvtsi128_si32(indices));
res = _mm_insert_ps(res,_mm_load_ss(base+_mm_extract_epi32(indices,1)),_MM_MK_INSERTPS_NDX(0,1,0));
res = _mm_insert_ps(res,_mm_load_ss(base+_mm_extract_epi32(indices,2)),_MM_MK_INSERTPS_NDX(0,2,0));
res = _mm_insert_ps(res,_mm_load_ss(base+_mm_extract_epi32(indices,3)),_MM_MK_INSERTPS_NDX(0,3,0));
return res;
}

inline __m256 vgather8(const float *base, const Vec8i &indices)
{
const __m128 low = vgather4(base,indices.get_low()),
high = vgather4(base,indices.get_high());
return _mm256_insertf128_ps(_mm256_castps128_ps256(low),high,1);
}

This looks like a lot of overhead (it compiles to 18 assembly instructions to basically load 8 values from memory so if one does only
one or two computations after loading a vector, there will be little performance benefit.
However there are many usage cases where data can be re-used many times for computations once loaded in a vector register so there is potential for significant speedup with the gather() instruction (for example I was able to get a speedup of a factor of three for a 3D grid tri-cubic interpolation code for fluid simulation advection using the above gather() instruction on top of <vectorclass.h>)

I wonder however if there may be a faster method for emulating gather() with pre-AVX2 means ?
(under AVX1 one could use an implementation basd on _mm256_blend_ps/_mm256_broadcast_ss but I am not sure if this would be faster though)

Usage example: simultaneously loading 8 pixels at 8 arbitrary coordinate pairs (ix,iy) from a 2D grid:

Vec8i ix,iy,grid_width;
float *p_grid_base;
...
ix = min(max(ix,0),grid_width); ... // calculate & clamp coords etc.
Vec8i idx = ix + iy * grid_width; // calc indices
Vec8f pixel_value = vgather8( p_grid_base, idx ); // load 8 pixels in parallel

   
Proposal for vector gather() method
Author: Agner Date: 2013-04-03 00:06
epsilon wrote:
I would like to propose a gather() method
There is such as function. It's called lookup<n>.
I think it can do what you are asking for. Maybe it would be useful to have a gather function with fixed indices, like this:
Vec4f x = gather<3,5,8,7>(array);
   
Proposal for vector gather() method
Author:  Date: 2013-04-03 04:23
I did consider the lookup<n> function but from the documentation and source code it looks like that function is mainly intended for fixed, small numbers of <n>.

For arbitrary large n, the implementation in vectorf256.h is based on store and re-load from temp. memory locations

uint32_t ii[8]; index1.store(ii);
float rr[8];
for (int j = 0; j < 8; j++) {
rr[j] = table[ii[j]];
}
return Vec8f().load(rr);

which from my experience is generally less efficient than directly working with shuffles etc. in the registers themselves.

So I thought it would be nice to have reasonably fast, general purpose gather() for arbitrary index vectors, analogous to the AVX2 VGATHER instruction ....

   
Proposal for vector gather() method
Author: Agner Date: 2013-04-03 07:50
I agree that this method is inefficient because of a store forwarding stall (see my microarchitecture manual for an explanation of this). We don't need another function, just a better implementation of lookup<n>