Vector Class Discussion

 
simple ray casting test slower when vectorized - epsilon - 2013-03-02
simple ray casting test slower when vectorized - Agner - 2013-03-03
simple ray casting test slower when vectorized - epsilon - 2013-03-03
simple ray casting test slower when vectorized - Agner - 2013-03-03
simple ray casting test slower when vectorized - chad - 2013-03-14
 
simple ray casting test slower when vectorized
Author: epsilon Date: 2013-03-02 12:57
Hi

First let me thank you for providing this very interesting vectorclass library!

I'd like to use the library for accelerating a ray casting application and did some preliminary testing.
Unfortunately, the vectorized loop (using Vec3d, Vec4d) runs two times slower than the non-vectorized version on my Sandy Bridge Xeon, compiled with gcc 4.7:

Vec4d v_ray_pos(ray_pos[0], ray_pos[1], ray_pos[2], 0),
      v_ray_dir(ray_dir[0], ray_dir[1], ray_dir[2], 0);
Vec4d v_pos(0, 0, 0, 0);
double pos[3] = {0, 0, 0};

double t, t0 = 0, t1 = 10000, dt = (t1 - t0) / 500000000.;

// non-vectorized loop
TIMER_START(&tm);
for (t = t0; t < t1; t += dt)
{
    for (k = 0; k < 3; k++) pos[k] += dt * ray_dir[k];
    for (k = 0; k < 3; k++) pos[k] += dt * ray_dir[k];
    for (k = 0; k < 3; k++) pos[k] += dt * ray_dir[k];
}
TIMER_STOP(&tm);

// vectorized loop
TIMER_START(&tm);
for (t = t0; t < t1; t += dt)
{
    v_pos += dt * v_ray_dir;
    v_pos += dt * v_ray_dir;
    v_pos += dt * v_ray_dir;
}
TIMER_STOP(&tm);

Am I doing something wrong in how I use the vectorclass types, or how else could one explain the poor performance of the vectorized loop compared to the plain array-based version?

Thanks a lot for any help and insights!

P.S.: Apparently gcc did not auto-vectorize the non-vectorized loop (at least it did not report doing so with -ftree-vectorizer-verbose=2).

   
simple ray casting test slower when vectorized
Author: Agner Date: 2013-03-03 02:27
epsilon wrote:
Unfortunately, the vectorized loop (using Vec3d, Vec4d) runs two times slower than the non-vectorized version on my Sandy Bridge Xeon, compiled with gcc 4.7:
Try to look at the assembly output to see what the compiler is doing in each case. It is possible that the compiler has optimized away everything if the value of pos is not used anywhere.
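One way to guard against that, as a minimal sketch under assumptions: put the loop in a function and print pos after the timer stops, so that the result is observably used. The names march_scalar and report and the integer step count n are hypothetical and not from the original post. Compiling with g++ -O2 -S then shows what is left of the loop.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for the scalar loop of the first post.
// An integer step count n replaces the floating-point loop variable t.
void march_scalar(const double ray_dir[3], double dt, long n, double pos[3]) {
    assert(n >= 0);
    for (long i = 0; i < n; i++) {
        for (int k = 0; k < 3; k++) pos[k] += dt * ray_dir[k];
    }
}

// Print the result after timing. Because pos is written to the output,
// the compiler has to actually compute it.
void report(const double pos[3]) {
    std::printf("pos=%g,%g,%g\n", pos[0], pos[1], pos[2]);
}
```

Calling report(pos) after TIMER_STOP in both variants should make the two timings comparable, since neither loop can then be dropped as unused.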
   
simple ray casting test slower when vectorized
Author: epsilon Date: 2013-03-03 05:57
OK, I simplified my example further (it seems that in the non-vectorized code the compiler was able to condense the multiple identical lines within the loop).

Now we get the following result, which looks as expected: the vectorclass code uses only a single packed "vaddpd", whereas the non-vectorized loop does three separate "vaddsd"s for the position (the remaining "vaddsd" in each loop updates t):

non-vectorized: double pos[3],ray_dir[3];
------------------------------------------
for (t = t0; t < t1; t += dt)
{
    for (k = 0; k < 3; k++) pos[k] += dt * ray_dir[k];
}

.L2:
    vaddsd   %xmm4, %xmm0, %xmm0
    vaddsd   %xmm8, %xmm3, %xmm3
    vaddsd   %xmm7, %xmm2, %xmm2
    vucomisd %xmm0, %xmm5
    vaddsd   %xmm6, %xmm1, %xmm1
    ja       .L2

result: pos=330.082,9961.07,817.343 ----> CPU = 686.4 ms

Vectorized: Vec3d v_pos,v_ray_dir;
-----------------------------------

for (t = t0; t < t1; t += dt)
{
    v_pos += dt * v_ray_dir;
}

.L2:
    vaddsd   %xmm4, %xmm0, %xmm0
    vaddpd   %ymm2, %ymm1, %ymm1
    vucomisd %xmm0, %xmm3
    ja       .L2

result: v_pos=330.082,9961.07,817.343 ----> CPU = 468.0 ms


This looks really good: it gives a speedup of about 1.5 (686.4 ms / 468.0 ms ≈ 1.47), which corresponds nicely to the vectorized loop containing only 4 instructions versus the 6 instructions of the scalar variant.

So, with this proving the concept of a convenient 'zero-overhead' SIMD vector class, I am going to convert parts of the 'real world' code of the main ray marcher application. That will be interesting, because it is probably going to be a tight race between RAM bandwidth and compute bandwidth (SIMD/AVX): the ray marcher has to plough through dozens of gigabytes of voxel data (tricubically interpolated, i.e. requiring 64 memory fetches per voxel ;-)

BTW: Applications like this dozens-of-GB ray marcher are the main reason why I prefer a many-core CPU + SIMD/AVX + 100+ GB of memory architecture over GPU-based solutions, which may be even faster compute-wise but still provide only 4 GB of RAM ...

   
simple ray casting test slower when vectorized
Author: Agner Date: 2013-03-03 09:53
You can improve it further by using an integer as loop counter and using multiple accumulators, as explained in my C++ manual.
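A rough sketch of both ideas, written with plain doubles so that it compiles without the vectorclass header. The names Pos3 and march, the choice of two accumulators, and the precomputed step (dt * ray_dir) are assumptions for illustration, not code from the manual.

```cpp
#include <cassert>
#include <cstdint>

struct Pos3 { double x, y, z; };

// Hypothetical helper: add `step` to the position n times.
// The loop counter is an integer, and two independent accumulators (a, b)
// break the single dependency chain on pos, so consecutive additions do
// not have to wait for each other.
Pos3 march(const double step[3], std::int64_t n) {
    assert(n >= 0);
    double a[3] = {0.0, 0.0, 0.0};
    double b[3] = {0.0, 0.0, 0.0};
    std::int64_t i = 0;
    for (; i + 1 < n; i += 2) {
        for (int k = 0; k < 3; k++) a[k] += step[k];  // chain 1
        for (int k = 0; k < 3; k++) b[k] += step[k];  // chain 2
    }
    if (i < n) {                                      // odd n: one step left
        for (int k = 0; k < 3; k++) a[k] += step[k];
    }
    return Pos3{a[0] + b[0], a[1] + b[1], a[2] + b[2]};
}
```

With Vec3d or Vec4d the same pattern would presumably use two vector accumulators in place of a and b. Note that splitting the sum changes the order of the floating-point additions, so the last digits of the result can differ from the single-accumulator loop.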
   
simple ray casting test slower when vectorized
Author: chad Date: 2013-03-14 13:02
Hi Epsilon. I looked at your code and it looks to me like you're not using SIMD optimally.

For ray casting you will get a lot more throughput by using packet-based ray casting, in other words by using an array of struct of arrays (AoSoA). For SSE this means you store four rays like xxxxyyyyzzzz. You should get a fourfold speedup with SSE and about 6x with AVX. See for example
graphics.stanford.edu/~boulos/papers/cook_gi07.pdf
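
To illustrate the layout, here is a minimal sketch in plain C++. The names RayPacket4, advance and advance_all are hypothetical, and the packet holds four doubles per component (the width of one Vec4d) instead of the four floats of the SSE case mentioned above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical packet of four rays, stored component-wise (xxxx yyyy zzzz).
struct RayPacket4 {
    double pos_x[4], pos_y[4], pos_z[4];
    double dir_x[4], dir_y[4], dir_z[4];
};

// Advance all four rays of one packet by dt along their directions.
// Each loop touches four contiguous doubles, which is the shape of one
// 4-lane SIMD operation.
void advance(RayPacket4 &p, double dt) {
    for (int i = 0; i < 4; i++) p.pos_x[i] += dt * p.dir_x[i];
    for (int i = 0; i < 4; i++) p.pos_y[i] += dt * p.dir_y[i];
    for (int i = 0; i < 4; i++) p.pos_z[i] += dt * p.dir_z[i];
}

// The outer "array" of AoSoA: a contiguous sequence of packets.
void advance_all(std::vector<RayPacket4> &packets, double dt) {
    assert(dt == dt);  // reject NaN step sizes
    for (RayPacket4 &p : packets) advance(p, dt);
}
```

In this layout every SIMD lane carries a different ray, so all four lanes do useful work, whereas a Vec4d holding x, y, z and a padding zero uses only three.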