Vector Class Discussion

Bilinear interpolation of images
Author: Chad Jarvis Date: 2013-02-07 04:32
I managed to optimize the code some more, and now it's almost as fast as the intrinsic code.

I first changed "L1234/=256" to "L1234/=const_uint(256)" and finally to "L1234>>=8". That was the first big improvement, but the next one was unexpected.
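To illustrate why the shift is a safe replacement here, below is a minimal scalar sketch. The helper names scale_div, scale_shift and shift_matches_division are hypothetical and not part of the vector class library. x86 SIMD has no integer vector division instruction, so a division has to be emulated with multiplications and shifts, while a shift by 8 is a single instruction. For non-negative values the two give the same result:

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only (not part of the vector class library).

// Drop the 8 fractional bits of a fixed-point value by dividing.
inline std::int16_t scale_div(std::int16_t x) {
    return static_cast<std::int16_t>(x / 256);
}

// Drop the 8 fractional bits with a right shift instead.
inline std::int16_t scale_shift(std::int16_t x) {
    return static_cast<std::int16_t>(x >> 8);
}

// Check that both variants agree for every non-negative 16-bit value.
inline bool shift_matches_division() {
    for (int x = 0; x <= 32767; ++x) {
        const std::int16_t v = static_cast<std::int16_t>(x);
        if (scale_div(v) != scale_shift(v)) {
            return false;
        }
    }
    return true;
}
```

The two differ for negative inputs: division rounds toward zero, while an arithmetic shift rounds toward negative infinity, so the substitution is only exact when the values are known to be non-negative.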

I changed most vectors from unsigned to signed and used "compress_saturated" instead of "compress". I figured this out by looking at "vectori128.h" and the intrinsic code. compress_saturated on signed vectors uses only one instruction (the same as the intrinsic code), whereas compress uses at least three. compress_saturated on unsigned vectors uses many more. In your manual you describe the efficiency of compress_saturated as "medium (worse than compress in most cases)". I don't understand this, because for signed integer vectors (Vec4i, Vec8s, Vec16c) compress_saturated is simpler and faster.
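The difference between a wrapping and a saturating compress can be modeled in scalar code. This is only a sketch of the per-element behavior when narrowing 32-bit to 16-bit signed integers; narrow_wrap and narrow_saturate are hypothetical names, not library functions. As far as I know, the signed saturating case corresponds to the single SSE2 pack instructions (PACKSSDW and PACKSSWB), which would explain why it is cheap for signed vectors.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical scalar models for illustration only.

// Wrapping narrowing: keep the low 16 bits, so out-of-range values wrap around.
// The unsigned-to-signed conversion wraps on mainstream compilers
// (it is guaranteed to do so from C++20 on).
inline std::int16_t narrow_wrap(std::int32_t x) {
    return static_cast<std::int16_t>(static_cast<std::uint16_t>(x & 0xFFFF));
}

// Saturating narrowing: out-of-range values are clipped to the
// nearest representable 16-bit signed value.
inline std::int16_t narrow_saturate(std::int32_t x) {
    const std::int32_t clipped =
        std::max<std::int32_t>(-32768, std::min<std::int32_t>(32767, x));
    return static_cast<std::int16_t>(clipped);
}
```

For values that already fit in 16 bits the two agree, e.g. both return 256 for an input of 256. They only differ on overflow: 40000 wraps to -25536 but saturates to 32767.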

Below is the code that gets closest to the intrinsic code.

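int GetPixelSSE(const int* data, float u, float v, const int src_width, const int src_height, Vec4f& weights) {
    // Each int in data holds one pixel as four 8-bit channels.
    // Load from the current row and from the row below; only the two
    // low pixels of each load are used.
    Vec16uc p12x;
    Vec16uc p34x;

    p12x.load(&data[0]);
    p34x.load(&data[src_width]);

    // Extend the 8 low bytes (two pixels) to 16-bit elements.
    Vec8s p12 = extend_low(p12x);
    Vec8s p34 = extend_low(p34x);
    weights *= 256; // weights contains floats with values in [0,1]; multiply before converting to integers
    Vec4i weighti = round_to_int(weights);
    Vec8s weighti2 = compress_saturated(weighti, 0);

    // Broadcast each weight across the four channels of its pixel.
    Vec8s w12 = permute8s<0,0,0,0,1,1,1,1>(weighti2);
    Vec8s w34 = permute8s<2,2,2,2,3,3,3,3>(weighti2);

    // Weighted sum per channel; the two pixels of each row are still in separate halves.
    Vec8s L1234 = w12*p12 + w34*p34;
    // Move the high half down (-256 = don't care) and add it to the low half.
    Vec8s Lhi = permute8s<4,5,6,7,-256,-256,-256,-256>(L1234);
    L1234 += Lhi;
    L1234 >>= 8; // undo the scaling by 256
    // Pack the four channels back into bytes and return them as one pixel.
    Vec16c L = compress_saturated(L1234, 0);
    Vec4i out = L;
    return out[0];
}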
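For comparison, here is a minimal scalar sketch of the arithmetic the vector code performs for a single 8-bit channel. The function blend_channel and its parameter layout are hypothetical and only meant to make the fixed-point steps explicit: the weights are scaled by 256 and rounded, the four weighted pixels are summed, and the sum is shifted right by 8.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical scalar reference for illustration only.
// p: the same channel taken from the four neighboring pixels
// w: the four bilinear weights, each in [0,1], summing to about 1
inline std::uint8_t blend_channel(const std::uint8_t p[4], const float w[4]) {
    int sum = 0;
    for (int i = 0; i < 4; ++i) {
        // Scale the weight to 8 fractional bits and round using the
        // current rounding mode (round-to-nearest by default).
        const int wi = static_cast<int>(std::lrint(w[i] * 256.0f));
        sum += wi * p[i];
    }
    // Remove the 8 fractional bits; clip in case rounding pushed the
    // weights slightly above a total of 256.
    return static_cast<std::uint8_t>(std::min(sum >> 8, 255));
}
```

With four equal weights of 0.25 and channel values 100, 200, 100 and 200, each scaled weight is 64, the sum is 38400, and the result after the shift is 150.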

 
Bilinear interpolation of images - Chad Jarvis - 2013-02-05
    Bilinear interpolation of images - Agner - 2013-02-06
        Bilinear interpolation of images - Chad Jarvis - 2013-02-07