Vector Class Discussion

 
Bilinear interpolation of images - Chad Jarvis - 2013-02-05
Bilinear interpolation of images - Agner - 2013-02-06
Bilinear interpolation of images - Chad Jarvis - 2013-02-07
 
Bilinear interpolation of images
Author: Chad Jarvis Date: 2013-02-05 05:56
Hi. I have implemented bilinear interpolation of images in C++ on the CPU. I found a blog post that does this with SSE2 and SSE3 instructions, which gave the fastest results:
fastcpp.blogspot.no/2011/06/bilinear-pixel-interpolation-using-sse.html#comment-form

I have tried to implement the same code using the vector class library. However, it is slower than I had hoped compared with the native SSE code on the blog. In fact, fixed-point math is even faster than my vector class code. I thought you might be interested in what I have done, and perhaps you have suggestions for improving it. Below is my implementation of the GetPixelSSE and GetPixelSSE3 functions from the fastcpp blog.

inline Vec4f CalcWeights_vector(float x, float y) {
    Vec4f v1(x);
    v1.insert(1, y);                                // x y x x
    Vec4f v2 = floor(v1);
    Vec4f frac = v1 - v2;                           // dx dy X X
    Vec4f frac1 = 1 - frac;                         // 1-dx 1-dy X X
    Vec4f w_x = blend4f<4, 0, 4, 0>(frac, frac1);   // 1-dx dx 1-dx dx
    Vec4f w_y = blend4f<5, 5, 1, 1>(frac, frac1);   // 1-dy 1-dy dy dy

    return w_x * w_y;                               // one weight per pixel of the 2x2 block
}

int GetPixelSSE(const int* data, float u, float v, const int src_width, const int src_height, Vec4f& weights) {
    Vec16uc p12x;
    Vec16uc p34x;
    //Vec16uc p12x = load_partial2(&data[0]);
    //Vec16uc p34x = load_partial2(&data[src_width]);

    p12x.load(&data[0]);             // top row: pixels 1 and 2 in the low 8 bytes
    //p12x.load_partial(8, &data[0]);
    p34x.load(&data[src_width]);     // bottom row: pixels 3 and 4 in the low 8 bytes
    //p34x.load_partial(8, &data[src_width]);

    Vec8us p12 = extend_low(p12x);   // 8-bit channels -> 16-bit
    Vec8us p34 = extend_low(p34x);
    weights *= 256;                  // float weights in [0,1] -> fixed point
    Vec4i weighti = round_to_int(weights);
    Vec8us weighti2 = compress(weighti, 0);
    Vec8us w12 = permute8us<0,0,0,0,1,1,1,1>((Vec8us)weighti2);
    Vec8us w34 = permute8us<2,2,2,2,3,3,3,3>((Vec8us)weighti2);

    Vec8us L1234 = w12*p12 + w34*p34;
    Vec8us Lhi = permute8us<4,5,6,7,-256,-256,-256,-256>(L1234);
    L1234 += Lhi;
    L1234 /= 256;
    Vec16uc L = compress(L1234, 0);
    Vec4ui out = L;
    return out[0];
}

int GetPixelSSE3(const int* data, float u, float v, const int src_width, const int src_height, Vec4f& weights) {
    Vec4ui row1, row2;
    row1.load(&data[0]);
    row2.load(&data[src_width]);
    Vec4ui aos = blend4ui<0, 1, 4, 5>(row1, row2);   // p1 p2 p3 p4
    Vec16uc soa = permute16uc<0,4,8,12, 1,5,9,13, 2,6,10,14, 3,7,11,15>((Vec16uc)aos); //AoS to SoA

    Vec8us rg = extend_low(soa);
    Vec8us ba = extend_high(soa);

    Vec4ui redv = extend_low(rg);
    Vec4ui greenv = extend_high(rg);
    Vec4ui bluev = extend_low(ba);
    weights *= 256;
    Vec4ui wi = round_to_int(weights);

    //no mm_madd in vectorclass

    int red = horizontal_add(redv*wi)/256;
    int green = horizontal_add(greenv*wi)/256;
    int blue = horizontal_add(bluev*wi)/256;
    int color = red + (green << 8) + (blue << 16);
    return color;
}
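
For checking the vectorized functions above, a plain scalar version can serve as a reference. This is only a sketch and is not taken from the blog or from the vector class library: the names calc_weights_scalar and get_pixel_scalar are made up for illustration. It assumes 32-bit pixels with four 8-bit channels, and that data points at the top-left pixel of the 2x2 neighbourhood. The weights are in the same order as in CalcWeights_vector, and all four channels are blended as in GetPixelSSE.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// Weights in the same lane order as CalcWeights_vector:
// (1-dx)(1-dy), dx(1-dy), (1-dx)dy, dx*dy.
inline std::array<float, 4> calc_weights_scalar(float x, float y) {
    const float dx = x - std::floor(x);
    const float dy = y - std::floor(y);
    return {(1.0f - dx) * (1.0f - dy), dx * (1.0f - dy),
            (1.0f - dx) * dy,          dx * dy};
}

// Blend the 2x2 block starting at data[0] with fixed-point weights (scale 256).
// Assumption: each pixel packs four 8-bit channels into one 32-bit word.
inline std::uint32_t get_pixel_scalar(const std::uint32_t* data, int src_width,
                                      const std::array<float, 4>& weights) {
    assert(src_width > 0);
    const std::uint32_t p[4] = {data[0], data[1],
                                data[src_width], data[src_width + 1]};
    std::uint32_t w[4];
    for (int i = 0; i < 4; ++i)
        w[i] = static_cast<std::uint32_t>(std::lrint(weights[i] * 256.0f));

    std::uint32_t color = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        std::uint32_t sum = 0;
        for (int i = 0; i < 4; ++i)
            sum += ((p[i] >> shift) & 0xFFu) * w[i];   // channel * weight
        color |= ((sum >> 8) & 0xFFu) << shift;        // back to 8 bits
    }
    return color;
}
```

Unlike the functions above, this version takes a copy of the weights and does not scale the caller's vector in place.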

   
Bilinear interpolation of images
Author: Agner Date: 2013-02-06 00:11
Chad Jarvis wrote:
I have implemented bilinear interpolation of images in C++
Your code spends most of the time moving data around and converting between float, 8-bit, 16-bit and 32-bit integers. You may think about whether the data can be organized differently so that you don't need so much moving around, reordering and conversion.

Your functions return a single pixel. You may think about whether the data can be organized so that you keep everything in vectors and return a vector of multiple pixels, if this can reduce the number of conversions.

Division by 256 can be done faster by (unsigned) shift right by 8. (Or divide by const_uint(256)).
Multiplication by 256 is done faster by shift left by 8.
Permutation and blend are very slow unless you enable instruction set SSSE3 or higher.

You should add red, green and blue before doing the horizontal add, so that you only need one horizontal add.
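
The point about dividing by 256 can be illustrated without any vector code. The following is a minimal scalar sketch with hypothetical helper names (to_fixed, scale_channel, shift_matches_division), none of which belong to the vector class library. It converts a weight to fixed point with scale 256, scales one channel with a shift instead of a division, and checks exhaustively that the two agree for every unsigned 16-bit value.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Convert a weight in [0, 1] to fixed point with scale 256 (result 0..256).
inline std::uint32_t to_fixed(float w) {
    assert(w >= 0.0f && w <= 1.0f);
    return static_cast<std::uint32_t>(std::lrint(w * 256.0f));
}

// Scale an 8-bit channel by a fixed-point weight; ">> 8" replaces "/ 256".
inline std::uint32_t scale_channel(std::uint32_t channel, std::uint32_t w_fixed) {
    return (channel * w_fixed) >> 8;
}

// Check every unsigned 16-bit value: shifting right by 8 equals dividing by 256.
inline bool shift_matches_division() {
    for (std::uint32_t x = 0; x <= 0xFFFFu; ++x)
        if ((x >> 8) != (x / 256u)) return false;
    return true;
}
```

The equivalence holds for unsigned values only. For negative signed values, integer division truncates toward zero while an arithmetic shift rounds toward negative infinity, so the two can differ.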

   
Bilinear interpolation of images
Author: Chad Jarvis Date: 2013-02-07 04:32
I managed to optimize the code some more and now it's almost as fast as the intrinsic code.

I first changed "L1234/=256" to "L1234/=const_uint(256)" and finally to "L1234 >>= 8". That was the first big improvement, but the next one was unexpected.

I changed most vectors from unsigned to signed and used "compress_saturated" instead of "compress". I figured this out by looking at "vectori128.h" and the intrinsic code. On signed vectors, compress_saturated uses only one instruction (the same as the intrinsic code), whereas compress uses at least three. compress_saturated on unsigned vectors uses many more. In your manual you describe the efficiency of compress_saturated as "medium (worse than compress in most cases)". I don't understand this, because for signed integer vectors (Vec4i, Vec8s, Vec16c) compress_saturated is simpler and faster.

Below is the code that gets closest to the intrinsic code.

int GetPixelSSE(const int* data, float u, float v, const int src_width, const int src_height, Vec4f& weights) {
    Vec16uc p12x;
    Vec16uc p34x;

    p12x.load(&data[0]);
    p34x.load(&data[src_width]);

    Vec8s p12 = extend_low(p12x);
    Vec8s p34 = extend_low(p34x);
    weights *= 256; //weights contains floats with values [0,1]. I have to multiply before I convert to integers.
    Vec4i weighti = round_to_int(weights);
    Vec8s weighti2 = compress_saturated(weighti, 0);

    Vec8s w12 = permute8s<0,0,0,0,1,1,1,1>(weighti2);
    Vec8s w34 = permute8s<2,2,2,2,3,3,3,3>(weighti2);

    Vec8s L1234 = w12*p12 + w34*p34;
    Vec8s Lhi = permute8s<4,5,6,7,-256,-256,-256,-256>(L1234);
    L1234 += Lhi;
    L1234 >>= 8;
    Vec16c L = compress_saturated(L1234, 0);
    Vec4i out = L;
    return out[0];
}
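
It may look as if the signed version cannot handle channel values above 127, since the weighted sums can exceed 32767. The scalar sketch below suggests why it still gives the right byte. It rests on three assumptions about the vector code: the 16-bit multiply and add wrap modulo 2^16, ">>= 8" on Vec8s is an arithmetic shift, and compress_saturated on Vec8s saturates to the signed range [-128, 127]. The helper names are made up for illustration. Under those assumptions the resulting byte has the same bit pattern as the unsigned "sum >> 8" for every 16-bit sum.

```cpp
#include <cassert>
#include <cstdint>

// Value seen in a signed 16-bit lane holding the bit pattern of s.
inline int wrap_to_int16(std::uint32_t s) {
    assert(s <= 0xFFFFu);
    const int v = static_cast<int>(s);
    return v >= 32768 ? v - 65536 : v;
}

// Arithmetic shift right by 8, written as floor division by 256 so that it
// does not depend on how the compiler shifts negative numbers.
inline int arithmetic_shift_right_8(int v) {
    return v >= 0 ? v / 256 : -((-v + 255) / 256);
}

// Saturate to [-128, 127], then return the bit pattern of the signed byte.
inline std::uint8_t pack_signed_saturated(int v) {
    if (v < -128) v = -128;
    if (v > 127) v = 127;
    return static_cast<std::uint8_t>(v);
}

// The whole signed path for one 16-bit weighted sum.
inline std::uint8_t signed_path(std::uint32_t s) {
    return pack_signed_saturated(arithmetic_shift_right_8(wrap_to_int16(s)));
}

// Compare with the unsigned result (s >> 8) for every 16-bit sum.
inline bool signed_path_matches_unsigned() {
    for (std::uint32_t s = 0; s <= 0xFFFFu; ++s)
        if (signed_path(s) != static_cast<std::uint8_t>(s >> 8)) return false;
    return true;
}
```

Sums of 32768 and above become negative in the signed lane, the arithmetic shift maps them to -128..-1, and those values pass through the signed saturation unchanged, which yields the bytes 0x80..0xFF.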