Vector Class Discussion

Performance of decimation v blending
Author: Duong Tran  Date: 2016-05-03 09:38
Hello,

I've been fighting the following problem for weeks and am about to give up, but I still hope I am missing something.

The problem originates from a fundamental signal processing task: transpose an 8x4 matrix of 16-bit unsigned short integers, and the inverse operation (i.e. transpose a 4x8 matrix back).


typedef Vec8us W8T;

void transpo (W8T y[4], W8T const x[4])
{ // Transpose the 8*4 matrix x and store result into 4*8 matrix y
W8T x0,x1,x2,x3,y0,y1,y2,y3;
x0.load( &x[0] );
x1.load( &x[1] );
x2.load( &x[2] );
x3.load( &x[3] );
y0 = blend8us<0,2,4,6,8,10,12,14>(x0,x1);
y1 = blend8us<1,3,5,7,9,11,13,15>(x0,x1);
y2 = blend8us<0,2,4,6,8,10,12,14>(x2,x3);
y3 = blend8us<1,3,5,7,9,11,13,15>(x2,x3);
x0 = blend8us<0,2,4,6,8,10,12,14>(y0,y2);
x2 = blend8us<1,3,5,7,9,11,13,15>(y0,y2);
x1 = blend8us<0,2,4,6,8,10,12,14>(y1,y3);
x3 = blend8us<1,3,5,7,9,11,13,15>(y1,y3);
x0.store( &y[0] );
x1.store( &y[1] );
x2.store( &y[2] );
x3.store( &y[3] );
}

void transpi (W8T y[4], W8T const x[4])
{ // Transpose the 4*8 matrix x and store result into 8*4 matrix y
W8T x0,x1,x2,x3,y0,y1,y2,y3;
x0.load( &x[0] );
x1.load( &x[1] );
x2.load( &x[2] );
x3.load( &x[3] );
y0 = blend8us<0, 8,1, 9,2,10,3,11>(x0,x2);
y2 = blend8us<4,12,5,13,6,14,7,15>(x0,x2);
y1 = blend8us<0, 8,1, 9,2,10,3,11>(x1,x3);
y3 = blend8us<4,12,5,13,6,14,7,15>(x1,x3);
x0 = blend8us<0, 8,1, 9,2,10,3,11>(y0,y1);
x1 = blend8us<4,12,5,13,6,14,7,15>(y0,y1);
x2 = blend8us<0, 8,1, 9,2,10,3,11>(y2,y3);
x3 = blend8us<4,12,5,13,6,14,7,15>(y2,y3);
x0.store( &y[0] );
x1.store( &y[1] );
x2.store( &y[2] );
x3.store( &y[3] );
}
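
For reference, a minimal round-trip check of the two routines (a sketch only; the test values and the memcmp comparison are just for illustration) could look like this:


#include "vectorclass.h"
#include <cstdio>
#include <cstring>

typedef Vec8us W8T;

// transpo() and transpi() as defined above

int main()
{
    unsigned short a[4][8], b[4][8];
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 8; j++)
            a[i][j] = (unsigned short)(8*i + j);   // distinct values 0..31

    W8T x[4], y[4], z[4];
    for (int i = 0; i < 4; i++) x[i].load(a[i]);

    transpo(y, x);                       // 8*4 -> 4*8
    transpi(z, y);                       // 4*8 -> 8*4, should reproduce x

    for (int i = 0; i < 4; i++) z[i].store(b[i]);
    printf("round trip %s\n", memcmp(a, b, sizeof a) == 0 ? "ok" : "FAILED");
    return 0;
}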

Nice, but... benchmarks show that transpo() is substantially slower than transpi(), at least with the SSSE3 instruction set. For example, when compiled with gcc -march=core2 and run on a Core 2 Duo E4500 processor, it is more than twice as slow.

After many trials with this and about half a dozen more versions of transpo(), I'm about to conclude that (within the vectorclass library) one of the two fundamental operations, the _decimation_


y0 = blend8us<0,2,4,6,8,10,12,14>(x0,x1);
y1 = blend8us<1,3,5,7,9,11,13,15>(x0,x1);

is substantially slower than its counterpart, the _blending_


y0 = blend8us<0, 8,1, 9,2,10,3,11>(x0,x1);
y1 = blend8us<4,12,5,13,6,14,7,15>(x0,x1);
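
My best guess at the reason (I have not checked exactly what code the library generates, so this intrinsics sketch is only an assumption, and the helper names are mine): the blending pattern maps directly onto punpcklwd/punpckhwd, one instruction per result vector, while on SSSE3 the decimation pattern needs something like two pshufb plus an OR per result vector, with the shuffle constants to be loaded on top of that.


#include <tmmintrin.h>   // SSSE3

// Blending / interleaving: a single unpack instruction per result
static inline __m128i interleave_lo16(__m128i a, __m128i b)
{
    return _mm_unpacklo_epi16(a, b);          // a0 b0 a1 b1 a2 b2 a3 b3
}
static inline __m128i interleave_hi16(__m128i a, __m128i b)
{
    return _mm_unpackhi_epi16(a, b);          // a4 b4 a5 b5 a6 b6 a7 b7
}

// Decimation (even elements of two vectors) on SSSE3:
// two pshufb plus an OR per result, bytes with index -1 are zeroed
static inline __m128i even16(__m128i a, __m128i b)
{
    const __m128i sel_a = _mm_setr_epi8( 0, 1, 4, 5, 8, 9,12,13,
                                        -1,-1,-1,-1,-1,-1,-1,-1);
    const __m128i sel_b = _mm_setr_epi8(-1,-1,-1,-1,-1,-1,-1,-1,
                                         0, 1, 4, 5, 8, 9,12,13);
    return _mm_or_si128(_mm_shuffle_epi8(a, sel_a),
                        _mm_shuffle_epi8(b, sel_b));   // a0 a2 a4 a6 b0 b2 b4 b6
}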

A temporary workaround for decimation is via blending:


void transpo (W8T y[4], W8T const x[4])
{ // Transpose the 8*4 matrix x and store result into 4*8 matrix y
W8T x0,x1,x2,x3,y0,y1,y2,y3;
x0.load( &x[0] );
x1.load( &x[1] );
x2.load( &x[2] );
x3.load( &x[3] );
y0 = blend8us<0, 8,1, 9,2,10,3,11>(x0,x2);
y2 = blend8us<4,12,5,13,6,14,7,15>(x0,x2);
y1 = blend8us<0, 8,1, 9,2,10,3,11>(x1,x3);
y3 = blend8us<4,12,5,13,6,14,7,15>(x1,x3);
x0 = blend8us<0, 8,1, 9,2,10,3,11>(y0,y1);
x1 = blend8us<4,12,5,13,6,14,7,15>(y0,y1);
x2 = blend8us<0, 8,1, 9,2,10,3,11>(y2,y3);
x3 = blend8us<4,12,5,13,6,14,7,15>(y2,y3);
y0 = blend8us<0, 8,1, 9,2,10,3,11>(x0,x2);
y1 = blend8us<4,12,5,13,6,14,7,15>(x0,x2);
y2 = blend8us<0, 8,1, 9,2,10,3,11>(x1,x3);
y3 = blend8us<4,12,5,13,6,14,7,15>(x1,x3);
y0.store( &y[0] );
y1.store( &y[1] );
y2.store( &y[2] );
y3.store( &y[3] );
}

which (of course) is still about 1.5 times slower than transpi: it needs 12 interleaving blends where transpi needs only 8.

But I don't think this is the best vectorclass can do. Am I right about that?

 