I'm mostly finished tuning hsums for __m128i vectors. For horizontal_add_x(Vec16c), we can range-shift to unsigned and use psadbw, so that's a huge improvement.
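To make the range-shift idea concrete, here is a minimal standalone sketch with raw SSE2 intrinsics. It is not the code from my branch: the function name `hsum_epi8_psadbw` is made up for illustration, and using pshufd for the final high-half extract is just one reasonable choice.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper (name is illustrative): sum 16 signed bytes
// into a 32-bit integer without overflow.
static inline int32_t hsum_epi8_psadbw(__m128i v) {
    // Range-shift to unsigned: flipping the sign bit maps x in [-128, 127]
    // to x + 128 in [0, 255].
    __m128i u = _mm_xor_si128(v, _mm_set1_epi8(static_cast<char>(-128)));
    // psadbw against zero sums each group of 8 unsigned bytes into the
    // low 16 bits of the corresponding 64-bit half.
    __m128i sad = _mm_sad_epu8(u, _mm_setzero_si128());
    // Bring the high half down with pshufd (a copy-and-shuffle, so no
    // separate movdqa is needed without AVX) and add the two partial sums.
    __m128i hi = _mm_shuffle_epi32(sad, _MM_SHUFFLE(3, 2, 3, 2));
    __m128i sum = _mm_add_epi32(sad, hi);
    // Undo the bias: 16 elements, each shifted up by 128.
    return _mm_cvtsi128_si32(sum) - 16 * 128;
}
```

The bias correction is a single scalar subtract at the end, so the whole reduction is one xor, one psadbw, one shuffle, and one add.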
Many of the _x functions do one step of extend/add and then just call the normal horizontal_add function for the next wider width.
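As a rough illustration of that pattern (not the exact code in the repo), here is a sketch for signed 16-bit elements. It assumes pmaddwd against a vector of ones as the extend/add step, and `hsum_epi32_plain` is a hypothetical stand-in for the normal 32-bit horizontal_add.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical stand-in for the ordinary 32-bit horizontal add.
static inline int32_t hsum_epi32_plain(__m128i v) {
    __m128i hi64 = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));      // swap 64-bit halves
    __m128i sum64 = _mm_add_epi32(v, hi64);
    __m128i hi32 = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));  // swap adjacent dwords
    return _mm_cvtsi128_si32(_mm_add_epi32(sum64, hi32));
}

// Extended horizontal add of 8 signed 16-bit elements (illustrative name).
static inline int32_t hsum_x_epi16(__m128i v) {
    // One extend/add step: pmaddwd multiplies each 16-bit element by 1,
    // sign-extends to 32 bits, and adds adjacent pairs.
    __m128i pairs = _mm_madd_epi16(v, _mm_set1_epi16(1));
    // Hand the four 32-bit partial sums to the next-wider horizontal add.
    return hsum_epi32_plain(pairs);
}
```

After the widening step nothing can overflow 32 bits, so the rest of the reduction needs no special handling.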
I removed all the slow phadd code. In some cases, I changed things to avoid a movdqa in the SSE2 / SSE4 versions without AVX. With AVX, these changes mostly just save code size, and maybe increase ILP.
For CPUs with slow shuffles (like Merom), there should be nice improvements from using pshuflw instead of pshufd when possible.
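Here is a sketch of what I mean for a 32-bit horizontal sum (the function name is illustrative, not the library's). Once the two 64-bit halves have been added, only the low 64 bits matter, so the last shuffle can be pshuflw instead of a full-width pshufd.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Illustrative 32-bit horizontal sum using pshuflw for the final step.
static inline int32_t hsum_epi32_shuflw(__m128i v) {
    // Step 1: full-width pshufd to swap the 64-bit halves, then add.
    __m128i hi64 = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i sum64 = _mm_add_epi32(v, hi64);
    // Step 2: pshuflw only touches the low 64 bits. Swapping 16-bit word
    // pairs (2,3) and (0,1) moves dword 1 down into dword 0.
    __m128i hi32 = _mm_shufflelo_epi16(sum64, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i sum32 = _mm_add_epi32(sum64, hi32);
    // The low dword now holds the total of all four elements.
    return _mm_cvtsi128_si32(sum32);
}
```

Like pshufd, pshuflw is a copy-and-shuffle, so this doesn't reintroduce a movdqa in the non-AVX builds.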
Anyway, I pushed my changes to GitHub. I have *not* turned them into a nice patch series, so all the mess of development is there. I can refactor the history into a series of clean commits if that's useful, but you don't use public version control for the library, so IDK if it would benefit anything long-term.
I still haven't really looked at float or 256b vectors yet, but I'd like your comments on coding-style and how much detail to put in comments before I start on those.