Choosing the most efficient function library can be a
nightmare for a programmer. I have calculated the cosine function with
different libraries and compared the calculation times. The best version is 19
times faster than the worst!
AMD has now updated its math libraries and added CPU
dispatching. There are two versions of the code in AMD's LIBM
library: one for the SSE2 instruction set and one for AVX and FMA4. Intel processors will run the
inferior SSE2 branch because they don't have the
FMA4 instruction set. The incompatibility between Intel's and AMD's FMA
instructions is another scandal, which I have discussed in
this blog post.
The AMD library does not check the CPU brand name as Intel's libraries do. It
only checks for the FMA4 instructions, which are not supported by Intel
processors, although - quite ironically - they were designed by Intel. It will
be possible to run the better branch on Intel processors if Intel decides to
support the FMA4 instruction set in the future.
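To make the difference between feature-based and brand-based dispatching concrete, here is a minimal sketch in C. It is not AMD's actual code. The names `code_path`, `select_path`, `detect_path` and `path_name` are hypothetical, and the detection relies on the GCC/Clang builtin `__builtin_cpu_supports`, which I assume is available. The point is only that the decision depends on feature flags and never looks at the vendor string.

```c
/* Hypothetical code paths; the names are illustrative, not taken from AMD LIBM. */
typedef enum { PATH_SSE2 = 0, PATH_FMA4 = 1 } code_path;

/* Pure selection rule: decide from feature flags only,
   never from the CPU vendor string. */
static code_path select_path(int has_avx, int has_fma4)
{
    return (has_avx && has_fma4) ? PATH_FMA4 : PATH_SSE2;
}

/* Query the running CPU. Assumes GCC or Clang on x86; on any other
   compiler or architecture this sketch just returns the baseline path. */
static code_path detect_path(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return select_path(__builtin_cpu_supports("avx") != 0,
                       __builtin_cpu_supports("fma4") != 0);
#else
    return PATH_SSE2;
#endif
}

/* Human-readable name of a path, e.g. for logging the dispatch decision. */
static const char *path_name(code_path p)
{
    return p == PATH_FMA4 ? "FMA4" : "SSE2";
}
```

A program would call `detect_path()` once at startup and store a function pointer to the matching implementation. With a rule like this, any processor that reports AVX and FMA4 gets the better branch, whatever its brand.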
The following table shows the results of some tests I have
made of different math libraries.
| Library | Elements per vector | Dispatch version | Time, 32 bit mode | Time, 64 bit mode |
|---|---|---|---|---|
| glibc 2.13 | 1 | none | 1400 | 1030 |
| MS 17.00 | 1 | SSE2 | 724 | 200 |
| Intel 12.1.3 | 1 | generic | 1950 | 303 |
| Intel 12.1.3 | 1 | Intel | 720 | 295 |
| Intel SVML 12.1.3 | 4 | generic (SSE2) | 360 | 203 |
| Intel SVML 12.1.3 | 4 | Intel AVX | 99 | 188 |
| Intel SVML 12.1.3 | 8 | generic (AVX) | 112 | 128 |
| Intel SVML 12.1.3 | 8 | Intel AVX | 108 | 101 |
| AMD LIBM 3.0.2 | 4 | generic (SSE2) | n.a. | 245 |
| AMD LIBM 3.0.2 | 4 | FMA4 | n.a. | 148 |

Calculation time in clock cycles for the cosine of a
vector of 8 single-precision floats on an AMD Bulldozer CPU in a single
thread with different function libraries
(values are imprecise due to the varying clock frequency).
The GNU function library (glibc) uses outdated and
inefficient code. The Microsoft library has decent performance in the 64-bit
version, but of course it supports only the Windows platform. Intel's general
math library is no better in my test case, but Intel's Short Vector Math Library
(SVML) is very good. The SVML library supports vectors of 4 floats in an XMM
register (SSE2) or vectors of 8 floats in a YMM register (AVX). It will choose
the inferior generic path for non-Intel processors unless we replace Intel's
CPU dispatcher as described above. Intel's libraries are available for
Windows, Linux and Mac. AMD's LIBM library supports vectors of 4 floats. It is
available for Windows and Linux, but only in 64-bit mode.
The sad conclusion is that we have no fully optimized math
function library that supports all brands of x86 processors and all operating
systems. If we want optimal performance on all processors, the best choice is to
use Intel's SVML library and manipulate it into treating non-Intel processors
better.
It would be nice if more people worked on improving glibc.
This library supports all processors and platforms, but it is poorly optimized.
Only a few memory and string functions in glibc have CPU dispatching, while the
math functions have only old and poorly optimized versions. It would also be nice to have vector versions of the math functions in glibc because the GNU compiler has support for such functions.