This issue is getting more and more absurd the more I dig into it. AMD makes a function library called AMD Core Math Library (ACML) to match Intel's Math Kernel Library (MKL). I have tested a Windows version of ACML and found that some of the functions run faster when the CPU vendor ID is artificially changed to "GenuineIntel". Maybe this is not so surprising after all, since this version of ACML is compiled with Intel's Fortran compiler.
Here are some of the most marked test results:
Execution time (lower is better), by faked CPU vendor ID:

| ACML function    | VIA  | AMD  | Intel | % difference |
|------------------|------|------|-------|--------------|
| drandlogistic    | 1.95 | 1.96 | 1.84  | 6            |
| drandexponential | 1.67 | 1.72 | 1.57  | 8            |
| drandlognormal   | 3.42 | 3.46 | 2.99  | 15           |

ACML version acml4.4.0-ifort32.exe, VIA L3050 1.8 GHz processor, Windows 7,
32 bit. MS VS 2010 C++. Loop 100000 times * 256 values. Time unit = 10^9 clock
cycles. Average of 20 runs.
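The post does not say how the "% difference" column was calculated. As a rough sketch, assuming it is the mean of the VIA and AMD timings compared with the Intel timing and rounded to the nearest integer, the following hypothetical helper (percent_slower is my name, not part of ACML) reproduces the three figures in the table:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, not part of ACML or the original test program.
// Assumption: "% difference" = how much slower the non-Intel vendor IDs are,
// on average, than the faked "GenuineIntel" ID, rounded to the nearest integer.
int percent_slower(double via, double amd, double intel) {
    assert(intel > 0.0);                        // timings must be positive
    double non_intel_mean = (via + amd) / 2.0;  // average of the two non-Intel IDs
    double relative = (non_intel_mean - intel) / intel;
    return static_cast<int>(std::lround(relative * 100.0));
}
```

With the drandlognormal row, percent_slower(3.42, 3.46, 2.99) evaluates to 15, which matches the table. Other formulas may also fit the rounded values, so this is only one plausible reading.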
For many of the functions in ACML there is little or no difference in
performance depending on the CPU vendor ID, but some functions have a
significant bias, as shown in the table above. Intel have repeatedly claimed
that their compilers give good performance on AMD chips if you compile for the
SSE2 instruction set. Maybe the AMD people believed this claim, or maybe
they had no other option since they couldn't find a better Fortran compiler.
With the SSE2 compiler option, the compiler-generated code will be for the SSE2
instruction set only. I think that Intel first made the SSE2 recommendation at
the time when AMD processors supported nothing later than SSE2, so this was the
best performance you could get at that time. Today, you get suboptimal
performance when compiling for SSE2 because later instruction sets are not used.
And of course, the code will not work on older computers without SSE2.
To find the reason for the vendor ID effect, I decided to investigate the
function with the strongest effect, which is the drandlognormal function. After
a lot of detective work, I found that drandlognormal calls a logarithm function
in Intel's Short Vector Math Library (SVML). This logarithm function is
dispatched into three branches for the SSE2/generic, SSE3, and the future AVX
instruction set, respectively. It uses the standard Intel CPU dispatcher, which
gives the generic branch to all non-Intel processors. The SVML library supports
only SSE2 and above, so the generic branch uses SSE2. When my VIA processor
pretends to be an Intel processor, it gets the SSE3 branch, which is better
optimized. The difference in performance is likely to be higher on future
processors that support AVX.
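The dispatching behaviour described above can be illustrated with a small model. This is only a sketch of the logic as I have described it, not Intel's actual dispatcher code: the names Branch, CpuInfo, dispatch_vendor_gated and dispatch_by_features are hypothetical, and the CPU is represented by a plain struct rather than by real CPUID queries. The vendor strings are the ones the CPUID instruction reports for Intel, AMD and VIA (Centaur) chips.

```cpp
#include <cassert>
#include <string>

// The three code branches named in the text.
enum class Branch { GenericSSE2, SSE3, AVX };

// Simplified stand-in for what CPUID reports (hypothetical struct).
struct CpuInfo {
    std::string vendor;  // "GenuineIntel", "AuthenticAMD", "CentaurHauls", ...
    bool sse3;           // CPU supports SSE3
    bool avx;            // CPU supports AVX
};

// Models the behaviour described above: any non-Intel vendor string
// gets the generic SSE2 branch, regardless of supported instruction sets.
Branch dispatch_vendor_gated(const CpuInfo& cpu) {
    if (cpu.vendor != "GenuineIntel") return Branch::GenericSSE2;
    if (cpu.avx)  return Branch::AVX;
    if (cpu.sse3) return Branch::SSE3;
    return Branch::GenericSSE2;
}

// For comparison: a vendor-neutral dispatcher that looks only at the
// instruction sets the CPU says it supports.
Branch dispatch_by_features(const CpuInfo& cpu) {
    if (cpu.avx)  return Branch::AVX;
    if (cpu.sse3) return Branch::SSE3;
    return Branch::GenericSSE2;
}
```

In this model, a CPU described as {"CentaurHauls", true, false} gets the generic SSE2 branch from dispatch_vendor_gated, but the SSE3 branch as soon as the vendor field is changed to "GenuineIntel". dispatch_by_features returns the SSE3 branch in both cases, which is what one would expect from a dispatcher that checks supported instruction sets only.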
There is another version of ACML for Windows built with the PGI compiler, but
I couldn't make it work because some library files were missing.
The proposed settlement with the FTC requires Intel to reimburse its
compiler customers for the cost of recompiling their code with a different
compiler. While this reimbursement program probably has little more than
symbolic significance, it would be funny to see Intel compensating AMD for
relying on their compiler. Unfortunately, it will be difficult for AMD to find a better Fortran compiler.