Just a few days ago I wrote about the Glibc math code seeing a 4x improvement on AMD Zen by changing the used FMA implementation. Merged overnight was a new generic FMA implementation for the GNU C Library and now yielding up to a 12.9x throughput improvement on AMD Zen 3.
Adhemerval Zanella contributed this new generic FMA implementation to the GNU C Library. Zanella explained in the patch landing this new generic Fused Multiply Add (FMA) implementation:
“The current implementation relies on setting the rounding mode for different calculations (first to FE_TONEAREST and then to FE_TOWARDZERO) to obtain correctly rounded results. For most CPUs, this adds a significant performance overhead since it requires executing a typically slow instruction (to get/set the floating-point status), it necessitates flushing the pipeline, and breaks some compiler assumptions/optimizations.
This patch introduces a new implementation originally written by Szabolcs for musl, which utilizes mostly integer arithmetic. Floating-point arithmetic is used to raise the expected exceptions, without the need for fenv.h operations.
I added some changes compared to the original code:
* Fixed some signaling NaN issues when the 3-argument is NaN.
* Use math_uint128.h for the 64-bit multiplication operation. It allows the compiler to use 128-bit types where available, which enables some optimizations on certain targets (for instance, MIPS64).
* Fixed an arm32 issue where the libgcc routine might not respect the rounding mode. This can also be used on other targets to optimize the conversion from int64_t to double.
* Use -fexcess-precision=standard on i686.”
This new musl libc based implementation is showing some “large improvements” with tests carried out by Adhemerval Zanella:
In another commit, Adhemerval Zanella summed up the recent math improvements made for Glibc 2.43 as:
“* Additional optimized and correctly rounded mathematical functions have been imported from the CORE-MATH project, in particular acosh, asinh, atanh, erf, erfc, lgamma, and tgamma.
* Optimized implementations for remainder, remaindef, frexpf, frexp, frexpl (binary128), and frexpl (intel96) have been added.
* The SVID handling for acosf, acoshf, asinhf, atan2f, atanhf, coshf, lgammaf/lgammaf_r, log10f, sinhf, sqrtf, tgammaf, y0/j0, y1/j1, and yn/jn were moved to compat symbols, allowing improvements in performance.”
Look for these improvements and more with Glibc 2.43 due for release in February.
