GNU C Library Sees Up To 12.9x Improvement With New Generic FMA Implementation

Last updated: 2025/11/27 at 6:38 AM

News Room Published 27 November 2025

Just a few days ago I wrote about the Glibc math code seeing a 4x improvement on AMD Zen by changing which FMA implementation is used. Merged overnight was a new generic FMA implementation for the GNU C Library that now yields up to a 12.9x throughput improvement on AMD Zen 3.

Adhemerval Zanella contributed this new generic Fused Multiply-Add (FMA) implementation to the GNU C Library and explained in the patch:

“The current implementation relies on setting the rounding mode for different calculations (first to FE_TONEAREST and then to FE_TOWARDZERO) to obtain correctly rounded results. For most CPUs, this adds a significant performance overhead since it requires executing a typically slow instruction (to get/set the floating-point status), it necessitates flushing the pipeline, and breaks some compiler assumptions/optimizations.

This patch introduces a new implementation originally written by Szabolcs for musl, which utilizes mostly integer arithmetic. Floating-point arithmetic is used to raise the expected exceptions, without the need for fenv.h operations.

I added some changes compared to the original code:

* Fixed some signaling NaN issues when the 3-argument is NaN.

* Use math_uint128.h for the 64-bit multiplication operation. It allows the compiler to use 128-bit types where available, which enables some optimizations on certain targets (for instance, MIPS64).

* Fixed an arm32 issue where the libgcc routine might not respect the rounding mode. This can also be used on other targets to optimize the conversion from int64_t to double.

* Use -fexcess-precision=standard on i686.”
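To make the "mostly integer arithmetic" point more concrete, here is a minimal sketch of the kind of building block such an approach depends on: an exact 64x64-bit multiply producing a 128-bit result, which is wide enough to hold the full product of two 53-bit double significands. This is not the Glibc or musl code. The names `u128_parts`, `mul_64x64`, `mul_64x64_portable`, and `significand_bits` are hypothetical and do not reflect the actual interface of math_uint128.h, and the sketch assumes IEEE 754 binary64 doubles and normal, finite inputs.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit result type for illustration only;
   this is NOT the interface of glibc's math_uint128.h. */
typedef struct {
    uint64_t hi;  /* upper 64 bits of the product */
    uint64_t lo;  /* lower 64 bits of the product */
} u128_parts;

/* Portable path: build the product from four 32x32->64 partial products. */
static u128_parts mul_64x64_portable(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p00 = a_lo * b_lo;
    uint64_t p01 = a_lo * b_hi;
    uint64_t p10 = a_hi * b_lo;
    uint64_t p11 = a_hi * b_hi;

    /* Middle column (bits 32..63) plus the carry out of p00.
       Three values below 2^32 are summed, so this cannot overflow. */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;

    u128_parts r;
    r.lo = (mid << 32) | (uint32_t)p00;
    r.hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
    return r;
}

/* Use a native 128-bit type where the compiler provides one
   (GCC and Clang define __SIZEOF_INT128__ on such targets). */
static u128_parts mul_64x64(uint64_t a, uint64_t b)
{
#if defined(__SIZEOF_INT128__)
    unsigned __int128 p = (unsigned __int128)a * b;
    u128_parts r = { (uint64_t)(p >> 64), (uint64_t)p };
    return r;
#else
    return mul_64x64_portable(a, b);
#endif
}

/* Extract the 53-bit significand (with the implicit leading 1) of a
   normal, finite double. Assumes IEEE 754 binary64 layout. */
static uint64_t significand_bits(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (bits & 0x000FFFFFFFFFFFFFULL) | 0x0010000000000000ULL;
}
```

Calling `mul_64x64(significand_bits(a), significand_bits(b))` gives the exact 106-bit significand product with no intermediate rounding, so a single rounding step can be applied after the addend has been folded in. A real implementation also has to handle exponents, signs, subnormals, infinities, NaNs, and exception flags, none of which this sketch attempts.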

This new musl libc-based implementation shows some “large improvements” in tests carried out by Adhemerval Zanella:

[Figure: New FMA implementation benchmarks]

In another commit, Adhemerval Zanella summed up the recent math improvements made for Glibc 2.43 as:

“* Additional optimized and correctly rounded mathematical functions have been imported from the CORE-MATH project, in particular acosh, asinh, atanh, erf, erfc, lgamma, and tgamma.

* Optimized implementations for remainder, remainderf, frexpf, frexp, frexpl (binary128), and frexpl (intel96) have been added.

* The SVID handling for acosf, acoshf, asinhf, atan2f, atanhf, coshf, lgammaf/lgammaf_r, log10f, sinhf, sqrtf, tgammaf, y0/j0, y1/j1, and yn/jn were moved to compat symbols, allowing improvements in performance.”

Look for these improvements and more in Glibc 2.43, due for release in February 2026.