H.J. Lu, a long-time compiler expert at Intel, today merged improved memmove() handling into the GNU Compiler Collection ahead of the upcoming GCC 16 release.
The change has GCC on x86_64 inlining memmove with overlapping unaligned loads and stores. H.J. Lu laid out his rationale in the patch's commit message:
“x86-64: Inline memmove with overlapping unaligned loads and stores
Inline memmove in 64-bit since there are far fewer registers available in 32-bit:
1. Load all sources into registers and store them together to avoid possible address overlap between source and destination.
2. For known size, first try to fully unroll with 8 registers.
3. For size <= 2 * MOVE_MAX, load all sources into 2 registers first and then store them together.
4. For size > 2 * MOVE_MAX and size <= 4 * MOVE_MAX, load all sources into 4 registers first and then store them together.
5. For size > 4 * MOVE_MAX and size <= 8 * MOVE_MAX, load all sources into 8 registers first and then store them together.
6. For size > 8 * MOVE_MAX,
a. If address of destination > address of source, copy backward with a 4 * MOVE_MAX loop with unaligned loads and stores. Load the first 4 * MOVE_MAX into 4 registers before the loop and store them after the loop to support overlapping addresses.
b. Otherwise, copy forward with a 4 * MOVE_MAX loop with unaligned loads and stores. Load the last 4 * MOVE_MAX into 4 registers before the loop and store them after the loop to support overlapping addresses.
Verified and benchmarked memmove implementations inlined with GPR, SSE2, AVX2 and AVX512 using glibc memmove tests.
…
Their performance is comparable with the optimized memmove implementations in glibc on an Intel Core i7-1195G7.”
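To make the overlap trick concrete, below is a minimal C sketch of two of the cases described above. It is not GCC's actual expansion, which happens inside the compiler at code-generation time: MOVE_MAX is assumed here to be 8 bytes (the plain general-purpose-register case; the SSE2/AVX2/AVX-512 variants just widen the register), the helper names are invented for illustration, and the fixed-size memcpy calls stand in for single unaligned register loads and stores, which compilers lower to plain mov instructions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MOVE_MAX 8  /* assumed: one 8-byte GPR; SSE2/AVX2/AVX-512 widen this */

/* MOVE_MAX <= n <= 2 * MOVE_MAX: the head and tail loads may overlap each
   other, but both complete before either store, so any overlap between
   src and dst is harmless. */
static void move_le_2max(char *dst, const char *src, size_t n)
{
    uint64_t head, tail;
    memcpy(&head, src, MOVE_MAX);                 /* unaligned loads first  */
    memcpy(&tail, src + n - MOVE_MAX, MOVE_MAX);
    memcpy(dst, &head, MOVE_MAX);                 /* stores only afterwards */
    memcpy(dst + n - MOVE_MAX, &tail, MOVE_MAX);
}

/* n > 8 * MOVE_MAX with dst < src (the forward case, 6b above): preload
   the last 4 * MOVE_MAX bytes, since the loop's stores may clobber the
   source tail when the regions overlap closely; store them after the loop. */
static void move_large_forward(char *dst, const char *src, size_t n)
{
    uint64_t t0, t1, t2, t3, r0, r1, r2, r3;
    memcpy(&t0, src + n - 4 * MOVE_MAX, MOVE_MAX);
    memcpy(&t1, src + n - 3 * MOVE_MAX, MOVE_MAX);
    memcpy(&t2, src + n - 2 * MOVE_MAX, MOVE_MAX);
    memcpy(&t3, src + n - 1 * MOVE_MAX, MOVE_MAX);

    /* Forward 4 * MOVE_MAX loop: within each iteration all loads happen
       before any store, and dst < src ensures earlier stores never clobber
       source bytes that have not yet been read. */
    size_t i;
    for (i = 0; i + 4 * MOVE_MAX <= n; i += 4 * MOVE_MAX) {
        memcpy(&r0, src + i + 0 * MOVE_MAX, MOVE_MAX);
        memcpy(&r1, src + i + 1 * MOVE_MAX, MOVE_MAX);
        memcpy(&r2, src + i + 2 * MOVE_MAX, MOVE_MAX);
        memcpy(&r3, src + i + 3 * MOVE_MAX, MOVE_MAX);
        memcpy(dst + i + 0 * MOVE_MAX, &r0, MOVE_MAX);
        memcpy(dst + i + 1 * MOVE_MAX, &r1, MOVE_MAX);
        memcpy(dst + i + 2 * MOVE_MAX, &r2, MOVE_MAX);
        memcpy(dst + i + 3 * MOVE_MAX, &r3, MOVE_MAX);
    }

    /* The preloaded tail covers the sub-block remainder; where it overlaps
       what the loop already wrote, it rewrites identical bytes. */
    memcpy(dst + n - 4 * MOVE_MAX, &t0, MOVE_MAX);
    memcpy(dst + n - 3 * MOVE_MAX, &t1, MOVE_MAX);
    memcpy(dst + n - 2 * MOVE_MAX, &t2, MOVE_MAX);
    memcpy(dst + n - 1 * MOVE_MAX, &t3, MOVE_MAX);
}
```

The invariant carrying both helpers is the one H.J. Lu describes: every load from a potentially aliased region completes before the first store into it, which is what makes a memcpy-style sequence of unaligned register moves safe under memmove's overlapping semantics.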
The code was merged this morning ahead of GCC 16's stage 3 milestone this month. GCC 16.1, the first stable release of the GCC 16 series, should be out around March or April.
