Google engineer Eric Biggers, who has been responsible for many great Linux cryptography subsystem performance optimizations in recent years, has another exciting patch series. Biggers has done great work optimizing various functions for modern Intel/AMD CPUs, especially around AVX-512 implementations, and now he has another big optimization coming for CRC32 checksum performance.
On Saturday, Eric Biggers posted a new patch series working to improve CRC32C performance on lengths of at least 512 bytes. The strategy relies on AVX-512's VPCLMULQDQ vector carry-less multiplication instruction on capable CPUs.
In the patch series on the Linux kernel mailing list, he explained:
“Improve crc32c() performance on lengths >= 512 bytes by using crc32_lsb_vpclmul_avx512() instead of crc32c_x86_3way(), when the CPU supports VPCLMULQDQ and has a “good” implementation of AVX-512. For now that means AMD Zen 4 and later, and Intel Sapphire Rapids and later. Pass crc32_lsb_vpclmul_avx512() the table of constants needed to make it use the CRC-32C polynomial.
Rationale: VPCLMULQDQ performance has improved on newer CPUs, making crc32_lsb_vpclmul_avx512() faster than crc32c_x86_3way(), even though crc32_lsb_vpclmul_avx512() is designed for generic 32-bit CRCs and does not utilize x86_64’s dedicated CRC-32C instructions.”
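For readers unfamiliar with CRC-32C: it is the same CRC construction as the classic CRC-32 but with the Castagnoli polynomial, which is why the patch passes crc32_lsb_vpclmul_avx512() a different table of constants. Below is a minimal, hedged bit-at-a-time reference in C showing that polynomial in its reflected form; the function name is made up for illustration, and the kernel of course uses table-based and SIMD implementations rather than anything this slow.

```c
#include <stdint.h>
#include <stddef.h>

/* Reflected form of the CRC-32C (Castagnoli) polynomial. */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/* Hypothetical bit-at-a-time CRC-32C reference, for illustration only.
 * Processes one bit per loop iteration; real implementations use lookup
 * tables, the x86_64 crc32q instruction, or carry-less multiplication. */
static uint32_t crc32c_ref(uint32_t crc, const uint8_t *buf, size_t len)
{
	crc = ~crc;
	while (len--) {
		crc ^= *buf++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED & -(crc & 1));
	}
	return ~crc;
}
```

As a sanity check, the well-known CRC-32C check value for the ASCII string "123456789" is 0xE3069283.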
The end result? Some pretty nice speed-ups in CRC32C performance on modern Intel and AMD CPUs that sport a "good" AVX-512 implementation.
Eric Biggers did go on to note, though, that AMD's AVX-512 implementation still has a much lower warm-up time than even recent Intel Xeon CPUs:
“That being said, in the above benchmarks the ZMM registers are “hot”, so they don’t quite tell the whole story. While significantly improved from older Intel CPUs, Intel still has ~2000 ns of ZMM warm-up time where 512-bit instructions execute 4 times more slowly than they normally do. In contrast, AMD does better and has virtually zero ZMM warm-up time (at most ~60 ns). Thus, while this change is always beneficial on AMD, strictly speaking there are cases in which it is not beneficial on Intel, e.g. a small number of 512-byte messages with “cold” ZMM registers. But typically, it is beneficial even on Intel.
Note that on AMD Zen 3–5, crc32c() performance could be further improved with implementations that interleave crc32q and VPCLMULQDQ instructions. Unfortunately, it appears that a different such implementation would be optimal on *each* of these microarchitectures. Such improvements are left for future work. This commit just improves the way that we choose the implementations we already have.”
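The commit is about choosing between existing implementations, so the interesting logic is the selection itself: use the VPCLMULQDQ/AVX-512 path only for sufficiently long buffers on CPUs that both support VPCLMULQDQ and have a "good" AVX-512 implementation (AMD Zen 4+, Intel Sapphire Rapids+). A rough C sketch of that decision follows; the struct, enum, and function names here are illustrative assumptions, not the kernel's actual identifiers.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical CPU capability flags, standing in for the kernel's
 * runtime feature detection. */
struct cpu_caps {
	bool has_vpclmulqdq;
	bool good_avx512;	/* negligible downclocking/warm-up penalty */
};

enum crc_impl {
	CRC_3WAY,		/* crc32q-instruction-based 3-way code */
	CRC_VPCLMUL_AVX512,	/* VPCLMULQDQ/AVX-512 code */
};

/* Illustrative dispatch: the AVX-512 path only pays off on long
 * buffers (the series uses a 512-byte cutoff) and on CPUs where
 * 512-bit instructions run at full speed. */
static enum crc_impl choose_crc32c_impl(const struct cpu_caps *caps,
					size_t len)
{
	if (len >= 512 && caps->has_vpclmulqdq && caps->good_avx512)
		return CRC_VPCLMUL_AVX512;
	return CRC_3WAY;
}
```

So, for example, a 4 KiB buffer on a Zen 4 class CPU would take the VPCLMULQDQ path, while a 64-byte buffer, or any buffer on an older CPU, would stay on the crc32q-based code.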
The performance results are nice, and hopefully this VPCLMULQDQ-optimized crc32c() code will make it to the mainline Linux kernel soon, further enhancing performance on modern AVX-512-capable x86_64 processors.