A patch posted today to the Linux kernel mailing list provides an ARM64-optimized CRC64-NVMe implementation, delivering nearly a 6x improvement on modern Arm SoCs.
Open-source developer Demian Shulhan added this NEON-optimized CRC64 implementation, similar to the other architecture-specific CRC64 implementations such as those for x86_64 and RISC-V. The intent of this CRC64 speed-up is to benefit NVMe and other storage devices by addressing this bottleneck.
Shulhan explained in the patch that the nearly 6x gain was measured on an Arm Cortex-A72 SoC. He wrote:
“Implement an optimized CRC64 (NVMe) algorithm for ARM64 using NEON Polynomial Multiply Long (PMULL) instructions. The generic shift-and-XOR software implementation is slow, which creates a bottleneck in NVMe and other storage subsystems.
The acceleration is implemented using C intrinsics (arm_neon.h) rather than raw assembly for better readability and maintainability.
Key highlights of this implementation:
– Uses 4KB chunking inside scoped_ksimd() to avoid preemption latency spikes on large buffers.
– Pre-calculates and loads fold constants via vld1q_u64() to minimize register spilling.
– Benchmarks show the break-even point against the generic implementation is around 128 bytes. The PMULL path is enabled only for len >= 128.
– Safely falls back to the generic implementation on Big-Endian systems.

Performance results (kunit crc_benchmark on Cortex-A72):
– Generic (len=4096): ~268 MB/s
– PMULL (len=4096): ~1556 MB/s (nearly 6x improvement)”
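For reference, the "generic shift-and-XOR" baseline that the PMULL path is replacing can be sketched as a small standalone C program. The function names and the 4 KB chunked wrapper below are illustrative only, not the kernel's actual API; the kernel's scoped_ksimd() chunking is about bounding kernel-mode SIMD sections, which the plain C wrapper here only mimics structurally. The CRC parameters (polynomial 0xad93d23594c93659, reflected input/output, all-ones init and final XOR) follow the published CRC-64/NVME definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-reflected form of the CRC64-NVMe polynomial 0xad93d23594c93659. */
#define CRC64_NVME_POLY 0x9a6c9329ac4bc9b5ULL

/* Generic shift-and-XOR update step: one bit per inner iteration.
 * 'crc' is the running remainder, so calls can be chained. */
static uint64_t crc64_nvme_update(uint64_t crc, const uint8_t *buf, size_t len)
{
	while (len--) {
		crc ^= *buf++;	/* reflected input: bytes enter LSB-first */
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 1) ? (crc >> 1) ^ CRC64_NVME_POLY
					: (crc >> 1);
	}
	return crc;
}

/* One-shot CRC64-NVMe: initial value and final XOR are both all-ones. */
static uint64_t crc64_nvme(const void *data, size_t len)
{
	return ~crc64_nvme_update(~0ULL, data, len);
}

/* Chunked variant: feeds the buffer through in 4 KB slices, structurally
 * mirroring the patch's bounded-work-per-SIMD-section idea. The result is
 * identical to the one-shot call because CRC updates chain cleanly. */
static uint64_t crc64_nvme_chunked(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint64_t crc = ~0ULL;

	while (len) {
		size_t n = len > 4096 ? 4096 : len;

		crc = crc64_nvme_update(crc, p, n);
		p += n;
		len -= n;
	}
	return ~crc;
}
```

The inner loop's eight shift-and-XOR steps per byte are exactly the work that PMULL collapses by multiplying 128-bit chunks against precomputed fold constants, which is where the roughly 6x throughput gain comes from.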
It’s somewhat surprising that it took until now for an ARM64/NEON-optimized CRC64 implementation to land in the Linux kernel, especially given that it weighs in at just over one hundred lines of code.
The patch is now out for review on the Linux kernel mailing list.
