A patch posted today to the Linux kernel mailing list provides an ARM64-optimized CRC64-NVMe implementation, delivering nearly a 6x improvement on modern Arm SoCs.
Open-source developer Demian Shulhan added this NEON-optimized CRC64 implementation, similar to the other architecture-specific CRC64 implementations such as those for x86_64 and RISC-V. The intent of this CRC64 speed-up is to benefit NVMe and other storage devices by addressing this bottleneck.
Shulhan explained in the patch that the nearly 6x gain was measured on an Arm Cortex-A72 SoC. He wrote:
“Implement an optimized CRC64 (NVMe) algorithm for ARM64 using NEON Polynomial Multiply Long (PMULL) instructions. The generic shift-and-XOR software implementation is slow, which creates a bottleneck in NVMe and other storage subsystems.
The acceleration is implemented using C intrinsics (arm_neon.h) rather than raw assembly for better readability and maintainability.
Key highlights of this implementation:
– Uses 4KB chunking inside scoped_ksimd() to avoid preemption latency spikes on large buffers.
– Pre-calculates and loads fold constants via vld1q_u64() to minimize register spilling.
– Benchmarks show the break-even point against the generic implementation is around 128 bytes. The PMULL path is enabled only for len >= 128.
– Safely falls back to the generic implementation on Big-Endian systems.

Performance results (kunit crc_benchmark on Cortex-A72):
– Generic (len=4096): ~268 MB/s
– PMULL (len=4096): ~1556 MB/s (nearly 6x improvement)”
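For reference, the "generic shift-and-XOR" baseline that the PMULL path is replacing can be sketched as a small standalone C program. The function names and the 4 KB chunked wrapper below are illustrative only, not the kernel's actual API; the kernel's scoped_ksimd() chunking is about bounding kernel-mode SIMD sections, which the plain C wrapper here only mimics structurally. The CRC parameters (polynomial 0xad93d23594c93659, reflected input/output, all-ones init and final XOR) follow the published CRC-64/NVME definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-reflected form of the CRC64-NVMe polynomial 0xad93d23594c93659. */
#define CRC64_NVME_POLY 0x9a6c9329ac4bc9b5ULL

/* Generic shift-and-XOR update step: one bit per inner iteration.
 * 'crc' is the running remainder, so calls can be chained. */
static uint64_t crc64_nvme_update(uint64_t crc, const uint8_t *buf, size_t len)
{
	while (len--) {
		crc ^= *buf++;	/* reflected input: bytes enter LSB-first */
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 1) ? (crc >> 1) ^ CRC64_NVME_POLY
					: (crc >> 1);
	}
	return crc;
}

/* One-shot CRC64-NVMe: initial value and final XOR are both all-ones. */
static uint64_t crc64_nvme(const void *data, size_t len)
{
	return ~crc64_nvme_update(~0ULL, data, len);
}

/* Chunked variant: feeds the buffer through in 4 KB slices, structurally
 * mirroring the patch's bounded-work-per-SIMD-section idea. The result is
 * identical to the one-shot call because CRC updates chain cleanly. */
static uint64_t crc64_nvme_chunked(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint64_t crc = ~0ULL;

	while (len) {
		size_t n = len > 4096 ? 4096 : len;

		crc = crc64_nvme_update(crc, p, n);
		p += n;
		len -= n;
	}
	return ~crc;
}
```

The inner loop's eight shift-and-XOR steps per byte are exactly the work that PMULL collapses by multiplying 128-bit chunks against precomputed fold constants, which is where the roughly 6x throughput gain comes from.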
It’s somewhat surprising that it took until now for an ARM64/NEON-optimized CRC64 implementation to land in the Linux kernel, especially given that it weighs in at just over one hundred lines of code.
The patch is now out for review on the Linux kernel mailing list.
