Linus Torvalds is generally wary of new kernel options being enabled by default, but one area where doing so has proven beneficial is the architecture-optimized crypto algorithm implementations, which those configuring their own kernel builds often overlook. Some will enable support for a kernel crypto algorithm only to forget, or be unaware, that CPU architecture-specific implementations also exist and typically deliver much better performance than the common code. Google engineer Eric Biggers has been cleaning this up, and BLAKE2s is the latest algorithm to receive the treatment.
Eric Biggers, known for his relentless kernel crypto optimizations over the years, sent out a new patch series on Wednesday to clean up the ChaCha and BLAKE2s code. The series also enables the architecture-optimized BLAKE2s code by default, similar to the process other crypto algorithms have gone through.
The most interesting patch in the series is the one that always enables the arch-optimized BLAKE2s code. There he argues:
“When support for a crypto algorithm is enabled, the arch-optimized implementation of that algorithm should be enabled too. We’ve learned this the hard way many times over the years: people regularly forget to enable the arch-optimized implementations of the crypto algorithms, resulting in significant performance being left on the table.
Currently, BLAKE2s support is always enabled (‘obj-y’), since random.c uses it. Therefore, the arch-optimized BLAKE2s code, which exists for ARM and x86_64, should be always enabled too. Let’s do that.
Note that the effect on kernel image size is very small and should not be a concern. On ARM, enabling CRYPTO_BLAKE2S_ARM actually *shrinks* the kernel size by about 1200 bytes, since the ARM-optimized blake2s_compress() completely replaces the generic blake2s_compress(). On x86_64, enabling CRYPTO_BLAKE2S_X86 increases the kernel size by about 1400 bytes, as the generic blake2s_compress() is still included as a fallback; however, for context, that is only about a quarter the size of the generic blake2s_compress(). The x86_64 optimized BLAKE2s code uses much less icache at runtime than the generic code.”
On x86_64, the optimized BLAKE2s code can use SSSE3 and AVX-512 instructions for faster BLAKE2s cryptographic hashing.
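To illustrate why the generic code remains as a fallback on x86_64, here is a minimal user-space sketch of runtime selection between a generic, an SSSE3, and an AVX-512 code path. It is not the kernel's code: the enum, struct, and function names are hypothetical, the `simd_usable` flag is an assumed stand-in for contexts where SIMD registers cannot be used, and treating "AVX-512" as AVX-512F plus AVX-512VL is likewise an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three code paths; not kernel identifiers. */
enum blake2s_impl { IMPL_GENERIC = 0, IMPL_SSSE3, IMPL_AVX512, IMPL_COUNT };

/* CPU features relevant to the choice, gathered once at startup. */
struct cpu_features {
    bool ssse3;
    bool avx512; /* assumed here to mean AVX-512F plus AVX-512VL */
};

/*
 * Prefer the widest supported SIMD path. Fall back to the portable C code
 * when the CPU lacks SSSE3 or when SIMD is not usable in the current
 * context (simd_usable is an assumption of this sketch).
 */
static enum blake2s_impl select_blake2s_impl(struct cpu_features f,
                                             bool simd_usable)
{
    if (!simd_usable || !f.ssse3)
        return IMPL_GENERIC;
    if (f.avx512)
        return IMPL_AVX512;
    return IMPL_SSSE3;
}

/* Human-readable name for a code path, e.g. for logging. */
static const char *blake2s_impl_name(enum blake2s_impl impl)
{
    static const char *const names[IMPL_COUNT] = {
        "generic", "ssse3", "avx512"
    };
    assert((unsigned)impl < IMPL_COUNT);
    return names[impl];
}

/* Query the running CPU using GCC/Clang builtins; reports no features
 * on other compilers or architectures, which selects the generic path. */
static struct cpu_features detect_cpu_features(void)
{
    struct cpu_features f = { false, false };
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    f.ssse3 = __builtin_cpu_supports("ssse3") != 0;
    f.avx512 = __builtin_cpu_supports("avx512f") &&
               __builtin_cpu_supports("avx512vl");
#endif
    return f;
}
```

A caller would run `detect_cpu_features()` once, pass the result to `select_blake2s_impl()`, and route each compression call to the chosen path. On ARM, by contrast, the quoted patch message says the optimized `blake2s_compress()` replaces the generic one outright, so no such fallback is needed there.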