Merged today for the widely-used FFmpeg open-source multimedia library was yet another AVX-512 optimized code path… Compared to the pure C code, the AVX2 code path was 10.98x faster while this new AVX-512 code path clocks in at 18x the performance of the common C code.
The latest FFmpeg code seeing the AVX-512 treatment is the uyvytoyuv422 function for UYVY to YUV422 format conversion. The AVX-512 optimized code path, written in hand-tuned Assembly, is a great benefit here. AVX-512 is found mainly on Intel Xeon processors and on all AMD Ryzen and EPYC processors since Zen 4. The benchmarks posted for this patch were carried out on an AMD Ryzen 9 7950X.
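For readers unfamiliar with the conversion, here is a minimal scalar sketch of what splitting packed UYVY into planar YUV 4:2:2 involves. It is not FFmpeg's code: the function name `uyvy_row_to_planar` and its signature are hypothetical, and it handles a single row only. It assumes the usual UYVY byte order, U0 Y0 V0 Y1, where each pair of pixels shares one U and one V sample.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference, not FFmpeg's actual implementation.
 * Splits one row of packed UYVY (U0 Y0 V0 Y1 ...) into planar Y, U and V.
 * width is the number of pixels in the row. */
static void uyvy_row_to_planar(const uint8_t *src, uint8_t *y,
                               uint8_t *u, uint8_t *v, size_t width)
{
    assert(width % 2 == 0); /* 4:2:2 stores pixels in pairs */

    for (size_t i = 0; i < width / 2; i++) {
        u[i]         = src[4 * i + 0]; /* shared chroma U */
        y[2 * i + 0] = src[4 * i + 1]; /* luma of first pixel */
        v[i]         = src[4 * i + 2]; /* shared chroma V */
        y[2 * i + 1] = src[4 * i + 3]; /* luma of second pixel */
    }
}
```

A byte-at-a-time loop like this is the kind of "pure C" baseline that SIMD code paths are measured against, since a vector unit can shuffle dozens of these bytes per instruction.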
The gains are substantial: the AVX-512 code path hits 18.02x the performance of the common C path, while the AVX2-only path manages 10.98x. That makes the new path roughly 1.64x faster than the AVX2 one in this benchmark.
Shreesh Adiga, who authored the patch, explained:
“The scalar loop is replaced with masked AVX512 instructions. For extracting the Y from UYVY, vperm2b is used instead of various AND and packuswb.
Instead of loading the vectors with interleaved lanes as done in AVX2 version, normal load is used. At the end of packuswb, for U and V, an extra permute operation is done to get the required layout.”
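To illustrate the permute idea from the quote, here is a portable scalar model of a two-source byte permute pulling the Y bytes out of 128 bytes of UYVY. It is a sketch under assumptions, not the patch's Assembly: the helper names `permute2_bytes` and `extract_y` are hypothetical, and the index semantics assumed are those of the AVX-512 VBMI two-source byte permutes (VPERMI2B/VPERMT2B), where bit 6 of each index selects the source vector and bits 0 to 5 select the byte.

```c
#include <stdint.h>

#define LANES 64 /* bytes in one 512-bit vector */

/* Scalar model of a two-source byte permute (assumed semantics):
 * bit 6 of each index picks source a or b, bits 0-5 pick the byte. */
static void permute2_bytes(const uint8_t a[LANES], const uint8_t b[LANES],
                           const uint8_t idx[LANES], uint8_t out[LANES])
{
    for (int i = 0; i < LANES; i++) {
        const uint8_t *table = (idx[i] & 64) ? b : a;
        out[i] = table[idx[i] & 63];
    }
}

/* Gather the 64 Y bytes, which sit at odd offsets, from 128 bytes
 * of UYVY held in two consecutive "vectors" a and b. */
static void extract_y(const uint8_t a[LANES], const uint8_t b[LANES],
                      uint8_t y[LANES])
{
    uint8_t idx[LANES];

    for (int i = 0; i < LANES; i++)
        idx[i] = (uint8_t)(2 * i + 1); /* 1, 3, 5, ..., 127 */

    permute2_bytes(a, b, idx, y);
}
```

On real hardware the whole gather is one instruction with a constant index vector, which is why it can stand in for the AND and pack sequence the quote mentions.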
A nice win for the next FFmpeg release, assuming your CPU supports AVX-512. That is especially true on AMD Zen 4, and even more so on Zen 5, where AVX-512 makes a strong showing across AMD's entire CPU product stack.