It’s not too often that “fixes” to the Kernel-based Virtual Machine (KVM) are noteworthy, but today is an interesting exception: among the KVM fixes sent in ahead of the Linux 6.13-rc3 tagging is one that begins to deal with a “hilarious/revolting” performance regression affecting recent generations of Intel processors. This performance regression won’t be fully worked around until Linux 6.14, but at least there is an interim step in place once the code is merged later today.
Catching my attention among this morning’s KVM changes for Linux 6.13-rc3 is this lone KVM x86-related change:
“Cache CPUID.0xD XSTATE offsets+sizes during module init – On Intel’s Emerald Rapids CPUID costs hundreds of cycles and there are a lot of leaves under 0xD. Getting rid of the CPUIDs during nested VM-Enter and VM-Exit is planned for the next release, for now just cache them: even on Skylake that is 40% faster.”
Okay, this is intriguing… Resorting to XSTATE caching to deal with CPUID being much more costly on the newer Intel Xeon Emerald Rapids CPUs than on prior processors.
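For a rough idea of what the interim change does, here is a minimal user-space C sketch of the caching pattern: enumerate the CPUID.0xD sub-leaves once at init, store the reported offsets and sizes in static tables, and answer later size queries from the cache instead of re-executing CPUID. The names and the simplified size calculation below are my own illustration, not the actual kvm.ko code.

#include <stdio.h>
#include <stdint.h>
#include <cpuid.h>

#define XFEATURE_MAX 19	/* assumed upper bound on XSTATE feature bits */

static uint32_t xstate_size[XFEATURE_MAX];
static uint32_t xstate_offset[XFEATURE_MAX];

/* Read CPUID.(EAX=0xD, ECX=i) once per sub-leaf and cache the results. */
static void cache_xstate_layout(void)
{
	unsigned int eax, ebx, ecx, edx;

	for (unsigned int i = 2; i < XFEATURE_MAX; i++) {
		if (!__get_cpuid_count(0xD, i, &eax, &ebx, &ecx, &edx))
			continue;
		xstate_size[i]   = eax;	/* EAX: size in bytes of this state component */
		xstate_offset[i] = ebx;	/* EBX: offset in the standard XSAVE layout */
	}
}

/* Standard (non-compacted) XSAVE area size for a given XCR0 value,
 * answered purely from the cached tables. */
static uint32_t xstate_required_size_cached(uint64_t xcr0)
{
	uint32_t size = 576;	/* 512-byte legacy region + 64-byte XSAVE header */

	for (unsigned int i = 2; i < XFEATURE_MAX; i++) {
		if (xcr0 & (1ULL << i))
			size = xstate_offset[i] + xstate_size[i];
	}
	return size;
}

int main(void)
{
	cache_xstate_layout();
	/* Example: x87 + SSE + AVX (bits 0, 1, 2) */
	printf("XSAVE size for xcr0=0x7: %u bytes\n", xstate_required_size_cached(0x7));
	return 0;
}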
Sean Christopherson of Google is the one who has been investigating and taking on this higher cost with newer Intel Xeon processors. In a patch series this week he aims to address the xstate_required_size() performance regression. He sums it up in the patch cover letter:
“Fix a hilarious/revolting performance regression (relative to older CPU generations) in xstate_required_size() that pops up due to CPUID _in the host_ taking 3x-4x longer on Emerald Rapids than Skylake.
The issue rears its head on nested virtualization transitions, as KVM (unnecessarily) performs runtime CPUID updates, including XSAVE sizes, multiple times per transition. And calculating XSAVE sizes, especially for vCPUs with a decent number of supported XSAVE features and compacted format support, can add up to thousands of cycles.
To fix the immediate issue, cache the CPUID output at kvm.ko load. The information is static for a given CPU, i.e. doesn’t need to be re-read from hardware every time. That’s patch 1, and eliminates pretty much all of the meaningful overhead.
Patch 2 is a minor cleanup to try and make the code easier to read.
Patch 3 fixes a wart in CPUID emulation where KVM does a moderately expensive (though cheap compared to CPUID, lol) MSR lookup that is likely unnecessary for the vast majority of VMs.
Patches 4 and 5 address the problem of KVM doing runtime CPUID updates multiple times for each nested VM-Enter and VM-Exit, at least half of which are completely unnecessary (CPUID is a mandatory intercept on both Intel and AMD, so ensuring dynamic CPUID bits are up-to-date while running L2 is pointless). The idea is fairly simple: lazily do the CPUID updates by deferring them until something might actually consume the relevant guest bits.
…
That said, patch 1, which is the most important and tagged for stable, applies cleanly on 6.1, 6.6, and 6.12 (and the backport for 5.15 and earlier shouldn’t be too horrific).

Side topic, I can’t help but wonder if the CPUID latency on EMR is a CPU or ucode bug. For a number of leaves, KVM can emulate CPUID faster than the CPU can execute the instruction. I.e. the entire VM-Exit => emulate => VM-Enter sequence takes less time than executing CPUID on bare metal. Which seems absolutely insane. But, it shouldn’t impact guest performance, so that’s someone else’s problem, at least for now.”
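As a conceptual illustration of the deferred-update idea described in patches 4 and 5, a dirty-flag pattern along the lines of the sketch below would avoid recomputing dynamic CPUID state on every nested transition. The structure and names are hypothetical and only show the pattern, not KVM’s actual implementation.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical vCPU state: only the pieces needed to show the pattern. */
struct vcpu_state {
	bool cpuid_dirty;	/* set when dynamic CPUID bits may have changed */
	unsigned int xsave_size;	/* example of cached, expensive-to-compute state */
};

/* Expensive recomputation (e.g. XSAVE size calculation) happens only here. */
static void recompute_runtime_cpuid(struct vcpu_state *v)
{
	v->xsave_size = 2688;	/* placeholder for the real calculation */
	v->cpuid_dirty = false;
}

/* Nested VM-Enter/VM-Exit path: cheap, just flag the cached state as stale. */
static void mark_cpuid_dirty(struct vcpu_state *v)
{
	v->cpuid_dirty = true;
}

/* CPUID intercept (or any other consumer): recompute only if actually needed. */
static unsigned int get_xsave_size(struct vcpu_state *v)
{
	if (v->cpuid_dirty)
		recompute_runtime_cpuid(v);
	return v->xsave_size;
}

int main(void)
{
	struct vcpu_state v = { .cpuid_dirty = true };

	mark_cpuid_dirty(&v);	/* many transitions can occur... */
	mark_cpuid_dirty(&v);
	printf("size: %u\n", get_xsave_size(&v));	/* ...but only one recompute */
	return 0;
}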
That patch series in full is expected to land in the next cycle, Linux 6.14, while for Linux 6.13 the immediate “fix” is the caching approach. The patch going to the mainline kernel today further sums up the situation and the higher costs for Intel CPUs beginning with Emerald Rapids:
“On Intel’s Emerald Rapids, CPUID is *wildly* expensive, to the point where recomputing XSAVE offsets and sizes results in a 4x increase in latency of nested VM-Enter and VM-Exit (nested transitions can trigger xstate_required_size() multiple times per transition), relative to using cached values. The issue is easily visible by running `perf top` while triggering nested transitions: kvm_update_cpuid_runtime() shows up at a whopping 50%.
As measured via RDTSC from L2 (using KVM-Unit-Test’s CPUID VM-Exit test and a slightly modified L1 KVM to handle CPUID in the fastpath), a nested roundtrip to emulate CPUID on Skylake (SKX), Icelake (ICX), and Emerald Rapids (EMR) takes:
SKX 11650
ICX 22350
EMR 28850

Using cached values, the latency drops to:

SKX 6850
ICX 9000
EMR 7900

The underlying issue is that CPUID itself is slow on ICX, and comically slow on EMR. The problem is exacerbated on CPUs which support XSAVES and/or XSAVEC, as KVM invokes xstate_required_size() twice on each runtime CPUID update, and because there are more supported XSAVE features (CPUID for supported XSAVE feature sub-leafs is significantly slower).”
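For readers curious how such cycle counts can be gathered, below is a rough user-space sketch that times CPUID leaf 0xD with RDTSC/RDTSCP on bare metal. The quoted figures were measured from inside L2 with KVM-Unit-Tests and a modified L1 KVM; this simplified harness only shows the bare-metal flavor of the measurement and is my own assumption, not the actual test.

#include <stdio.h>
#include <cpuid.h>
#include <x86intrin.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx, aux;
	unsigned long long start, end, total = 0;
	const int iters = 100000;

	for (int i = 0; i < iters; i++) {
		start = __rdtsc();
		/* CPUID.(EAX=0xD, ECX=0): main XSAVE enumeration leaf */
		__get_cpuid_count(0xD, 0, &eax, &ebx, &ecx, &edx);
		end = __rdtscp(&aux);	/* RDTSCP waits for prior instructions to retire */
		total += end - start;
	}
	printf("average CPUID(0xD) cost: %llu cycles\n", total / iters);
	return 0;
}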
With Intel Xeon Emerald Rapids having launched one year ago, it’s surprising that it has taken until now to work on a solution to these higher costs for the widely-used KVM, an important piece of the Linux open-source virtualization stack. More surprising still is that it was Google engineers who got to the bottom of this for better handling within Google Cloud, rather than the work coming from Intel. There is no word in the patches on whether the expensive CPUID behavior is also observed with the newest Sierra Forest and Granite Rapids processors.