AWS engineers have been working on Linux kernel improvements to KVM's VMX code to enhance the handling of unmanaged guest memory with nested virtual machines. The improved code addresses some correctness issues as well as delivering dramatic performance improvements in a synthetic benchmark.
On Friday, Amazon/AWS engineer Fred Griffoul sent out the latest patches to the KVM nVMX code for improving the performance of unmanaged guest memory. Fred explained the issues with the current code and the improvements being made:
“This patch series addresses both performance and correctness issues in nested VMX when handling guest memory.
During nested VMX operations, L0 (KVM) accesses specific L1 guest pages to manage L2 execution. These pages fall into two categories: pages accessed only by L0 (such as the L1 MSR bitmap page or the eVMCS page), and pages passed to the L2 guest via vmcs02 (such as APIC access, virtual APIC, and posted interrupt descriptor pages).
The current implementation uses kvm_vcpu_map/unmap, which causes two issues.
First, the current approach is missing proper invalidation handling in critical scenarios. Enlightened VMCS (eVMCS) pages can become stale when memslots are modified, as there is no mechanism to invalidate the cached mappings. Similarly, APIC access and virtual APIC pages can be migrated by the host, but without proper notification through mmu_notifier callbacks, the mappings become invalid and can lead to incorrect behavior.
Second, for unmanaged guest memory (memory not directly mapped by the kernel, such as memory passed with the mem= parameter or guest_memfd for non-CoCo VMs), this workflow invokes expensive memremap/memunmap operations on every L2 VM entry/exit cycle. This creates significant overhead that impacts nested virtualization performance.
This series replaces kvm_host_map with gfn_to_pfn_cache in nested VMX. The pfncache infrastructure maintains persistent mappings as long as the page GPA does not change, eliminating the memremap/memunmap overhead on every VM entry/exit cycle. Additionally, pfncache provides proper invalidation handling via mmu_notifier callbacks and memslots generation check, ensuring that mappings are correctly updated during both memslot updates and page migration events.”
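For those curious what that shift looks like at the code level, below is a minimal sketch of the check/refresh pattern that gfn_to_pfn_cache users follow in KVM, modeled on the existing pfncache consumers in the tree (such as the Xen interface code). The kvm_gpc_* helpers come from include/linux/kvm_host.h, though their exact signatures have varied across kernel versions; the wrapper functions, their names, and the choice of the virtual APIC page as the example are purely illustrative and are not taken from the actual patch series.

/*
 * Illustrative sketch: caching an L1 guest page with gfn_to_pfn_cache
 * instead of calling kvm_vcpu_map()/kvm_vcpu_unmap() on every L2
 * entry/exit. Helper names and the vAPIC example are hypothetical.
 */
#include <linux/kvm_host.h>

/*
 * One-time setup (e.g. when vmcs12 is loaded): initialize the cache and
 * bind it to the L1 GPA of interest, here hypothetically the virtual
 * APIC page.
 */
static int nested_cache_vapic_page(struct kvm_vcpu *vcpu, gpa_t vapic_gpa,
				   struct gfn_to_pfn_cache *gpc)
{
	kvm_gpc_init(gpc, vcpu->kvm);
	return kvm_gpc_activate(gpc, vapic_gpa, PAGE_SIZE);
}

/*
 * Fast path on every L2 VM entry: validate the cached mapping instead of
 * remapping. kvm_gpc_check() fails if an mmu_notifier invalidation or a
 * memslot generation change has stale-marked the cache, in which case
 * kvm_gpc_refresh() re-resolves the GPA->HVA->PFN translation.
 */
static void *nested_get_vapic_mapping(struct gfn_to_pfn_cache *gpc)
{
	read_lock(&gpc->lock);
	while (!kvm_gpc_check(gpc, PAGE_SIZE)) {
		read_unlock(&gpc->lock);
		if (kvm_gpc_refresh(gpc, PAGE_SIZE))
			return NULL;	/* GPA no longer backed; bail out */
		read_lock(&gpc->lock);
	}
	/*
	 * gpc->khva is now a valid kernel mapping of the guest page; the
	 * caller uses it under gpc->lock and drops the lock when done.
	 */
	return gpc->khva;
}

The key point of the pattern is that the expensive mapping work (memremap for unmanaged memory) only happens inside kvm_gpc_refresh() when the cache has actually been invalidated; in the common case the VM entry path merely takes a read lock and rechecks validity, which is what eliminates the per-entry/exit memremap/memunmap overhead described above.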
The end result is pretty wild, with a synthetic micro-benchmark used to demonstrate the improvement for nested VMX operations with unmanaged guest memory. Running on AWS EC2 Nitro instances, the micro-benchmark showed memory map performance being around 17x faster, chunked unmap being ~2014x faster, and plain unmap being ~2353x faster!
Those interested can find this pending patch series on the Linux kernel mailing list.
