Re: [RFC PATCH 0/6] KVM: x86: async PF user
From: Nikita Kalyazin <hidden>
Date: 2025-02-12 18:14:22
Also in:
kvm, linux-doc, lkml
On 11/02/2025 21:17, Sean Christopherson wrote:
On Mon, Nov 18, 2024, Nikita Kalyazin wrote:quoted
Async PF [1] allows to run other processes on a vCPU while the host handles a stage-2 fault caused by a process on that vCPU. When using VM-exit-based stage-2 fault handling [2], async PF functionality is lost because KVM does not run the vCPU while a fault is being handled so no other process can execute on the vCPU. This patch series extends VM-exit-based stage-2 fault handling with async PF support by letting userspace handle faults instead of the kernel, hence the "async PF user" name. I circulated the idea with Paolo, Sean, David H, and James H at the LPC, and the only concern I heard was about injecting the "page not present" event via #PF exception in the CoCo case, where it may not work. In my implementation, I reused the existing code for doing that, so the async PF user implementation is on par with the present async PF implementation in this regard, and support for the CoCo case can be added separately. Please note that this series is applied on top of the VM-exit-based stage-2 fault handling RFC [2]....quoted
Nikita Kalyazin (6): Documentation: KVM: add userfault KVM exit flag Documentation: KVM: add async pf user doc KVM: x86: add async ioctl support KVM: trace events: add type argument to async pf KVM: x86: async_pf_user: add infrastructure KVM: x86: async_pf_user: hook to fault handling and add ioctl Documentation/virt/kvm/api.rst | 35 ++++++ arch/x86/include/asm/kvm_host.h | 12 +- arch/x86/kvm/Kconfig | 7 ++ arch/x86/kvm/lapic.c | 2 + arch/x86/kvm/mmu/mmu.c | 68 ++++++++++- arch/x86/kvm/x86.c | 101 +++++++++++++++- arch/x86/kvm/x86.h | 2 + include/linux/kvm_host.h | 30 +++++ include/linux/kvm_types.h | 1 + include/trace/events/kvm.h | 50 +++++--- include/uapi/linux/kvm.h | 12 +- virt/kvm/Kconfig | 3 + virt/kvm/Makefile.kvm | 1 + virt/kvm/async_pf.c | 2 +- virt/kvm/async_pf_user.c | 197 ++++++++++++++++++++++++++++++++ virt/kvm/async_pf_user.h | 24 ++++ virt/kvm/kvm_main.c | 14 +++ 17 files changed, 535 insertions(+), 26 deletions(-)I am supportive of the idea, but there is way too much copy+paste in this series.
Hi Sean, Yes, like I mentioned in the cover letter, I left the new implementation isolated on purpose to make the scope of the change clear. There is certainly lots of duplication that should be removed later on.
And it's not just the code itself, it's all the structures and concepts. Off the top of my head, I can't think of any reason there needs to be a separate queue, separate lock(s), etc. The only difference between kernel APF and user APF is what chunk of code is responsible for faulting in the page.
There are two queues involved: - "queue": stores in-flight faults. APF-kernel uses it to cancel all works if needed. APF-user does not have a way to "cancel" userspace works, but it uses the queue to look up the struct by the token when userspace reports a completion. - "ready": stores completed faults until KVM finds a chance to tell guest about them. I agree that the "ready" queue can be shared between APF-kernel and -user as it's used in the same way. As for the "queue" queue, do you think it's ok to process its elements differently based on the "type" of them in a single loop [1] instead of having two separate queues? [1] https://elixir.bootlin.com/linux/v6.13.2/source/virt/kvm/async_pf.c#L120
I suspect a good place to start would be something along the lines of the below diff, and go from there. Given that KVM already needs to special case the fake "wake all" items, I'm guessing it won't be terribly difficult to teach the core flows about userspace async #PF.
That sounds sensible. I can certainly approach it in a "bottom up" way by sparingly adding handling where it's different in APF-user rather than adding it side by side and trying to merge common parts.
I'm also not sure that injecting async #PF for all userfaults is desirable. For in-kernel async #PF, KVM knows that faulting in the memory would sleep. For userfaults, KVM has no way of knowing if the userfault will sleep, i.e. should be handled via async #PF. The obvious answer is to have userspace only enable userspace async #PF when it's useful, but "an all or nothing" approach isn't great uAPI. On the flip side, adding uAPI for a use case that doesn't exist doesn't make sense either :-/
I wasn't able to locate the code that would check whether faulting would sleep in APF-kernel. KVM spins APF-kernel whenever it can ([2]). Please let me know if I'm missing something here. [2] https://elixir.bootlin.com/linux/v6.13.2/source/arch/x86/kvm/mmu/mmu.c#L4360
Exiting to userspace in vCPU context is also kludgy. It makes sense for base userfault, because the vCPU can't make forward progress until the fault is resolved. Actually, I'm not even sure it makes sense there. I'll follow-up in
Even though we exit to userspace, in case of APF-user, userspace is supposed to VM enter straight after scheduling the async job, which is then executed concurrently with the vCPU.
James' series. Anyways, it definitely doesn't make sense for async #PF, because the whole point is to let the vCPU run. Signalling userspace would definitely add complexity, but only because of the need to communicate the token and wait for userspace to consume said token. I'll think more on that.
By signalling userspace you mean a new non-exit-to-userspace mechanism similar to UFFD? What advantage can you see in it over exiting to userspace (which already exists in James's series)? Thanks, Nikita
quoted hunk ↗ jump to hunk
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c index 0ee4816b079a..fc31b47cf9c5 100644 --- a/virt/kvm/async_pf.c +++ b/virt/kvm/async_pf.c@@ -177,7 +177,8 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu) * success, 'false' on failure (page fault has to be handled synchronously). */ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, - unsigned long hva, struct kvm_arch_async_pf *arch) + unsigned long hva, struct kvm_arch_async_pf *arch, + bool userfault) { struct kvm_async_pf *work;@@ -202,13 +203,16 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, work->addr = hva; work->arch = *arch; - INIT_WORK(&work->work, async_pf_execute); - list_add_tail(&work->queue, &vcpu->async_pf.queue); vcpu->async_pf.queued++; work->notpresent_injected = kvm_arch_async_page_not_present(vcpu, work); - schedule_work(&work->work); + if (userfault) { + work->userfault = true; + } else { + INIT_WORK(&work->work, async_pf_execute); + schedule_work(&work->work); + } return true; }