Re: [RFC PATCH 0/6] KVM: x86: async PF user
From: Nikita Kalyazin <hidden>
Date: 2025-02-20 18:29:15
Also in:
kvm, linux-doc, lkml
On 19/02/2025 15:17, Sean Christopherson wrote:
> On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
>> On 11/02/2025 21:17, Sean Christopherson wrote:
>>> On Mon, Nov 18, 2024, Nikita Kalyazin wrote:
>>> And it's not just the code itself, it's all the structures and
>>> concepts. Off the top of my head, I can't think of any reason there
>>> needs to be a separate queue, separate lock(s), etc. The only
>>> difference between kernel APF and user APF is what chunk of code is
>>> responsible for faulting in the page.
>>
>> There are two queues involved:
>>  - "queue": stores in-flight faults. APF-kernel uses it to cancel all
>>    works if needed. APF-user does not have a way to "cancel" userspace
>>    works, but it uses the queue to look up the struct by the token
>>    when userspace reports a completion.
>>  - "ready": stores completed faults until KVM finds a chance to tell
>>    guest about them.
>>
>> I agree that the "ready" queue can be shared between APF-kernel and
>> -user as it's used in the same way. As for the "queue" queue, do you
>> think it's ok to process its elements differently based on the "type"
>> of them in a single loop [1] instead of having two separate queues?
>
> Yes.
>
>> [1] https://elixir.bootlin.com/linux/v6.13.2/source/virt/kvm/async_pf.c#L120
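To make the "single loop, branch on type" idea concrete, here is a minimal userspace sketch. It is not the kernel's data structure: `enum apf_type`, `struct apf_item`, `struct apf_queue` and the `apf_*` helpers are hypothetical names invented for illustration. It only shows one in-flight queue that both flavours share, with the token lookup used on completion and a drain loop that cancels kernel items and merely drops user items.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type tag; not taken from the kernel sources. */
enum apf_type { APF_KERNEL, APF_USER };

struct apf_item {
	enum apf_type type;
	uint32_t token;
	int cancelled;          /* set when a kernel work item is cancelled */
	struct apf_item *next;
};

/* One shared in-flight queue for both kernel and user items. */
struct apf_queue {
	struct apf_item *head;
};

static void apf_enqueue(struct apf_queue *q, struct apf_item *item)
{
	assert(q != NULL && item != NULL);
	item->next = q->head;
	q->head = item;
}

/* Look up an in-flight item by token, as needed when a completion is reported. */
static struct apf_item *apf_find(struct apf_queue *q, uint32_t token)
{
	for (struct apf_item *it = q->head; it; it = it->next)
		if (it->token == token)
			return it;
	return NULL;
}

/*
 * Drain the queue in a single loop, branching on the item type:
 * kernel items are marked cancelled, user items are only unlinked
 * because there is nothing to cancel on the userspace side.
 * Returns the number of kernel items cancelled.
 */
static int apf_clear_queue(struct apf_queue *q)
{
	int cancelled = 0;

	while (q->head) {
		struct apf_item *it = q->head;

		q->head = it->next;
		it->next = NULL;
		if (it->type == APF_KERNEL) {
			it->cancelled = 1;
			cancelled++;
		}
	}
	return cancelled;
}
```

The point of the sketch is that the type check is confined to the one place where the two flavours actually differ, while enqueue and lookup stay common.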
>>> I suspect a good place to start would be something along the lines of
>>> the below diff, and go from there. Given that KVM already needs to
>>> special case the fake "wake all" items, I'm guessing it won't be
>>> terribly difficult to teach the core flows about userspace async #PF.
>>
>> That sounds sensible. I can certainly approach it in a "bottom up" way
>> by sparingly adding handling where it's different in APF-user rather
>> than adding it side by side and trying to merge common parts.
>>> I'm also not sure that injecting async #PF for all userfaults is
>>> desirable. For in-kernel async #PF, KVM knows that faulting in the
>>> memory would sleep. For userfaults, KVM has no way of knowing if the
>>> userfault will sleep, i.e. should be handled via async #PF. The
>>> obvious answer is to have userspace only enable userspace async #PF
>>> when it's useful, but "an all or nothing" approach isn't great uAPI.
>>> On the flip side, adding uAPI for a use case that doesn't exist
>>> doesn't make sense either :-/
>>
>> I wasn't able to locate the code that would check whether faulting
>> would sleep in APF-kernel. KVM spins APF-kernel whenever it can ([2]).
>> Please let me know if I'm missing something here.
>
> kvm_can_do_async_pf() will be reached if and only if faulting in the
> memory requires waiting. If a page is swapped out, but faulting it back
> in doesn't require waiting, e.g. because it's in zswap and can be
> uncompressed synchronously, then the initial __kvm_faultin_pfn() with
> FOLL_NO_WAIT will succeed.
>
> 	/*
> 	 * If resolving the page failed because I/O is needed to fault-in the
> 	 * page, then either set up an asynchronous #PF to do the I/O, or if
> 	 * doing an async #PF isn't possible, retry with I/O allowed. All
> 	 * other failures are terminal, i.e. retrying won't help.
> 	 */
> 	if (fault->pfn != KVM_PFN_ERR_NEEDS_IO)
> 		return RET_PF_CONTINUE;
>
> 	if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
> 		trace_kvm_try_async_get_page(fault->addr, fault->gfn);
> 		if (kvm_find_async_pf_gfn(vcpu, fault->gfn)) {
> 			trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
> 			kvm_make_request(KVM_REQ_APF_HALT, vcpu);
> 			return RET_PF_RETRY;
> 		} else if (kvm_arch_setup_async_pf(vcpu, fault)) {
> 			return RET_PF_RETRY;
> 		}
> 	}
>
> The conundrum with userspace async #PF is that if userspace is given
> only a single bit per gfn to force an exit, then KVM won't be able to
> differentiate between "faults" that will be handled synchronously by
> the vCPU task, and faults that userspace will hand off to an I/O task.
> If the fault is handled synchronously, KVM will needlessly inject a
> not-present #PF and a present IRQ.

Right, but from the guest's point of view, async PF means "it will
probably take a while for the host to get the page, so I may consider
doing something else in the meantime (i.e. schedule another process if
available)". If we are exiting to userspace, it isn't going to be quick
anyway, so we can consider all such faults "long" and warranting the
execution of the async PF protocol. So always injecting a not-present
#PF and a page ready IRQ doesn't look too wrong in that case.

> But that's a non-issue if the known use cases are all-or-nothing, i.e.
> if all userspace faults are either synchronous or asynchronous.

Yes, pretty much. The user will be choosing the extreme that is more
performant for their specific use case.

>> [2] https://elixir.bootlin.com/linux/v6.13.2/source/arch/x86/kvm/mmu/mmu.c#L4360
>>> Exiting to userspace in vCPU context is also kludgy. It makes sense
>>> for base userfault, because the vCPU can't make forward progress
>>> until the fault is resolved. Actually, I'm not even sure it makes
>>> sense there. I'll follow-up in
>>
>> Even though we exit to userspace, in case of APF-user, userspace is
>> supposed to VM enter straight after scheduling the async job, which is
>> then executed concurrently with the vCPU.
>>
>>> James' series. Anyways, it definitely doesn't make sense for async
>>> #PF, because the whole point is to let the vCPU run. Signalling
>>> userspace would definitely add complexity, but only because of the
>>> need to communicate the token and wait for userspace to consume said
>>> token. I'll think more on that.
>>
>> By signalling userspace you mean a new non-exit-to-userspace mechanism
>> similar to UFFD?
>
> Yes.
>
>> What advantage can you see in it over exiting to userspace (which
>> already exists in James's series)?
>
> It doesn't exit to userspace :-)
>
> If userspace simply wakes a different task in response to the exit,
> then KVM should be able to wake said task, e.g. by signalling an
> eventfd, and resume the guest much faster than if the vCPU task needs
> to roundtrip to userspace. Whether or not such an optimization is worth
> the complexity is an entirely different question though.
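
For readers unfamiliar with the wakeup primitive mentioned above, the following is a small userspace sketch of eventfd semantics only. It assumes Linux and uses the real eventfd(2), read(2) and write(2) calls; `fault_notify()` and `fault_consume()` are hypothetical helper names, and nothing here is KVM uAPI. The write stands in for whoever signals that a fault needs handling, the read for the I/O task that picks it up. How the token would be communicated is deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Signal one pending fault; stands in for the notifier side of the wakeup. */
static int fault_notify(int efd)
{
	uint64_t one = 1;

	assert(efd >= 0);
	/* Each write adds its value to the eventfd's 64-bit counter. */
	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Consume all pending notifications on a non-blocking eventfd.
 * Returns the number of signals accumulated since the last read,
 * 0 if none are pending, or -1 on any other error.
 */
static int64_t fault_consume(int efd)
{
	uint64_t count;
	ssize_t n;

	assert(efd >= 0);
	/* A successful read returns the counter value and resets it to zero. */
	n = read(efd, &count, sizeof(count));
	if (n == (ssize_t)sizeof(count))
		return (int64_t)count;
	if (n < 0 && errno == EAGAIN)
		return 0;
	return -1;
}
```

A consumer would create the descriptor with `eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC)` and typically wait on it with poll(2) or epoll(7) before calling `fault_consume()`. Several notifications that arrive before the consumer runs are coalesced into a single counter value, which is one reason a separate channel for tokens would still be needed.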

This reminds me of the discussion about VMA-less UFFD that has come up
several times, such as [1], but AFAIK hasn't materialised into
something actionable. I may be wrong, but James was looking into that
and couldn't figure out a way to scale it sufficiently for his use
case and had to stick with the VM-exit-based approach. Can you see a
world where VM-exit userfaults coexist with a no-VM-exit way of handling
async PFs?

[1]: https://lore.kernel.org/kvm/ZqwKuzfAs7pvdHAN@x1n/