Re: [RFC PATCH 0/6] KVM: x86: async PF user
From: Sean Christopherson <seanjc@google.com>
Date: 2025-02-27 16:44:04
Also in:
kvm, linux-doc, lkml
On Wed, Feb 26, 2025, Nikita Kalyazin wrote:
On 26/02/2025 00:58, Sean Christopherson wrote:
On Fri, Feb 21, 2025, Nikita Kalyazin wrote:
On 20/02/2025 18:49, Sean Christopherson wrote:
On Thu, Feb 20, 2025, Nikita Kalyazin wrote:
On 19/02/2025 15:17, Sean Christopherson wrote:
On Wed, Feb 12, 2025, Nikita Kalyazin wrote:

The conundrum with userspace async #PF is that if userspace is given only a single bit per gfn to force an exit, then KVM won't be able to differentiate between "faults" that will be handled synchronously by the vCPU task, and faults that userspace will hand off to an I/O task. If the fault is handled synchronously, KVM will needlessly inject a not-present #PF and a present IRQ.

Right, but from the guest's point of view, async PF means "it will probably take a while for the host to get the page, so I may consider doing something else in the meantime (i.e. schedule another process if available)".

Except in this case, the guest never gets a chance to run, i.e. it can't do something else. From the guest point of view, if KVM doesn't inject what is effectively a spurious async #PF, the VM-Exiting instruction simply took a (really) long time to execute.

Sorry, I didn't get that. If userspace learns from the kvm_run::memory_fault::flags that the exit is due to an async PF, it should call kvm run immediately, inject the not-present PF and allow the guest to reschedule. What do you mean by "the guest never gets a chance to run"?

What I'm saying is that, as proposed, the API doesn't precisely tell userspace
                                                                     ^^^^^^^^^
                                                                     KVM
an exit happened due to an "async #PF". KVM has absolutely zero clue as to whether or not userspace is going to do an async #PF, or if userspace wants to intercept the fault for some entirely different purpose.

Userspace is supposed to know whether the PF is async from the dedicated flag added in the memory_fault structure: KVM_MEMORY_EXIT_FLAG_ASYNC_PF_USER. It will be set when KVM managed to inject page-not-present. Are you saying it isn't sufficient?
Gah, sorry, typo. The API doesn't tell *KVM* that a userfault exit is due to an async #PF.
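To make the distinction concrete, here is a toy model of the decision KVM has to make at fault time. It is only a sketch under stated assumptions: the enums, the function names, and the multi-state encoding are all hypothetical and are not taken from the posted series or from any existing KVM interface. It contrasts a single "exit" bit per gfn, where KVM has to guess, with an encoding in which userspace states up front whether it will resolve the fault synchronously or hand it off.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-gfn states, for illustration only. */
enum userfault_mode {
	UF_NONE = 0,	/* page is present, no exit needed */
	UF_EXIT_SYNC,	/* exit; the vCPU task resolves the fault itself */
	UF_EXIT_ASYNC,	/* exit; userspace hands the fault to an I/O task */
};

/* What the (modelled) fault handler ends up doing. */
enum fault_action {
	ACT_MAP_PAGE,		/* resolve in the kernel, no exit */
	ACT_EXIT_ONLY,		/* exit to userspace, inject nothing */
	ACT_EXIT_INJECT_APF,	/* inject not-present #PF, then exit */
};

/*
 * Single-bit scheme: the only information is "exit", so the model
 * injects an async #PF whenever the guest is able to take one, even if
 * userspace was going to resolve the fault synchronously.
 */
static enum fault_action decide_single_bit(bool userfault,
					   bool guest_can_take_apf)
{
	if (!userfault)
		return ACT_MAP_PAGE;
	return guest_can_take_apf ? ACT_EXIT_INJECT_APF : ACT_EXIT_ONLY;
}

/*
 * Multi-state scheme: userspace declares its intent per gfn, so the
 * model injects only when an asynchronous resolution was requested.
 */
static enum fault_action decide_multi_state(enum userfault_mode mode,
					    bool guest_can_take_apf)
{
	assert((unsigned int)mode <= UF_EXIT_ASYNC);

	if (mode == UF_NONE)
		return ACT_MAP_PAGE;
	if (mode == UF_EXIT_ASYNC && guest_can_take_apf)
		return ACT_EXIT_INJECT_APF;
	return ACT_EXIT_ONLY;
}
```

In this model, decide_single_bit(true, true) always injects, which is the needless not-present #PF plus present IRQ described above when the fault is in fact handled synchronously. decide_multi_state(UF_EXIT_SYNC, true) exits without injecting anything.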
Unless the remote page was already requested, e.g. by a different vCPU, or by a prefetching algorithm.

Conversely, if the page content is available, it must have already been prepopulated into guest memory pagecache, the bit in the bitmap is cleared and no exit to userspace occurs.

But that doesn't happen instantaneously. Even if the VMM somehow atomically receives the page and marks it present, it's still possible for marking the page present to race with KVM checking the bitmap.

That looks like a generic problem of the VM-exit fault handling. E.g. when
Heh, it's a generic "problem" for faults in general. E.g. modern x86 CPUs will take "spurious" page faults on write accesses if a PTE is writable in memory but the CPU has a read-only mapping cached in its TLB. It's all a matter of cost. E.g. pre-Nehalem Intel CPUs didn't take such spurious read-only faults as they would re-walk the in-memory page tables, but that ended up being a net negative because the cost of re-walking for all read-only faults outweighed the benefits of avoiding spurious faults in the unlikely scenario the fault had already been fixed. For a spurious async #PF + IRQ, the cost could be significant, e.g. due to causing unwanted context switches in the guest, in addition to the raw overhead of the faults, interrupts, and exits.
one vCPU exits, userspace handles the fault and races setting the bitmap with another vCPU that is about to fault the same page, which may cause a spurious exit. On the other hand, is it malignant? The only downside is additional overhead of the async PF protocol, but if the race occurs infrequently, it shouldn't be a problem.
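The race being discussed can be shown with a small deterministic model. This is a sketch only: the bitmap type, the helpers, and the two-step "sample the bit, then act on it" split are hypothetical simplifications, not the layout or code of the proposed series. It replays the interleaving where the VMM marks the page present after the vCPU has already sampled the userfault bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical one-bit-per-gfn bitmap; a set bit means "exit to userspace". */
struct userfault_bitmap {
	uint64_t bits;
};

static void uf_set(struct userfault_bitmap *bm, unsigned int gfn)
{
	bm->bits |= UINT64_C(1) << gfn;
}

static void uf_clear(struct userfault_bitmap *bm, unsigned int gfn)
{
	bm->bits &= ~(UINT64_C(1) << gfn);
}

static bool uf_test(const struct userfault_bitmap *bm, unsigned int gfn)
{
	return (bm->bits >> gfn) & 1;
}

/*
 * Returns true if the vCPU exits even though the page is already marked
 * present by the time it acts, i.e. if the exit is spurious.
 */
static bool spurious_exit(bool vmm_clears_after_check)
{
	struct userfault_bitmap bm = { 0 };
	const unsigned int gfn = 5;
	bool snapshot, exits;

	uf_set(&bm, gfn);		/* page not populated yet */

	if (!vmm_clears_after_check)
		uf_clear(&bm, gfn);	/* VMM finishes before the check */

	snapshot = uf_test(&bm, gfn);	/* vCPU samples the bit */

	if (vmm_clears_after_check)
		uf_clear(&bm, gfn);	/* VMM finishes after the check */

	exits = snapshot;		/* vCPU acts on its stale snapshot */
	return exits && !uf_test(&bm, gfn);
}
```

In this model, spurious_exit(true) reports a spurious exit while spurious_exit(false) does not. With the single-bit async #PF scheme, every such spurious exit would also carry an injected not-present #PF and a later present IRQ, which is where the cost question comes from.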
When it comes to uAPI, I want to try and avoid statements along the lines of "IF 'x' holds true, then 'y' SHOULDN'T be a problem". If this didn't impact uAPI, I wouldn't care as much, i.e. I'd be much more willing to iterate as needed. I'm not saying we should go straight for a complex implementation. Quite the opposite. But I do want us to consider the possible ramifications of using a single bit for all userfaults, so that we can at least try to design something that is extensible and won't be a pain to maintain.