Thread (97 messages) 97 messages, 14 authors, 2022-11-03

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

From: Sean Christopherson <seanjc@google.com>
Date: 2022-10-21 16:53:56
Also in: kvm, linux-doc, linux-fsdevel, linux-mm, lkml, qemu-devel

On Fri, Oct 21, 2022, Chao Peng wrote:
On Thu, Oct 20, 2022 at 04:20:58PM +0530, Vishal Annapurve wrote:
quoted
On Wed, Oct 19, 2022 at 9:02 PM Kirill A . Shutemov [off-list ref] wrote:
quoted
On Tue, Oct 18, 2022 at 07:12:10PM +0530, Vishal Annapurve wrote:
quoted
I think moving this notifier_invalidate before fallocate may not solve
the problem completely. Is it possible that between invalidate and
fallocate, KVM tries to handle the page fault for the guest VM from
another vcpu and uses the pages to be freed to back gpa ranges? Should
hole punching here also update mem_attr first to say that KVM should
consider the corresponding gpa ranges to be no more backed by
inaccessible memfd?
We rely on external synchronization to prevent this. See code around
mmu_invalidate_retry_hva().

--
  Kiryl Shutsemau / Kirill A. Shutemov
IIUC, mmu_invalidate_retry_hva/gfn ensures that page faults on gfn
ranges that are being invalidated are retried till invalidation is
complete. In this case, is it possible that KVM tries to serve the
page fault after inaccessible_notifier_invalidate is complete but
before fallocate could punch hole into the files?
It's not just the page fault edge case.  In the more straightforward scenario
where the memory is already mapped into the guest, freeing pages back to the kernel
before they are removed from the guest will lead to use-after-free.
quoted
e.g.
inaccessible_notifier_invalidate(...)
... (system event preempting this control flow, giving a window for
the guest to retry accessing the gfn range which was invalidated)
fallocate(.., PUNCH_HOLE..)
Looks this is something can happen.
And sounds to me the solution needs
just follow the mmu_notifier's way of using a invalidate_start/end pair.

  invalidate_start()  --> kvm->mmu_invalidate_in_progress++;
                          zap KVM page table entries;
  fallocate()
  invalidate_end()  --> kvm->mmu_invalidate_in_progress--;

Then during invalidate_start/end time window mmu_invalidate_retry_gfn
checks 'mmu_invalidate_in_progress' and prevent repopulating the same
page in KVM page table.
Yes, if it's not safe to invalidate after making the change (fallocate()), then
the change needs to be bookended by a start+end pair.  The mmu_notifier's unpaired
invalidate() hook works by zapping the primary MMU's PTEs before invalidate(), but
frees the underlying physical page _after_ invalidate().

And the only reason the unpaired invalidate() exists is because there are secondary
MMUs that reuse the primary MMU's page tables, e.g. shared virtual addressing, in
which case bookending doesn't work because the secondary MMU can't remove PTEs, it
can only flush its TLBs.

For this case, the whole point is to not create PTEs in the primary MMU, so there
should never be a use case that _needs_ an unpaired invalidate().

TL;DR: a start+end pair is likely the simplest solution.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help