Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create... | linux-api

Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory

From: Vlastimil Babka <hidden>
Date: 2022-11-14 14:02:47
Also in: kvm, linux-arch, linux-doc, linux-fsdevel, linux-mm, lkml, qemu-devel

On 11/1/22 16:19, Michael Roth wrote:

On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:

quoted

  1) restoring kernel directmap:

     Currently SNP (and I believe TDX) need to either split or remove kernel
     direct mappings for restricted PFNs, since there is no guarantee that
     other PFNs within a 2MB range won't be used for non-restricted memory
     (which will cause an RMP #PF in the case of SNP since the 2MB
     mapping overlaps with guest-owned pages)

Has the splitting and restoring been a well-discussed direction? I'm
just curious whether there are other options to solve this issue.

For SNP it's been discussed for quite some time, and either splitting or
removing private entries from directmap are the well-discussed ways I'm
aware of to avoid RMP violations due to some other kernel process using
a 2MB mapping to access shared memory if there are private pages that
happen to be within that range.

In both cases the issue of how to restore directmap as 2M becomes a
problem.

I was also under the impression TDX had similar requirements. If so,
do you know what the plan is for handling this for TDX?

There are also 2 potential alternatives I'm aware of, but these haven't
been discussed in much detail AFAIK:

a) Ensure confidential guests are backed by 2MB pages. shmem has a way to
   request 2MB THP pages, but I'm not sure how reliably we can guarantee
   that enough THPs are available, so if we went that route we'd probably
   be better off requiring the use of hugetlbfs as the backing store. But
   obviously that's a bit limiting and it would be nice to have the option
   of using normal pages as well. One nice thing with the invalidation
   scheme proposed here is that this would "Just Work" if we implement
   hugetlbfs support, so an admin that doesn't want any directmap
   splitting has this option available, otherwise it's done as a
   best-effort.

b) Implement general support for restoring directmap as 2M even when
   subpages might be in use by other kernel threads. This would be the
   most flexible approach since it requires no special handling during
   invalidations, but I think it's only possible if all the CPA
   attributes for the 2M range are the same at the time the mapping is
   restored/unsplit, so there are some potential locking issues there and
   still a chance of splitting directmap over time.

I've been hoping for

c) using a mechanism such as [1] [2], where the goal is to group together
the small allocations that need to increase directmap granularity, so that
the maximum number of large mappings is preserved. But I guess that means
knowing at allocation time that this will happen. So I've been wondering how
this could be employed in the SNP/UPM case. I guess it depends on
how we expect the private/shared conversions to happen in practice, and I
don't know the details. I can imagine the following complications:

- a memfd_restricted region is created such that it's 2MB large/aligned,
i.e. like case a) above, we can allocate it normally. Now, what if a 4k page
in the middle is to be temporarily converted to shared for some
communication between host and guest (can such a thing happen?). With the
punch hole approach, I wonder if we end up fragmenting directmap
unnecessarily? IIUC the now shared page will become backed by some other
page (as the memslot supports both private and shared pages simultaneously).
But does it really make sense to split the direct mapping (and e.g. the
shmem page)? We could leave the whole 2MB unmapped without splitting if we
didn't free the private 4k subpage.

- a restricted region is created that's below 2MB. If something like [1] is
merged, it could be used for the backing pages to limit directmap
fragmentation. But then in case it's eventually fallocated to become larger
and gain one or more 2MB-aligned ranges, the result is suboptimal. Unless
in that case we migrate the existing pages to a THP-backed shmem, kinda like
khugepaged collapses hugepages. But that would have to be coordinated with
the guest, maybe not even possible?

[1] https://lore.kernel.org/all/20220127085608.306306-1-rppt@kernel.org/
[2] https://lwn.net/Articles/894557/

quoted

     Previously we were able to restore 2MB mappings to some degree
     since both shared/restricted pages were all pinned, so anything
     backed by a THP (or hugetlb page once that is implemented) at guest
     teardown could be restored as 2MB direct mapping.

     Invalidation seems like the most logical time to have this happen,

Currently invalidation only happens at user-initiated fallocate(). It
does not cover the VM teardown case where the restoring might also be
expected to be handled.

Right, I forgot to add that in my proposed changes I added invalidations
for any still-allocated private pages present when the restricted memfd
notifier is unregistered. This was needed to avoid leaking pages back to
the kernel that still need directmap or RMP table fixups. I also added
similar invalidations for memfd->release(), since it seems possible that
userspace might close() it before shutting down guest, but maybe the
latter is not needed if KVM takes a reference on the FD during life of
the guest.

quoted

     but whether or not to restore as 2MB requires the order to be 2MB
     or larger, and for GPA range being invalidated to cover the entire
     2MB (otherwise it means the page was potentially split and some
     subpages were freed back to the host already, in which case it can't
     be restored as 2MB).

  2) Potentially fewer invalidations:

     If we pass the entire folio or compound_page as part of
     invalidation, we would only need to issue 1 invalidation per folio.

I'm not sure I agree, the current invalidation covers the whole range
that is passed from userspace and the invalidation is invoked only once
for each userspace fallocate().

That's true, it only reduces invalidations if we decide to provide a
struct page/folio as part of the invalidation callbacks, which isn't
the case yet. Sorry for the confusion.

quoted

  3) Potentially useful for hugetlbfs support:

     One issue with hugetlbfs is that we don't support splitting the
     hugepage in such cases, which was a big obstacle prior to UPM. Now
     however, we may have the option of doing "lazy" invalidations where
     fallocate(PUNCH_HOLE, ...) won't free a shmem-allocated page unless
     all the subpages within the 2M range are either hole-punched, or the
     guest is shut down, so in that way we never have to split it. Sean
     was pondering something similar in another thread:

       https://lore.kernel.org/linux-mm/YyGLXXkFCmxBfu5U@google.com/

     Issuing invalidations with folio-granularity ties in fairly well
     with this sort of approach if we end up going that route.
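The "lazy" invalidation idea in item 3) can be illustrated with a small userspace sketch. This is not code from the patchset: `struct lazy_hpage`, `lazy_hpage_init()` and `lazy_hpage_punch()` are hypothetical names, and the sketch assumes 4K base pages so that a 2M range holds 512 subpages. It only models the bookkeeping: hole punches are recorded per subpage, and the backing huge page becomes eligible for freeing (without ever being split) once every subpage has been punched.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 4K base pages, so one 2M huge page has 512 subpages. */
#define SUBPAGES_PER_2M 512

/* Hypothetical per-huge-page tracking state (not a kernel structure). */
struct lazy_hpage {
    uint64_t punched[SUBPAGES_PER_2M / 64]; /* one bit per 4K subpage */
    unsigned int nr_punched;                /* number of bits set above */
};

static void lazy_hpage_init(struct lazy_hpage *hp)
{
    memset(hp, 0, sizeof(*hp));
}

/*
 * Record a hole punch covering subpages [first, first + nr).
 * Subpages that were already punched are not counted twice.
 * Returns true once all 512 subpages are punched, i.e. when the whole
 * huge page could be freed without having been split.
 */
static bool lazy_hpage_punch(struct lazy_hpage *hp,
                             unsigned int first, unsigned int nr)
{
    for (unsigned int i = first; i < first + nr && i < SUBPAGES_PER_2M; i++) {
        uint64_t bit = 1ULL << (i % 64);

        if (!(hp->punched[i / 64] & bit)) {
            hp->punched[i / 64] |= bit;
            hp->nr_punched++;
        }
    }
    return hp->nr_punched == SUBPAGES_PER_2M;
}
```

Guest shutdown, the other trigger mentioned above, would simply free the page regardless of the bitmap state; that path is left out of the sketch.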

There is a semantic difference between the current one and the proposed
one: the current invalidation range is exactly what userspace passed down
to the kernel (the range being fallocated), while the proposed one will be
a subset of that (if the userspace-provided addr/size is not aligned to a
power of two). I'm not quite confident this difference has no side effects.
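To make the "subset" concern concrete, here is a minimal arithmetic sketch, assuming a 2MB granule and 4K base pages. The names `struct range`, `inner_2m_range()` and `can_restore_2m()` are hypothetical and do not come from the patchset. The first helper computes the largest 2MB-aligned subrange of what userspace passed in; the second expresses the restore condition discussed earlier in the thread (order of at least 2MB, and the invalidated range covering the entire 2MB block).

```c
#include <stdbool.h>
#include <stdint.h>

#define SIZE_2M  (1ULL << 21)
#define ORDER_2M 9              /* assumption: 2MB / 4K pages = 2^9 */

struct range {                  /* byte range [start, end) */
    uint64_t start;
    uint64_t end;
};

static uint64_t align_down_2m(uint64_t x)
{
    return x & ~(SIZE_2M - 1);
}

static uint64_t align_up_2m(uint64_t x)
{
    /* Note: ignores overflow for values within 2MB of UINT64_MAX. */
    return (x + SIZE_2M - 1) & ~(SIZE_2M - 1);
}

/*
 * Largest 2MB-aligned subrange fully contained in r.
 * Returns false if r does not cover any complete 2MB block.
 */
static bool inner_2m_range(struct range r, struct range *out)
{
    uint64_t s = align_up_2m(r.start);
    uint64_t e = align_down_2m(r.end);

    if (s >= e)
        return false;
    out->start = s;
    out->end = e;
    return true;
}

/*
 * A 2MB block starting at block_start (2MB-aligned) is a candidate for
 * being restored as a 2MB mapping only if the backing folio is at least
 * 2MB and the invalidated range covers the whole block.
 */
static bool can_restore_2m(unsigned int folio_order, struct range inv,
                           uint64_t block_start)
{
    return folio_order >= ORDER_2M &&
           inv.start <= block_start &&
           block_start + SIZE_2M <= inv.end;
}
```

For example, a hole punch over [1MB, 5MB) contains only [2MB, 4MB) as a 2MB-aligned subrange, so the first and last megabyte of the userspace request fall outside it. That is the gap between the two semantics being discussed.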

In theory userspace should not be allocating/hole-punching restricted
pages for GPA ranges that are already mapped as private in the xarray,
and KVM could potentially fail such requests (though it does not currently).

But if we somehow enforced that, then we could rely on
KVM_MEMORY_ENCRYPT_REG_REGION to handle all the MMU invalidation stuff,
which would free up the restricted fd invalidation callbacks to be used
purely to handle doing things like RMP/directmap fixups prior to returning
restricted pages back to the host. So that was sort of my thinking why the
new semantics would still cover all the necessary cases.

-Mike

quoted

I need to rework things for v9, and we'll probably want to use struct
folio instead of struct page now, but as a proof-of-concept of sorts this
is what I'd added on top of v8 of your patchset to implement 1) and 2):

  https://github.com/mdroth/linux/commit/127e5ea477c7bd5e4107fd44a04b9dc9e9b1af8b

Does an approach like this seem reasonable? Should we work this into the
base restricted memslot support?

If the above-mentioned semantic difference is not a problem, I don't
have a strong objection to this.

Sean, since you have a much better understanding of this, what is your
take on this?

Chao

quoted

Thanks,

Mike
