Thread: 101 messages, 16 authors, 2022-12-02

Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory

From: Chao Peng <hidden>
Date: 2022-11-15 09:53:28
Also in: kvm, linux-arch, linux-doc, linux-fsdevel, linux-mm, lkml, qemu-devel

On Mon, Nov 14, 2022 at 04:16:32PM -0600, Michael Roth wrote:
On Mon, Nov 14, 2022 at 06:28:43PM +0300, Kirill A. Shutemov wrote:
On Mon, Nov 14, 2022 at 03:02:37PM +0100, Vlastimil Babka wrote:
On 11/1/22 16:19, Michael Roth wrote:
On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
  1) restoring kernel directmap:

     Currently SNP (and I believe TDX) need to either split or remove kernel
     direct mappings for restricted PFNs, since there is no guarantee that
     other PFNs within a 2MB range won't be used for non-restricted memory
     (which will cause an RMP #PF in the case of SNP, since the 2MB
     mapping overlaps with guest-owned pages).
Has the splitting and restoring been a well-discussed direction? I'm
just curious whether there are other options to solve this issue.
For SNP it's been discussed for quite some time, and either splitting or
removing private entries from the directmap are the well-discussed ways I'm
aware of to avoid RMP violations caused by some other kernel process using
a 2MB mapping to access shared memory when private pages happen to be
within that range.
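
To make the overlap concrete: with 4KB base pages, one 2MB direct mapping
covers 512 consecutive PFNs, so a single guest-owned PFN affects every other
PFN in the same aligned group. A minimal userspace sketch of that arithmetic
follows. The helper names are hypothetical and this is not kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT   12                                   /* 4KB base pages */
#define PMD_SHIFT    21                                   /* 2MB large mapping */
#define PFNS_PER_2M  (1ULL << (PMD_SHIFT - PAGE_SHIFT))   /* 512 PFNs */

/* Hypothetical helper: first PFN of the 2MB-aligned group containing pfn. */
static uint64_t pfn_2m_start(uint64_t pfn)
{
	return pfn & ~(PFNS_PER_2M - 1);
}

/*
 * Hypothetical helper: true if both PFNs would be covered by the same
 * 2MB direct-map entry, i.e. an access to one through a 2MB mapping
 * also overlaps the other.
 */
static bool pfn_shares_2m_mapping(uint64_t a, uint64_t b)
{
	return pfn_2m_start(a) == pfn_2m_start(b);
}
```

If a private page sits at PFN 0x12345, any shared page in PFNs 0x12200
through 0x123ff falls under the same 2MB entry.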

In both cases the issue of how to restore directmap as 2M becomes a
problem.

I was also under the impression TDX had similar requirements. If so,
do you know what the plan is for handling this for TDX?

There are also 2 potential alternatives I'm aware of, but these haven't
been discussed in much detail AFAIK:

a) Ensure confidential guests are backed by 2MB pages. shmem has a way to
   request 2MB THP pages, but I'm not sure how reliably we can guarantee
   that enough THPs are available, so if we went that route we'd probably
   be better off requiring the use of hugetlbfs as the backing store. But
   obviously that's a bit limiting and it would be nice to have the option
   of using normal pages as well. One nice thing with the invalidation
   scheme proposed here is that this would "Just Work" if we implement
   hugetlbfs support, so an admin that doesn't want any directmap
   splitting has this option available; otherwise it's done on a
   best-effort basis.

b) Implement general support for restoring directmap as 2M even when
   subpages might be in use by other kernel threads. This would be the
   most flexible approach since it requires no special handling during
   invalidations, but I think it's only possible if all the CPA
   attributes for the 2M range are the same at the time the mapping is
   restored/unsplit, so there are some potential locking issues there, and
   still a chance of the directmap being split over time.
I've been hoping that

c) using a mechanism such as [1] [2] where the goal is to group together
these small allocations that need to increase directmap granularity, so
that the maximum number of large mappings is preserved.
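
The precondition mentioned in option (b), that all attributes across the 2M
range match before the mapping can be restored, can be sketched roughly as
below. This is a simplified userspace model, not the kernel's CPA code: the
struct and function names are hypothetical, and real page-table entries carry
more state than shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PTES_PER_2M 512   /* 4KB entries covered by one 2MB mapping */

/* Hypothetical, simplified model of one 4KB direct-map entry. */
struct pte_model {
	uint64_t pfn;     /* physical frame the entry maps */
	uint64_t attrs;   /* protection/cache attribute bits */
	bool present;     /* false if the entry was removed from the directmap */
};

/*
 * Returns true only if the 512 entries could be replaced by a single 2MB
 * entry: all present, physically contiguous from a 2MB-aligned PFN, and
 * carrying identical attributes.
 */
static bool can_restore_2m(const struct pte_model *ptes)
{
	assert(ptes != NULL);

	if (!ptes[0].present || (ptes[0].pfn & (PTES_PER_2M - 1)))
		return false;

	for (size_t i = 1; i < PTES_PER_2M; i++) {
		if (!ptes[i].present)
			return false;
		if (ptes[i].pfn != ptes[0].pfn + i)
			return false;
		if (ptes[i].attrs != ptes[0].attrs)
			return false;
	}
	return true;
}

/* Test helper: fill a table with a uniform, contiguous mapping. */
static void fill_uniform(struct pte_model *ptes, uint64_t base_pfn,
			 uint64_t attrs)
{
	for (size_t i = 0; i < PTES_PER_2M; i++) {
		ptes[i].pfn = base_pfn + i;
		ptes[i].attrs = attrs;
		ptes[i].present = true;
	}
}
```

A single entry with different attributes, or one removed entry, is enough to
block the restore in this model. In the kernel the check would also have to
hold while other threads may be changing attributes, which is where the
locking concern comes from.
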
As I mentioned in the other thread the restricted memfd can be backed by
secretmem instead of plain memfd. It already handles directmap with care.
It looks like it would handle direct unmapping/cleanup nicely, but it
seems to lack fallocate(PUNCH_HOLE) support, which we'd probably want in
order to avoid additional memory requirements. I think once we add that,
we'd still end up needing some sort of handling for the invalidations.

Also, I know Chao has been considering hugetlbfs support, I assume by
leveraging the support that already exists in shmem. Ideally SNP would
be able to make use of that support as well, but relying on a separate
backend seems likely to result in more complications getting there
later.
But I don't think it has to be part of the initial restricted memfd
implementation. It is a SEV-specific requirement, and AMD folks can extend
the implementation as needed later.
Admittedly the suggested changes to the invalidation mechanism made a
lot more sense to me when I was under the impression that TDX would have
similar requirements and we might end up with a common hook. Since that
doesn't actually seem to be the case, it makes sense to try to do it as
a platform-specific hook for SNP.

I think, given a memslot, a GFN range, and kvm_restricted_mem_get_pfn(),
we should be able to get the same information needed to figure out whether
the range is backed by huge pages or not. I'll see how that works out
instead.
Sounds like a viable solution, except that kvm_restricted_mem_get_pfn()
only lets you check a single page, not a range. But you can still call
it once per page, I think.
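
A rough sketch of that per-page walk is below. The callback type is a
hypothetical stand-in for a lookup such as kvm_restricted_mem_get_pfn(); the
signature used in the patch series may differ, and the toy backends exist
only so the sketch can be exercised in userspace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ORDER_2M 9   /* 2MB / 4KB = 512 = 1 << 9 */

/*
 * Hypothetical per-page lookup: returns 0 on success and reports the PFN
 * backing gfn plus the order of the backing allocation.
 */
typedef int (*get_pfn_fn)(uint64_t gfn, uint64_t *pfn, int *order);

/*
 * Walk [gfn, gfn + 512) one page at a time and report whether the whole
 * range is backed by a single 2MB-aligned huge page.
 */
static bool range_backed_by_2m(get_pfn_fn get_pfn, uint64_t gfn)
{
	const uint64_t nr = 1ULL << ORDER_2M;
	uint64_t first_pfn = 0;

	assert(get_pfn != NULL);

	if (gfn & (nr - 1))
		return false;           /* GFN range must be 2MB aligned */

	for (uint64_t i = 0; i < nr; i++) {
		uint64_t pfn;
		int order;

		if (get_pfn(gfn + i, &pfn, &order) != 0)
			return false;   /* lookup failed or page not populated */
		if (order < ORDER_2M)
			return false;   /* backed by a smaller allocation */

		if (i == 0) {
			if (pfn & (nr - 1))
				return false;   /* backing PFN not 2MB aligned */
			first_pfn = pfn;
		} else if (pfn != first_pfn + i) {
			return false;   /* not physically contiguous */
		}
	}
	return true;
}

/* Toy backend: every page belongs to a contiguous order-9 allocation. */
static int toy_huge(uint64_t gfn, uint64_t *pfn, int *order)
{
	*pfn = 0x40000 + gfn;
	*order = ORDER_2M;
	return 0;
}

/* Toy backend: every page is an independent order-0 allocation. */
static int toy_small(uint64_t gfn, uint64_t *pfn, int *order)
{
	*pfn = 0x40000 + gfn;
	*order = 0;
	return 0;
}
```

This costs 512 lookups per 2MB range, so a real implementation would likely
bail out early or skip ahead based on the reported order.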

The invalidation callback will still be needed; it gives you the chance
to do the restoring.

Chao
Thanks,

Mike
-- 
  Kiryl Shutsemau / Kirill A. Shutemov