Thread (101 messages) 101 messages, 16 authors, 2022-12-02

Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory

From: Michael Roth <hidden>
Date: 2022-11-02 21:20:05
Also in: kvm, linux-arch, linux-doc, linux-fsdevel, linux-mm, lkml, qemu-devel

On Wed, Nov 02, 2022 at 10:53:25PM +0800, Chao Peng wrote:
On Tue, Nov 01, 2022 at 02:30:58PM -0500, Michael Roth wrote:
quoted
On Tue, Nov 01, 2022 at 10:19:44AM -0500, Michael Roth wrote:
quoted
On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
quoted
On Mon, Oct 31, 2022 at 12:47:38PM -0500, Michael Roth wrote:
quoted
On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
quoted
  3) Potentially useful for hugetlbfs support:

     One issue with hugetlbfs is that we don't support splitting the
     hugepage in such cases, which was a big obstacle prior to UPM. Now
     however, we may have the option of doing "lazy" invalidations where
     fallocate(PUNCH_HOLE, ...) won't free a shmem-allocate page unless
     all the subpages within the 2M range are either hole-punched, or the
     guest is shut down, so in that way we never have to split it. Sean
     was pondering something similar in another thread:

       https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flinux-mm%2FYyGLXXkFCmxBfu5U%40google.com%2F&amp;data=05%7C01%7Cmichael.roth%40amd.com%7C13192ae987b442f10b7408dabce2a4c5%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638029978853935768%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=Is%2Bfm3c9BGFmU%2Btn3ZgPPQnUeCK%2BhKPArsPrWY5JeSg%3D&amp;reserved=0

     Issuing invalidations with folio-granularity ties in fairly well
     with this sort of approach if we end up going that route.
There is semantics difference between the current one and the proposed
one: The invalidation range is exactly what userspace passed down to the
kernel (being fallocated) while the proposed one will be subset of that
(if userspace-provided addr/size is not aligned to power of two), I'm
not quite confident this difference has no side effect.
In theory userspace should not be allocating/hole-punching restricted
pages for GPA ranges that are already mapped as private in the xarray,
and KVM could potentially fail such requests (though it does currently).

But if we somehow enforced that, then we could rely on
KVM_MEMORY_ENCRYPT_REG_REGION to handle all the MMU invalidation stuff,
which would free up the restricted fd invalidation callbacks to be used
purely to handle doing things like RMP/directmap fixups prior to returning
restricted pages back to the host. So that was sort of my thinking why the
new semantics would still cover all the necessary cases.
Sorry, this explanation is if we rely on userspace to fallocate() on 2MB
boundaries, and ignore any non-aligned requests in the kernel. But
that's not how I actually ended up implementing things, so I'm not sure
why answered that way...

In my implementation we actually do issue invalidations for fallocate()
even for non-2M-aligned GPA/offset ranges. For instance (assuming
restricted FD offset 0 corresponds to GPA 0), an fallocate() on GPA
range 0x1000-0x402000 would result in the following invalidations being
issued if everything was backed by a 2MB page:

  invalidate GPA: 0x001000-0x200000, Page: pfn_to_page(I), order:9
  invalidate GPA: 0x200000-0x400000, Page: pfn_to_page(J), order:9
  invalidate GPA: 0x400000-0x402000, Page: pfn_to_page(K), order:9
Only see this I understand what you are actually going to propose;)

So the memory range(start/end) will be still there and covers exactly
what it should be from usrspace point of view, the page+order(or just
folio) is really just a _hint_ for the invalidation callbacks. Looks
ugly though.
Yes that's accurate: callbacks still need to handle partial ranges, so
it's more of a hint/optimization for cases where callbacks can benefit
from knowing the entire backing hugepage is being invalidated/freed.
In v9 we use a invalidate_start/ invalidate_end pair to solve a race
contention issue(https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fkvm%2FY1LOe4JvnTbFNs4u%40google.com%2F&amp;data=05%7C01%7Cmichael.roth%40amd.com%7C13192ae987b442f10b7408dabce2a4c5%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638029978853935768%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=zccj0lNcqBCxGVGLBYAD2BCkJuy75nTxFTSUMfDJjzM%3D&amp;reserved=0).
To work with this, I believe we only need pass this hint info for
invalidate_start() since at the invalidate_end() time, the page has
already been discarded.
Ok, yah, that's the approach I'm looking at for v9: pass the page/order
for invalidate_start, but keep invalidate_end as-is.
Another worth-mentioning-thing is invalidate_start/end is not just
invoked for hole punching, but also for allocation(e.g. default
fallocate). While for allocation we can get the page only at the
invalidate_end() time. But AFAICS, the invalidate() is called for
fallocate(allocation) is because previously we rely on the existence in
memory backing store to tell a page is private and we need notify KVM
that the page is being converted from shared to private, but that is not
true for current code and fallocate() is also not mandatory since KVM
can call restrictedmem_get_page() to allocate dynamically, so I think we
can remove the invalidation path for fallocate(allocation).
I actually ended up doing that for the v8 implementation, I figured it
was a holdover from before {REG,UNREG}_REGION were used, but too sure on
that so good to have some confirmation there.
quoted
So you still cover the same range, but the arch/platform callbacks can
then, as a best effort, do things like restore 2M directmap if they see
that the backing page is 2MB+ and the GPA range covers the entire range.
If the GPA doesn't covers the whole range, or the backing page is
order:0, then in that case we are still forced to leave the directmap
split.

But with that in place we can then improve on that by allowing for the
use of hugetlbfs.

We'd still be somewhat reliant on userspace to issue fallocate()'s on
2M-aligned boundaries to some degree (guest teardown invalidations
could be issued as 2M-aligned, which would be the bulk of the pages
in most cases, but for discarding pages after private->shared
conversion we could still get fragmentation). This could maybe be
addressed by keeping track of those partial/non-2M-aligned fallocate()
requests and then issuing them as a batched 2M invalidation once all
the subpages have been fallocate(HOLE_PUNCH)'d. We'd need to enforce
that fallocate(PUNCH_HOLE) is preceeded by
KVM_MEMORY_ENCRYPT_UNREG_REGION to make sure MMU invalidations happen
though.
Don't understand why the sequence matters here, we should do MMU
invalidation for both fallocate(PUNCH_HOLE) and
KVM_MEMORY_ENCRYPT_UNREG_REGION, right?
It should happen in both places as long as it's possible for userspace
to fallocate(PUNCH_HOLE) a private page while it is mapped to a guest.
I'm not necessarily suggesting we should make any changes there right
now, but...

We might need to consider changing that if we decide that we don't want
to allow userspace to basically force splitting the directmap by always
issuing fallocate(PUNCH_HOLE) for each 4K page rather than trying to do
it in 2M intervals when possible, since it would still result in 4K
invalidations being issued, such that optimizations like restoring the
2M directmap can't be done, even when backed by THPs or hugetlbfs pages.
One approach to deal with this is to introduce a bitmap (for instance) to
track what subpages have been fallocate(PUNCH_HOLE)'d, and defer the
actual free'ing/invalidation until a whole page has been marked for
deallocation. This would keep huge-pages/huge-invalidations intact even
if userspace is malicious/non-optimal and actively trying to slow the
host down by forcing the directmap to get split.

*If* we took that approach though, then the MMU invalidations would also
get deferred, which is bad. But if we added a check/callback that
restrictedfd.c could use to confirm that the page is already in a
shared/non-private state, then we'd know the MMU invalidation for the
private page must have already happened, so if anything got faulted in
afterward it should be a shared page. (Though I guess update_mem_attr
would also need to check this bitmap and fail for cases where a
shared->private conversion is issued for a page/range that's been
recorded as having been previously issued a deferred PUNCH_HOLE'd.

But that's only an optimization and probably needs a lot more thought,
not necessarily something I think we need to implement now.

-Mike
Thanks,
Chao
quoted
Not sure on these potential follow-ups, but they all at least seem
compatible with the proposed invalidation scheme.

-Mike
quoted
-Mike
quoted
quoted
I need to rework things for v9, and we'll probably want to use struct
folio instead of struct page now, but as a proof-of-concept of sorts this
is what I'd added on top of v8 of your patchset to implement 1) and 2):

  https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmdroth%2Flinux%2Fcommit%2F127e5ea477c7bd5e4107fd44a04b9dc9e9b1af8b&amp;data=05%7C01%7Cmichael.roth%40amd.com%7C13192ae987b442f10b7408dabce2a4c5%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638029978854091987%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=ghvLOpeqPz%2B6G593enT0p%2F3Ovh9rjHKtsuSQ2xObFHU%3D&amp;reserved=0

Does an approach like this seem reasonable? Should be work this into the
base restricted memslot support?
If the above mentioned semantics difference is not a problem, I don't
have strong objection on this.

Sean, since you have much better understanding on this, what is your
take on this?

Chao
quoted
Thanks,

Mike
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help