Thread (97 messages) 97 messages, 14 authors, 2022-11-03

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

From: Fuad Tabba <hidden>
Date: 2022-09-23 15:21:17
Also in: kvm, linux-doc, linux-fsdevel, linux-mm, lkml, qemu-devel

Hi,

<...>
quoted
Regarding pKVM's use case, with the shim approach I believe this can be done by
allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions
piled on top.

My first thought was to make the uAPI a set of KVM ioctls so that KVM
could tightly
tightly control usage without taking on too much complexity in the
kernel, but
working through things, routing the behavior through the shim itself
might not be
all that horrific.

IIRC, we discarded the idea of allowing userspace to map the "private"
fd because
things got too complex, but with the shim it doesn't seem _that_ bad.
What's the exact use case?  Is it just to pre-populate the memory?
Prepopulate memory and access memory that could go back and forth from
being shared to being private.

Cheers,
/fuad


quoted
E.g. on the memfd side:

  1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
     mapping is all or nothing.

  2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
     the restricted memfd.

  3. Add notifier hooks to allow downstream users to further restrict things.

  4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
     one shot.

  5. Require that there are no outstanding references at munmap().  Or if this
     can't be guaranteed by userspace, maybe add some way for userspace to wait
     until it's ok to convert to private?  E.g. so that get_pfn() doesn't need
     to do an expensive check every time.
Hmm.  I haven't looked at the code to see if this would really work, but I think this could be done more in line with how the rest of the kernel works by using the rmap infrastructure.  When the pKVM memfd is in not-yet-private mode, just let it be mmapped as usual (but don't allow any form of GUP or pinning).  Then have an ioctl to switch to to shared mode that takes locks or sets flags so that no new faults can be serviced and does unmap_mapping_range.

As long as the shim arranges to have its own vm_ops, I don't immediately see any reason this can't work.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help