Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
From: Chao Peng <hidden>
Date: 2022-09-26 15:45:05
Also in:
kvm, linux-api, linux-fsdevel, linux-mm, lkml, qemu-devel
On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
quoted
Regarding pKVM's use case, with the shim approach I believe this can be done by allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions piled on top. My first thought was to make the uAPI a set of KVM ioctls so that KVM could tightly tightly control usage without taking on too much complexity in the kernel, but working through things, routing the behavior through the shim itself might not be all that horrific. IIRC, we discarded the idea of allowing userspace to map the "private" fd because things got too complex, but with the shim it doesn't seem _that_ bad. E.g. on the memfd side: 1. The entire memfd must be mapped, and at most one mapping is allowed, i.e. mapping is all or nothing. 2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for the restricted memfd. 3. Add notifier hooks to allow downstream users to further restrict things. 4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in one shot. 5. Require that there are no outstanding references at munmap(). Or if this can't be guaranteed by userspace, maybe add some way for userspace to wait until it's ok to convert to private? E.g. so that get_pfn() doesn't need to do an expensive check every time. static int memfd_restricted_mmap(struct file *file, struct vm_area_struct *vma) { if (vma->vm_pgoff) return -EINVAL; if ((vma->vm_end - vma->vm_start) != <file size>) return -EINVAL; mutex_lock(&data->lock); if (data->has_mapping) { r = -EINVAL; goto err; } list_for_each_entry(notifier, &data->notifiers, list) { r = notifier->ops->mmap_start(notifier, ...); if (r) goto abort; } notifier->ops->mmap_end(notifier, ...); mutex_unlock(&data->lock); return 0; abort: list_for_each_entry_continue_reverse(notifier &data->notifiers, list) notifier->ops->mmap_abort(notifier, ...); err: mutex_unlock(&data->lock); return r; } static void memfd_restricted_close(struct vm_area_struct *vma) { mutex_lock(...); /* * Destroy the memfd and disable all future accesses if there are * outstanding refcounts (or other unsatisfied restrictions?). */ if (<outstanding references> || ???) memfd_restricted_destroy(...); else data->has_mapping = false; mutex_unlock(...); } static int memfd_restricted_may_split(struct vm_area_struct *area, unsigned long addr) { return -EINVAL; } static int memfd_restricted_mapping_mremap(struct vm_area_struct *new_vma) { return -EINVAL; } Then on the KVM side, its mmap_start() + mmap_end() sequence would: 1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero memory into the guest (after pre-boot phase). 2. Be mutually exclusive with shared<=>private conversions, and is allowed if and only if the entire gfn range of the associated memslot is shared.In general I think that this would work with pKVM. However, limiting private<->shared conversions to the granularity of a whole memslot might be difficult to handle in pKVM, since the guest doesn't have the concept of memslots. For example, in pKVM right now, when a guest shares back its restricted DMA pool with the host it does so at the page-level. pKVM would also need a way to make an fd accessible again when shared back, which I think isn't possible with this patch.
But does pKVM really want to mmap/munmap a new region at the page-level, that can cause VMA fragmentation if the conversion is frequent as I see. Even with a KVM ioctl for mapping as mentioned below, I think there will be the same issue.
You were initially considering a KVM ioctl for mapping, which might be better suited for this since KVM knows which pages are shared and which ones are private. So routing things through KVM might simplify things and allow it to enforce all the necessary restrictions (e.g., private memory cannot be mapped). What do you think? Thanks, /fuad