Thread (97 messages) 97 messages, 14 authors, 2022-11-03

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

From: Chao Peng <hidden>
Date: 2022-09-26 15:45:05
Also in: kvm, linux-api, linux-fsdevel, linux-mm, lkml, qemu-devel

On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
quoted
Regarding pKVM's use case, with the shim approach I believe this can be done by
allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions
piled on top.

My first thought was to make the uAPI a set of KVM ioctls so that KVM could tightly
tightly control usage without taking on too much complexity in the kernel, but
working through things, routing the behavior through the shim itself might not be
all that horrific.

IIRC, we discarded the idea of allowing userspace to map the "private" fd because
things got too complex, but with the shim it doesn't seem _that_ bad.

E.g. on the memfd side:

  1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
     mapping is all or nothing.

  2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
     the restricted memfd.

  3. Add notifier hooks to allow downstream users to further restrict things.

  4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
     one shot.

  5. Require that there are no outstanding references at munmap().  Or if this
     can't be guaranteed by userspace, maybe add some way for userspace to wait
     until it's ok to convert to private?  E.g. so that get_pfn() doesn't need
     to do an expensive check every time.

  static int memfd_restricted_mmap(struct file *file, struct vm_area_struct *vma)
  {
        if (vma->vm_pgoff)
                return -EINVAL;

        if ((vma->vm_end - vma->vm_start) != <file size>)
                return -EINVAL;

        mutex_lock(&data->lock);

        if (data->has_mapping) {
                r = -EINVAL;
                goto err;
        }
        list_for_each_entry(notifier, &data->notifiers, list) {
                r = notifier->ops->mmap_start(notifier, ...);
                if (r)
                        goto abort;
        }

        notifier->ops->mmap_end(notifier, ...);
        mutex_unlock(&data->lock);
        return 0;

  abort:
        list_for_each_entry_continue_reverse(notifier &data->notifiers, list)
                notifier->ops->mmap_abort(notifier, ...);
  err:
        mutex_unlock(&data->lock);
        return r;
  }

  static void memfd_restricted_close(struct vm_area_struct *vma)
  {
        mutex_lock(...);

        /*
         * Destroy the memfd and disable all future accesses if there are
         * outstanding refcounts (or other unsatisfied restrictions?).
         */
        if (<outstanding references> || ???)
                memfd_restricted_destroy(...);
        else
                data->has_mapping = false;

        mutex_unlock(...);
  }

  static int memfd_restricted_may_split(struct vm_area_struct *area, unsigned long addr)
  {
        return -EINVAL;
  }

  static int memfd_restricted_mapping_mremap(struct vm_area_struct *new_vma)
  {
        return -EINVAL;
  }

Then on the KVM side, its mmap_start() + mmap_end() sequence would:

  1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
     memory into the guest (after pre-boot phase).

  2. Be mutually exclusive with shared<=>private conversions, and is allowed if
     and only if the entire gfn range of the associated memslot is shared.
In general I think that this would work with pKVM. However, limiting
private<->shared conversions to the granularity of a whole memslot
might be difficult to handle in pKVM, since the guest doesn't have the
concept of memslots. For example, in pKVM right now, when a guest
shares back its restricted DMA pool with the host it does so at the
page-level. pKVM would also need a way to make an fd accessible again
when shared back, which I think isn't possible with this patch.
But does pKVM really want to mmap/munmap a new region at the page-level,
that can cause VMA fragmentation if the conversion is frequent as I see.
Even with a KVM ioctl for mapping as mentioned below, I think there will
be the same issue.
You were initially considering a KVM ioctl for mapping, which might be
better suited for this since KVM knows which pages are shared and
which ones are private. So routing things through KVM might simplify
things and allow it to enforce all the necessary restrictions (e.g.,
private memory cannot be mapped). What do you think?

Thanks,
/fuad
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help