Thread: 101 messages, 16 authors, 2022-12-02

Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM

From: Alex Bennée <hidden>
Date: 2022-11-16 09:53:58
Also in: kvm, linux-arch, linux-doc, linux-fsdevel, linux-mm, lkml, qemu-devel

Chao Peng [off-list ref] writes:
On Mon, Nov 14, 2022 at 11:43:37AM +0000, Alex Bennée wrote:
Chao Peng [off-list ref] writes:

<snip>
Introduction
============
KVM userspace being able to crash the host is horrible. Under current
KVM architecture, all guest memory is inherently accessible from KVM
userspace and is exposed to the mentioned crash issue. The goal of this
series is to align mm and KVM on a userspace-inaccessible approach to
exposing guest memory.

Normally, KVM populates the secondary page table (e.g. EPT) by using a host
virtual address (hva) from the core mm page table (e.g. the x86 userspace
page table). This requires guest memory to be mmaped into KVM userspace,
which is also where the mentioned crash issue can arise. In theory, apart
from the 'shared' memory used for device emulation etc., guest memory
doesn't have to be mmaped into KVM userspace.

This series introduces fd-based guest memory which will not be mmaped
into KVM userspace. KVM populates the secondary page table by using a
fd/offset pair backed by a memory file system. The fd can be created
from a supported memory filesystem like tmpfs/hugetlbfs, and KVM can
interact with it directly through a newly introduced in-kernel interface,
thereby removing KVM userspace from the path of accessing/mmaping
the guest memory.

Kirill had a patch [2] to address the same issue in a different way. It
tracks guest encrypted memory at the 'struct page' level and relies on
HWPOISON to reject the userspace access. The patch has been discussed in
several online and offline threads and resulted in a design document [3]
which is also the original proposal for this series. This patch series
has since evolved as more comments were received from the community, but
the major concepts in [3] still hold true, so reading it is recommended.

The patch series may also be useful for other usages; for example, a pure
software approach may use it to harden itself against unintentional
access to guest memory. This series is designed with these usages in
mind but doesn't have code that directly supports them, and extensions
might be needed.
There are a couple of additional use cases where having a consistent
memory interface with the kernel would be useful.
Thanks very much for the info. But I'm not so confident that the current
memfd_restricted() implementation can be useful for all these usages. 
  - Xen DomU guests providing other domains with VirtIO backends

  Xen by default doesn't give other domains special access to a domain's
  memory. The guest can grant access to regions of its memory to other
  domains for this purpose.
I'm trying to form my understanding of how this could work and what
the benefit is for a DomU guest to provide memory through memfd_restricted().
AFAICS, memfd_restricted() can help to hide the memory from DomU userspace,
but I assume VirtIO backends are still in DomU userspace and need access to
that memory, right?
They need access to parts of the memory. At the moment you run your
VirtIO domains in the Dom0 and give them access to the whole of a DomU's
address space - however, in the Xen model the guest's memory is by default
inaccessible to other domains on the system. The DomU guest uses the Xen
grant model to expose portions of its address space to other domains -
namely for the VirtIO queues themselves and any pages containing buffers
involved in the VirtIO transaction. My thought was that looks like a
guest memory interface which is mostly inaccessible (private) with some
holes in it where memory is being explicitly shared with other domains.

What I want to achieve is a common userspace API with defined semantics
for what happens when private and shared regions are accessed, because
having each hypervisor/confidential computing architecture define its
own special API for accessing this memory is just a recipe for
fragmentation and makes sharing common VirtIO backends impossible.
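
The "mostly private with explicitly shared holes" model described above can
be sketched in a few lines of C. This is purely an illustration of the
semantics under discussion: the names `guest_mem`, `guest_mem_share()` and
`guest_mem_access_ok()` are hypothetical and are not part of Xen's grant
table interface, memfd_restricted(), or any existing kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SHARED 8 /* arbitrary limit for this sketch */

/* One explicitly shared hole, covering [start, end). */
struct shared_range {
    uint64_t start;
    uint64_t end;
};

/* Hypothetical model: everything is private unless listed in shared[]. */
struct guest_mem {
    uint64_t size;
    size_t nr_shared;
    struct shared_range shared[MAX_SHARED];
};

/* Mark [start, start + len) as shared; fails if out of bounds or full. */
bool guest_mem_share(struct guest_mem *gm, uint64_t start, uint64_t len)
{
    assert(gm != NULL);
    if (len == 0 || start >= gm->size || len > gm->size - start)
        return false;
    if (gm->nr_shared == MAX_SHARED)
        return false;
    gm->shared[gm->nr_shared++] =
        (struct shared_range){ .start = start, .end = start + len };
    return true;
}

/* An access is allowed only if it lies entirely inside one shared hole. */
bool guest_mem_access_ok(const struct guest_mem *gm, uint64_t addr,
                         uint64_t len)
{
    assert(gm != NULL);
    if (len == 0 || addr >= gm->size || len > gm->size - addr)
        return false;
    for (size_t i = 0; i < gm->nr_shared; i++) {
        const struct shared_range *r = &gm->shared[i];
        if (addr >= r->start && addr + len <= r->end)
            return true;
    }
    return false;
}
```

In this sketch a backend touching a VirtIO queue or buffer page would pass
the check only after the guest had shared that range; what should happen on
a failed check (fault, error return, signal) is exactly the semantics a
common API would need to define.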
  - pKVM on ARM

  Similar to Xen, pKVM moves the management of the page tables into the
  hypervisor and again doesn't allow those domains to share memory by
  default.
Right, we already had some discussions on this in the past versions.
  - VirtIO loopback

  This allows for VirtIO devices for the host kernel to be serviced by
  backends running in userspace. Obviously the memory userspace is
  allowed to access is strictly limited to the buffers and queues
  because giving userspace unrestricted access to the host kernel would
  have consequences.
Okay, but normal memfd_create() should work for it, right? And
memfd_restricted() instead may not work as it unmaps the memory from
userspace.
All of these VirtIO backends work with vhost-user which uses memfds to
pass references to guest memory from the VMM to the backend
implementation.
It sounds to me like these are places where normal memfd_create() can be
used. VirtIO backends work on mmap-ed memory, which is currently not the
case for memfd_restricted(). memfd_restricted() has a different design
purpose: it unmaps the memory from userspace and provides kernel
callbacks so that other kernel modules can use the memory through these
callbacks instead of a userspace virtual address.
Maybe my understanding is backwards then. Are you saying a guest starts
with all its memory exposed and then selectively unmaps the private
regions? Is this driven by the VMM or the guest itself?
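
For reference, the normal memfd_create() path that the thread contrasts
with memfd_restricted() can be sketched as follows. This is a minimal
illustration, assuming Linux with glibc 2.27 or later; the function name
`memfd_shared_view_demo()` and the "guest-ram" label are made up for the
example, and the second mapping merely stands in for a vhost-user backend
that would normally receive the fd over a UNIX socket.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create a memfd of `size` bytes, write `msg` through one mapping and read
 * it back through a second, independent mapping of the same fd.
 * Returns 0 if both views agree, -1 on any error.
 */
int memfd_shared_view_demo(const char *msg, size_t size)
{
    int fd;
    int ret = -1;
    size_t len;
    char *vmm_view;
    char *backend_view;

    assert(msg != NULL);
    len = strlen(msg) + 1;
    if (len > size)
        return -1;

    /* Anonymous, fd-backed memory; the name is only a debugging label. */
    fd = memfd_create("guest-ram", MFD_CLOEXEC);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)size) < 0)
        goto out_close;

    /* The "VMM" view: read/write mapping of the memfd. */
    vmm_view = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (vmm_view == MAP_FAILED)
        goto out_close;

    /* The "backend" view: a separate read-only mapping of the same fd. */
    backend_view = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
    if (backend_view == MAP_FAILED)
        goto out_unmap_vmm;

    memcpy(vmm_view, msg, len);
    ret = (memcmp(backend_view, msg, len) == 0) ? 0 : -1;

    munmap(backend_view, size);
out_unmap_vmm:
    munmap(vmm_view, size);
out_close:
    close(fd);
    return ret;
}
```

Both mappings succeed here because an ordinary memfd can be mmap-ed by
userspace. As described earlier in the thread, memfd_restricted() is
designed so that this userspace mapping step is not available.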

-- 
Alex Bennée