Thread: 97 messages, 14 authors, 2022-11-03

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

From: Vishal Annapurve <hidden>
Date: 2022-10-18 13:42:32
Also in: kvm, linux-doc, linux-fsdevel, linux-mm, lkml, qemu-devel

On Tue, Oct 18, 2022 at 3:27 AM Kirill A . Shutemov
[off-list ref] wrote:
On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
On 9/15/22 16:29, Chao Peng wrote:
From: "Kirill A. Shutemov" <redacted>

KVM can use memfd-provided memory for guest memory. For normal userspace
accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
virtual address space and then tells KVM to use the virtual address to
set up the mapping in the secondary page table (e.g. EPT).

With confidential computing technologies like Intel TDX, the
memfd-provided memory may be encrypted with a special key for a special
software domain (e.g. KVM guest) and is not expected to be directly
accessed by userspace. More precisely, userspace access to such encrypted
memory may lead to a host crash, so it should be prevented.

This patch introduces a userspace inaccessible memfd (created with
MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
ordinary MMU access (e.g. read/write/mmap) but can be accessed via an
in-kernel interface, so KVM can directly interact with core-mm without
the need to map the memory into KVM userspace.

It provides the semantics required for KVM guest private (encrypted)
memory support: a file descriptor with this flag set is going to be used
as the source of guest memory in confidential computing environments such
as Intel TDX/AMD SEV.

KVM userspace is still in charge of the lifecycle of the memfd. It
should pass the opened fd to KVM. KVM uses the kernel APIs newly added
in this patch to obtain the physical memory address and then populate
the secondary page table entries.

The userspace inaccessible memfd can be fallocate-ed and hole-punched
from userspace. When hole-punching happens, KVM can get notified through
inaccessible_notifier; it then gets a chance to remove any mapped entries
of the range in the secondary page tables.

The userspace inaccessible memfd itself is implemented as a shim layer
on top of real memory file systems like tmpfs/hugetlbfs, but this patch
only implements tmpfs. The allocated memory is currently marked as
unmovable and unevictable; this is required for current confidential
usage, but it might change in the future.

Signed-off-by: Kirill A. Shutemov <redacted>
Signed-off-by: Chao Peng <redacted>
---
...
+static long inaccessible_fallocate(struct file *file, int mode,
+                                  loff_t offset, loff_t len)
+{
+       struct inaccessible_data *data = file->f_mapping->private_data;
+       struct file *memfd = data->memfd;
+       int ret;
+
+       if (mode & FALLOC_FL_PUNCH_HOLE) {
+               if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
+                       return -EINVAL;
+       }
+
+       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
+       inaccessible_notifier_invalidate(data, offset, offset + len);

Wonder if invalidate should precede the actual hole punch, otherwise we open
a window where the page tables point to memory no longer valid?

Yes, you are right. Thanks for catching this.

I also noticed this. But then thought the memory would be zeroed anyway
(hole punched) before this call?

Hole punching can free pages, provided that offset/len cover full pages.

--
  Kiryl Shutsemau / Kirill A. Shutemov
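
The ordering concern above can be illustrated with a small userspace
model. This is only a sketch under stated assumptions: struct mock_state,
mock_invalidate() and mock_punch_hole() are hypothetical stand-ins for the
notifier and for the backing file's fallocate, not code from the patch.
It counts how many secondary page table entries still point at a page at
the moment that page is freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PAGES 8

/* Hypothetical model: backing pages plus the secondary page table. */
struct mock_state {
	bool page_present[NR_PAGES];	/* page exists in the backing file */
	bool spte_mapped[NR_PAGES];	/* secondary PTE points at the page */
	int stale;			/* mappings that outlived their page */
};

void mock_init(struct mock_state *s)
{
	for (size_t i = 0; i < NR_PAGES; i++) {
		s->page_present[i] = true;
		s->spte_mapped[i] = true;
	}
	s->stale = 0;
}

/* Stand-in for the invalidate notifier: drop mappings in [start, end). */
void mock_invalidate(struct mock_state *s, size_t start, size_t end)
{
	assert(start <= end && end <= NR_PAGES);
	for (size_t i = start; i < end; i++)
		s->spte_mapped[i] = false;
}

/* Stand-in for the hole punch: free pages, noting any still mapped. */
void mock_punch_hole(struct mock_state *s, size_t start, size_t end)
{
	assert(start <= end && end <= NR_PAGES);
	for (size_t i = start; i < end; i++) {
		if (s->spte_mapped[i])
			s->stale++;
		s->page_present[i] = false;
	}
}

/* Order used in the quoted hunk: punch first, invalidate afterwards. */
int punch_then_invalidate(struct mock_state *s, size_t start, size_t end)
{
	mock_punch_hole(s, start, end);
	mock_invalidate(s, start, end);
	return s->stale;
}

/* Order suggested in the review: invalidate first, then punch. */
int invalidate_then_punch(struct mock_state *s, size_t start, size_t end)
{
	mock_invalidate(s, start, end);
	mock_punch_hole(s, start, end);
	return s->stale;
}
```

In this model the first ordering frees pages while they are still mapped,
and the second does not. It says nothing about concurrent faults, which
the model does not cover.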
I think moving this notifier_invalidate before fallocate may not solve
the problem completely. Is it possible that, between invalidate and
fallocate, KVM handles a page fault for the guest VM from another vcpu
and uses the pages that are about to be freed to back gpa ranges? Should
hole punching here also update mem_attr first to say that KVM should
consider the corresponding gpa ranges to be no longer backed by
inaccessible memfd?
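
To make the interleaving in question concrete, here is a minimal
single-threaded model that replays it deterministically. Everything in it
is hypothetical: struct mock_vm, the backed_by_fd flag (a loose stand-in
for the mem_attr idea) and the helper names are assumptions for
illustration, not KVM or patch code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PAGES 4

/* Hypothetical model of one guest's view of the fd-backed range. */
struct mock_vm {
	bool page_present[NR_PAGES];	/* page exists in the backing file */
	bool spte_mapped[NR_PAGES];	/* secondary PTE points at the page */
	bool backed_by_fd[NR_PAGES];	/* assumed attribute checked on fault */
};

void vm_init(struct mock_vm *vm)
{
	for (size_t i = 0; i < NR_PAGES; i++) {
		vm->page_present[i] = true;
		vm->spte_mapped[i] = true;
		vm->backed_by_fd[i] = true;
	}
}

/* Fault path: map the page only if the attribute still allows it. */
bool mock_handle_fault(struct mock_vm *vm, size_t idx)
{
	assert(idx < NR_PAGES);
	if (!vm->backed_by_fd[idx] || !vm->page_present[idx])
		return false;
	vm->spte_mapped[idx] = true;
	return true;
}

/*
 * Replay: invalidate, then a fault from another vcpu, then the hole
 * punch. Returns true if a mapping to the freed page is left behind.
 */
bool punch_with_racing_fault(struct mock_vm *vm, size_t idx,
			     bool clear_attr_first)
{
	assert(idx < NR_PAGES);
	if (clear_attr_first)
		vm->backed_by_fd[idx] = false;	/* assumed extra step */
	vm->spte_mapped[idx] = false;		/* invalidate */
	mock_handle_fault(vm, idx);		/* racing fault */
	vm->page_present[idx] = false;		/* hole punch frees the page */
	return vm->spte_mapped[idx];
}
```

Without the extra step the racing fault re-installs the mapping just
before the page is freed; with it, the fault path declines to map. A real
implementation would also need proper locking or a retry scheme, which
this model leaves out.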