Thread: 97 messages, 14 authors, 2022-11-03

RE: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

From: Wang, Wei W <hidden>
Date: 2022-09-22 13:24:02
Also in: kvm, linux-doc, linux-fsdevel, linux-mm, lkml, qemu-devel

On Thursday, September 22, 2022 5:11 AM, Andy Lutomirski wrote:
To: Christopherson, Sean <seanjc@google.com>; David Hildenbrand
[off-list ref]
Cc: Chao Peng <redacted>; kvm list
[off-list ref]; Linux Kernel Mailing List
[off-list ref]; linux-mm@kvack.org;
linux-fsdevel@vger.kernel.org; Linux API [off-list ref];
linux-doc@vger.kernel.org; qemu-devel@nongnu.org; Paolo Bonzini
[off-list ref]; Jonathan Corbet [off-list ref]; Vitaly
Kuznetsov [off-list ref]; Wanpeng Li [off-list ref];
Jim Mattson [off-list ref]; Joerg Roedel [off-list ref];
Thomas Gleixner [off-list ref]; Ingo Molnar [off-list ref];
Borislav Petkov [off-list ref]; the arch/x86 maintainers [off-list ref];
H. Peter Anvin [off-list ref]; Hugh Dickins [off-list ref]; Jeff
Layton [off-list ref]; J . Bruce Fields [off-list ref]; Andrew
Morton [off-list ref]; Shuah Khan [off-list ref];
Mike Rapoport [off-list ref]; Steven Price [off-list ref];
Maciej S . Szmigiero [off-list ref]; Vlastimil Babka
[off-list ref]; Vishal Annapurve [off-list ref]; Yu Zhang
[off-list ref]; Kirill A. Shutemov
[off-list ref]; Nakajima, Jun [off-list ref];
Hansen, Dave [off-list ref]; Andi Kleen [off-list ref];
aarcange@redhat.com; ddutile@redhat.com; dhildenb@redhat.com; Quentin
Perret [off-list ref]; Michael Roth [off-list ref];
Hocko, Michal [off-list ref]; Muchun Song
[off-list ref]; Wang, Wei W [off-list ref];
Will Deacon [off-list ref]; Marc Zyngier [off-list ref]; Fuad Tabba
[off-list ref]
Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible
memfd

(please excuse any formatting disasters.  my internet went out as I was
composing this, and i did my best to rescue it.)

On Mon, Sep 19, 2022, at 12:10 PM, Sean Christopherson wrote:
+Will, Marc and Fuad (apologies if I missed other pKVM folks)

On Mon, Sep 19, 2022, David Hildenbrand wrote:
On 15.09.22 16:29, Chao Peng wrote:
From: "Kirill A. Shutemov" <redacted>

KVM can use memfd-provided memory for guest memory. For normal
userspace accessible memory, KVM userspace (e.g. QEMU) mmaps the
memfd into its virtual address space and then tells KVM to use the
virtual address to setup the mapping in the secondary page table (e.g.
EPT).

With confidential computing technologies like Intel TDX, the
memfd-provided memory may be encrypted with special key for special
software domain (e.g. KVM guest) and is not expected to be directly
accessed by userspace. Precisely, userspace access to such
encrypted memory may lead to host crash so it should be prevented.
Initially my thought was that this whole inaccessible thing is TDX
specific and there is no need to force that on other mechanisms.
That's why I suggested to not expose this to user space but handle
the notifier requirements internally.

IIUC now, protected KVM has similar demands. Either access
(read/write) of guest RAM would result in a fault and possibly crash
the hypervisor (at least not the whole machine IIUC).
Yep.  The missing piece for pKVM is the ability to convert from shared
to private while preserving the contents, e.g. to hand off a large
buffer (hundreds of MiB) for processing in the protected VM.  Thoughts
on this at the bottom.

This patch introduces userspace inaccessible memfd (created with
MFD_INACCESSIBLE). Its memory is inaccessible from userspace
through ordinary MMU access (e.g. read/write/mmap) but can be
accessed via in-kernel interface so KVM can directly interact with
core-mm without the need to map the memory into KVM userspace.
With secretmem we decided to not add such "concept switch" flags and
instead use a dedicated syscall.
I have no personal preference whatsoever between a flag and a
dedicated syscall, but a dedicated syscall does seem like it would
give the kernel a bit more flexibility.
The third option is a device node, e.g. /dev/kvm_secretmem or
/dev/kvm_tdxmem or similar.  But if we need flags or other details in the
future, maybe this isn't ideal.

What about memfd_inaccessible()? Especially, sealing and hugetlb are
not even supported and it might take a while to support either.
Don't know about sealing, but hugetlb support for "inaccessible"
memory needs to come sooner rather than later.  "inaccessible" in quotes
because we might want to choose a less binary name, e.g.
"restricted"?.

Regarding pKVM's use case, with the shim approach I believe this can
be done by allowing userspace to mmap() the "hidden" memfd, but with a
ton of restrictions piled on top.

My first thought was to make the uAPI a set of KVM ioctls so that KVM
could tightly control usage without taking on too much
complexity in the kernel, but working through things, routing the
behavior through the shim itself might not be all that horrific.

IIRC, we discarded the idea of allowing userspace to map the "private"
fd because things got too complex, but with the shim it doesn't seem
_that_ bad.
What's the exact use case?  Is it just to pre-populate the memory?
One more use case to add here: for TDX live migration support, the destination side
maps the private fd during migration to store the encrypted private memory data sent
from the source. At the end of migration, we unmap it and make it inaccessible before
resuming the TD.