Thread (153 messages) 153 messages, 23 authors, 2023-05-23

Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory

From: Isaku Yamahata <hidden>
Date: 2023-01-18 11:10:49
Also in: kvm, linux-arch, linux-doc, linux-fsdevel, linux-mm, lkml, qemu-devel

On Wed, Jan 18, 2023 at 04:16:41PM +0800,
Chao Peng [off-list ref] wrote:
On Tue, Jan 17, 2023 at 04:34:15PM +0000, Sean Christopherson wrote:
quoted
On Tue, Jan 17, 2023, Chao Peng wrote:
quoted
On Fri, Jan 13, 2023 at 09:54:41PM +0000, Sean Christopherson wrote:
quoted
quoted
+	list_for_each_entry(notifier, &data->notifiers, list) {
+		notifier->ops->invalidate_start(notifier, start, end);
Two major design issues that we overlooked long ago:

  1. Blindly invoking notifiers will not scale.  E.g. if userspace configures a
     VM with a large number of convertible memslots that are all backed by a
     single large restrictedmem instance, then converting a single page will
     result in a linear walk through all memslots.  I don't expect anyone to
     actually do something silly like that, but I also never expected there to be
     a legitimate usecase for thousands of memslots.

  2. This approach fails to provide the ability for KVM to ensure a guest has
     exclusive access to a page.  As discussed in the past, the kernel can rely
     on hardware (and maybe ARM's pKVM implementation?) for those guarantees, but
     only for SNP and TDX VMs.  For VMs where userspace is trusted to some extent,
     e.g. SEV, there is value in ensuring a 1:1 association.

     And probably more importantly, relying on hardware for SNP and TDX yields a
     poor ABI and complicates KVM's internals.  If the kernel doesn't guarantee a
     page is exclusive to a guest, i.e. if userspace can hand out the same page
     from a restrictedmem instance to multiple VMs, then failure will occur only
     when KVM tries to assign the page to the second VM.  That will happen deep
     in KVM, which means KVM needs to gracefully handle such errors, and it means
     that KVM's ABI effectively allows plumbing garbage into its memslots.
It may not be a valid usage, but in my TDX environment I do meet below
issue.

kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7fe1ebfff000 ret=0
kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0xffc00000 size=0x400000 ua=0x7fe271579000 ret=0
kvm_set_user_memory AddrSpace#0 Slot#2 flags=0x4 gpa=0xfeda0000 size=0x20000 ua=0x7fe1ec09f000 ret=-22

Slot#2('SMRAM') is actually an alias into system memory(Slot#0) in QEMU
and slot#2 fails due to below exclusive check.

Currently I changed QEMU code to mark these alias slots as shared
instead of private but I'm not 100% confident this is correct fix.
That's a QEMU bug of sorts.  SMM is mutually exclusive with TDX, QEMU shouldn't
be configuring SMRAM (or any SMM memslots for that matter) for TDX guests.
Thanks for the confirmation. As long as we only bind one notifier for
each address, using xarray does make things simple.
In the past, I had patches for qemu to disable PAM and SMRAM, but they were
dropped for simplicity because SMRAM/PAM are disabled as reset state with unused
memslot registered. TDX guest bios(TDVF or EDK2) doesn't enable them.
Now we can revive them.
-- 
Isaku Yamahata [off-list ref]
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help