Re: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for... | linux-mips

Re: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

From: Ackerley Tng <hidden>
Date: 2023-08-21 17:30:13
Also in: kvm, kvm-riscv, kvmarm, linux-arm-kernel, linux-fsdevel, linux-mm, linux-riscv, linux-security-module, linuxppc-dev, lkml

Possibly related (same subject, not in this thread)

2023-09-15 · Re: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory · Sean Christopherson <seanjc@google.com>
2023-09-14 · Re: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory · Ackerley Tng <hidden>
2023-09-14 · Re: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory · Sean Christopherson <seanjc@google.com>
2023-09-14 · Re: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory · Sean Christopherson <seanjc@google.com>
2023-09-01 · Re: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory · Ackerley Tng <hidden>

Sean Christopherson [off-list ref] writes:

On Tue, Aug 15, 2023, Ackerley Tng wrote:

quoted

Sean Christopherson [off-list ref] writes:

quoted

I feel that memslots form a natural way of managing usage of the gmem
file. When a memslot is created, it is using the file; hence we take a
refcount on the gmem file, and as memslots are removed, we drop
refcounts on the gmem file.

Yes and no.  It's definitely more natural *if* the goal is to allow guest_memfd
memory to exist without being attached to a VM.  But I'm not at all convinced
that we want to allow that, or that it has desirable properties.  With TDX and
SNP in particuarly, I'm pretty sure that allowing memory to outlive the VM is
very underisable (more below).

This is a little confusing, with the file/inode split in gmem where the
physical memory/data is attached to the inode and the file represents
the VM's view of that memory, won't the memory outlive the VM?

Doh, I overloaded the term "VM".  By "VM" I meant the virtual machine as a "thing"
the rest of the world sees and interacts with, not the original "struct kvm" object.

Because yes, you're absolutely correct that the memory will outlive "struct kvm",
but it won't outlive the virtual machine, and specifically won't outlive the
ASID (SNP) / HKID (TDX) to which it's bound.

Yup we agree on this now :) The memory should not outlive the the ASID
(SNP) / HKID (TDX) to which it's bound.

quoted

This [1] POC was built based on that premise, that the gmem inode can be
linked to another file and handed off to another VM, to facilitate
intra-host migration, where the point is to save the work of rebuilding
the VM's memory in the destination VM.

With this, the bindings don't outlive the VM, but the data/memory
does. I think this split design you proposed is really nice.

quoted

The KVM pointer is shared among all the bindings in gmem’s xarray, and we can
enforce that a gmem file is used only with one VM:

+ When binding a memslot to the file, if a kvm pointer exists, it must
  be the same kvm as the one in this binding
+ When the binding to the last memslot is removed from a file, NULL the
  kvm pointer.

Nullifying the KVM pointer isn't sufficient, because without additional actions
userspace could extract data from a VM by deleting its memslots and then binding
the guest_memfd to an attacker controlled VM.  Or more likely with TDX and SNP,
induce badness by coercing KVM into mapping memory into a guest with the wrong
ASID/HKID.

I can think of three ways to handle that:

  (a) prevent a different VM from *ever* binding to the gmem instance
  (b) free/zero physical pages when unbinding
  (c) free/zero when binding to a different VM

Option (a) is easy, but that pretty much defeats the purpose of decopuling
guest_memfd from a VM.

Option (b) isn't hard to implement, but it screws up the lifecycle of the memory,
e.g. would require memory when a memslot is deleted.  That isn't necessarily a
deal-breaker, but it runs counter to how KVM memlots currently operate.  Memslots
are basically just weird page tables, e.g. deleting a memslot doesn't have any
impact on the underlying data in memory.  TDX throws a wrench in this as removing
a page from the Secure EPT is effectively destructive to the data (can't be mapped
back in to the VM without zeroing the data), but IMO that's an oddity with TDX and
not necessarily something we want to carry over to other VM types.

There would also be performance implications (probably a non-issue in practice),
and weirdness if/when we get to sharing, linking and/or mmap()ing gmem.  E.g. what
should happen if the last memslot (binding) is deleted, but there outstanding userspace
mappings?

Option (c) is better from a lifecycle perspective, but it adds its own flavor of
complexity, e.g. the performant way to reclaim TDX memory requires the TDMR
(effectively the VM pointer), and so a deferred relcaim doesn't really work for
TDX.  And I'm pretty sure it *can't* work for SNP, because RMP entries must not
outlive the VM; KVM can't reuse an ASID if there are pages assigned to that ASID
in the RMP, i.e. until all memory belonging to the VM has been fully freed.

If we are on the same page that the memory should outlive the VM but not
the bindings, then associating the gmem inode to a new VM should be a
feature and not a bug.

What do we want to defend against here?

(a) Malicious host VMM

For a malicious host VMM to read guest memory (with TDX and SNP), it can
create a new VM with the same HKID/ASID as the victim VM, rebind the
gmem inode to a VM crafted with an image that dumps the memory.

I believe it is not possible for userspace to arbitrarily select a
matching HKID unless userspace uses the intra-host migration ioctls, but if the
migration ioctl is used, then EPTs are migrated and the memory dumper VM
can't successfully run a different image from the victim VM. If the
dumper VM needs to run the same image as the victim VM, then it would be
a successful migration rather than an attack. (Perhaps we need to clean
up some #MCs here but that can be a separate patch).

From a guest security perspective, throw TDX and SNP out the window.  As far as
the design of guest_memfd is concerned, I truly do not care what security properties
they provide, I only care about whether or not KVM's support for TDX and SNP is
clean, robust, and functionally correct.

Note, I'm not saying I don't care about TDX/SNP.  What I'm saying is that I don't
want to design something that is beneficial only to what is currently a very
niche class of VMs that require specific flavors of hardware.

quoted

(b) Malicious host kernel

A malicious host kernel can allow a malicious host VMM to re-use a HKID
for the dumper VM, but this isn't something a better gmem design can
defend against.

Yep, completely out-of-scope.

quoted

(c) Attacks using gmem for software-protected VMs

Attacks using gmem for software-protected VMs are possible since there
is no real encryption with HKID/ASID (yet?). The selftest for [1]
actually uses this lack of encryption to test that the destination VM
can read the source VM's memory after the migration. In the POC [1], as
long as both destination VM knows where in the inode's memory to read,
it can read what it wants to.

Encryption is not required to protect guest memory from less privileged software.
The selftests don't rely on lack of encryption, they rely on KVM incorporating
host userspace into the TCB.

Just because this RFC doesn't remove the VMM from the TCB for SW-protected VMS,
doesn't mean we _can't_ remove the VMM from the TCB.  pKVM has already shown that
such an implementation is possible.  We didn't tackle pKVM-like support in the
initial implementation because it's non-trivial, doesn't yet have a concrete use
case to fund/drive development, and would have significantly delayed support for
the use cases people do actually care about.

There are certainly benefits from memory being encrypted, but it's neither a
requirement nor a panacea, as proven by the never ending stream of speculative
execution attacks.

quoted

This is a problem for software-protected VMs, but I feel that it is also a
separate issue from gmem's design.

No, I don't want guest_memfd to be just be a vehicle for SNP/TDX VMs.  Having line
of sight to removing host userspace from the TCB is absolutely a must have for me,
and having line of sight to improving KVM's security posture for "regular" VMs is
even more of a must have.  If guest_memfd doesn't provide us a very direct path to
(eventually) achieving those goals, then IMO it's a failure.

Which leads me to:

(d) Buggy components

Today, for all intents and purposes, guest memory *must* be mapped writable in
the VMM, which means it is all too easy for a benign-but-buggy host component to
corrupt guest memory.  There are ways to mitigate potential problems, e.g. by
developing userspace to adhere to the principle of least privilege inasmuch as
possible, but such mitigations would be far less robust than what can be achieved
via guest_memfd, and practically speaking I don't see us (Google, but also KVM in
general) making progress on deprivileging userspace without forcing the issue.

Thanks for adding this point! I should clarify that when I asked about
what we want to defend against, I meant that in response to the point
that nulling the KVM pointer is insufficient. IIUC (d) explains what the
whole of gmem is meant to defend against.

I agree with you that nulling the KVM pointer is insufficient to keep
host userspace out of the TCB. Among the three options (a) preventing a
different VM (HKID/ASID) from binding to the gmem instance, or zeroing
the memory either (b) on unbinding, or (c) on binding to another VM
(HKID/ASID),

(a) sounds like adding a check issued to TDX/SNP upon binding and this
    check would just return OK for software-protected VMs (line of sight
    to removing host userspace from TCB).

Or, we could go further for software-protected VMs and add tracking in
the inode to prevent the same inode from being bound to different
"HKID/ASID"s, perhaps like this:

+ On first binding, store the KVM pointer in the inode - not file (but
  not hold a refcount)
+ On rebinding, check that the KVM matches the pointer in the inode
+ On intra-host migration, update the KVM pointer in the inode to allow
  binding to the new struct kvm

I think you meant associating the file with a struct kvm at creation
time as an implementation for (a), but technically since the inode is
the representation of memory, tracking of struct kvm should be with the
inode instead of the file.

(b) You're right that this messes up the lifecycle of the memory and
    wouldn't work with intra-host migration.

(c) sounds like doing the clearing on a check similar to that of (a)

If we track struct kvm with the inode, then I think (a), (b) and (c) can
be independent of the refcounting method. What do you think?

quoted

Could binding gmem files not on creation, but at memslot configuration
time be sufficient and simpler?

After working through the flows, I think binding on-demand would simplify the
refcounting (stating the obvious), but complicate the lifecycle of the memory as
well as the contract between KVM and userspace,

If we are on the same page that the memory should outlive the VM but not
the bindings, does it still complicate the lifecycle of the memory and
the userspace/KVM contract? Could it just be a different contract?

Not entirely sure I understand what you're asking.  Does this question go away
with my clarification about struct kvm vs. virtual machine?

Yes, this question goes away. Thanks!

quoted

and would break the separation of
concerns between the inode (physical memory / data) and file (VM's view / mappings).

Binding on-demand is orthogonal to the separation of concerns between
inode and file, because it can be built regardless of whether we do the
gmem file/inode split.

+ This flip-the-refcounting POC is built with the file/inode split and
+ In [2] (the delayed binding approach to solve intra-host migration), I
  also tried flipping the refcounting, and that without the gmem
  file/inode split. (Refcounting in [2] is buggy because the file can't
  take a refcount on KVM, but it would work without taking that refcount)

[1] https://lore.kernel.org/lkml/cover.1691446946.git.ackerleytng@google.com/T/ (local)
[2] https://github.com/googleprodkernel/linux-cc/commit/dd5ac5e53f14a1ef9915c9c1e4cc1006a40b49df

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help