Re: [PATCH 5/5] KVM: TDX: Add sub-ioctl KVM_TDX_TERMINATE_VM
From: Sean Christopherson <seanjc@google.com>
Date: 2025-08-01 16:44:20
Also in:
kvm, kvmarm, lkml
Subsystem:
kernel virtual machine (kvm), the rest · Maintainers:
Paolo Bonzini, Linus Torvalds
+Chao On Fri, Aug 01, 2025, Adrian Hunter wrote:
On 29/07/2025 22:33, Sean Christopherson wrote:quoted
+static int tdx_terminate_vm(struct kvm *kvm) +{ + if (kvm_trylock_all_vcpus(kvm)) + return -EBUSY; + + kvm_vm_dead(kvm); + to_kvm_tdx(kvm)->vm_terminated = true; + + kvm_unlock_all_vcpus(kvm); + + tdx_mmu_release_hkid(kvm); + + return 0; +}As I think I mentioned when removing vm_dead first came up, I think we need more checks. I spent some time going through the code and came up with what is below: First, we need to avoid TDX VCPU sub-IOCTLs from racing with tdx_mmu_release_hkid(). But having any TDX sub-IOCTL run after KVM_TDX_TERMINATE_VM raises questions of what might happen, so it is much simpler to understand, if that is not possible. There are 3 options: 1. Require that KVM_TDX_TERMINATE_VM is valid only if kvm_tdx->state == TD_STATE_RUNNABLE. Since currently all the TDX sub-IOCTLs are for initialization, that would block the opportunity for any to run after KVM_TDX_TERMINATE_VM. 2. Check vm_terminated in tdx_vm_ioctl() and tdx_vcpu_ioctl() 3. Test KVM_REQ_VM_DEAD in tdx_vm_ioctl() and tdx_vcpu_ioctl() [ Note cannot check is_hkid_assigned() because that is racy ] Secondly, I suggest we avoid SEAMCALLs that will fail and result in KVM_BUG_ON() if HKID has been released. There are 2 groups of those: MMU-related and TDVPS_ACCESSORS. For the MMU-related, the following 2 functions should return an error immediately if vm_terminated: tdx_sept_link_private_spt() tdx_sept_set_private_spte() For that not be racy, extra synchronization is needed so that vm_terminated can be reliably checked when holding mmu lock i.e. static int tdx_terminate_vm(struct kvm *kvm) { if (kvm_trylock_all_vcpus(kvm)) return -EBUSY; kvm_vm_dead(kvm); + + write_lock(&kvm->mmu_lock); to_kvm_tdx(kvm)->vm_terminated = true; + write_unlock(&kvm->mmu_lock); kvm_unlock_all_vcpus(kvm); tdx_mmu_release_hkid(kvm); return 0; } Finally, there are 2 TDVPS_ACCESSORS that need avoiding: tdx_load_mmu_pgd() skip td_vmcs_write64() if vm_terminated tdx_protected_apic_has_interrupt() skip td_state_non_arch_read64() if vm_terminated
Oof. And as Chao pointed out[*], removing the vm_dead check would allow creating and running vCPUs in a dead VM, which is most definitely not desirable. Squashing the vCPU creation case is easy enough if we keep vm_dead but still generally allow ioctls, and it's probably worth doing that no matter what (to plug the hole where pending vCPU creations could succeed):
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d477a7fda0ae..941d2c32b7dc 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c@@ -4207,6 +4207,11 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id) mutex_lock(&kvm->lock); + if (kvm->vm_dead) { + r = -EIO; + goto unlock_vcpu_destroy; + } + if (kvm_get_vcpu_by_id(kvm, id)) { r = -EEXIST; goto unlock_vcpu_destroy;
And then to ensure vCPUs can't do anything, check KVM_REQ_VM_DEAD after acquiring vcpu->mutex.
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6c07dd423458..883077eee4ce 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c@@ -4433,6 +4433,12 @@ static long kvm_vcpu_ioctl(struct file *filp, if (mutex_lock_killable(&vcpu->mutex)) return -EINTR; + + if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu)) { + r = -EIO; + goto out; + } + switch (ioctl) { case KVM_RUN: { struct pid *oldpid;
That should address all TDVPS paths (I hope), and I _think_ would address all MMU-related paths as well? E.g. prefault requires a vCPU. Disallowing (most) vCPU ioctls but not all VM ioctls on vm_dead isn't great ABI (understatement), but I think we need/want the above changes even if we keep the general vm_dead restriction. And given the extremely ad hoc behavior of taking kvm->lock for VM ioctls, trying to enforce vm_dead for "all" VM ioctls seems like a fool's errand. So I'm leaning toward keeping "KVM: Reject ioctls only if the VM is bugged, not simply marked dead" (with a different shortlog+changelog), but keeping vm_dead (and not introducing kvm_tdx.vm_terminated). Thoughts? [*] https://lore.kernel.org/all/aIlzeT+yFG2Tvb3%2F@intel.com (local)