Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE
From: Chao Peng <hidden>
Date: 2023-01-18 09:15:44
Also in:
kvm, linux-arch, linux-doc, linux-fsdevel, linux-mm, lkml, qemu-devel
On Tue, Jan 17, 2023 at 07:35:58PM +0000, Sean Christopherson wrote:
On Tue, Jan 17, 2023, Chao Peng wrote:quoted
On Sat, Jan 14, 2023 at 12:01:01AM +0000, Sean Christopherson wrote:quoted
On Fri, Dec 02, 2022, Chao Peng wrote:quoted
@@ -10357,6 +10364,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu)) static_call(kvm_x86_update_cpu_dirty_logging)(vcpu); + + if (kvm_check_request(KVM_REQ_MEMORY_MCE, vcpu)) { + vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;Synthesizing triple fault shutdown is not the right approach. Even with TDX's MCE "architecture" (heavy sarcasm), it's possible that host userspace and the guest have a paravirt interface for handling memory errors without killing the host.Agree shutdown is not the correct choice. I see you made below change: send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current) The MCE may happen in any thread than KVM thread, sending siginal to 'current' thread may not be the expected behavior.This is already true today, e.g. a #MC in memory that is mapped into the guest can be triggered by a host access. Hrm, but in this case we actually have a KVM instance, and we know that the #MC is relevant to the KVM instance, so I agree that signaling 'current' is kludgy.quoted
Also how userspace can tell is the MCE on the shared page or private page? Do we care?We care. I was originally thinking we could require userspace to keep track of things, but that's quite prescriptive and flawed, e.g. could race with conversions. One option would be to KVM_EXIT_MEMORY_FAULT, and then wire up a generic (not x86 specific) KVM request to exit to userspace, e.g. /* KVM_EXIT_MEMORY_FAULT */ struct { #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3) #define KVM_MEMORY_EXIT_FLAG_HW_ERROR (1ULL << 4) __u64 flags; __u64 gpa; __u64 size; } memory; But I'm not sure that's the correct approach. It kinda feels like we're reinventing the wheel. It seems like restrictedmem_get_page() _must_ be able to reject attempts to get a poisoned page, i.e. restrictedmem_get_page() should yield KVM_PFN_ERR_HWPOISON.
Yes, I see there is -EHWPOISON handling for hva_to_pfn() for shared memory. It makes sense doing similar for private page.
Assuming that's the case, then I believe KVM simply needs to zap SPTEs in response to an error notification in order to force vCPUs to fault on the poisoned page.
Agree, this is waht we should do anyway.
quoted
quoted
quoted
+ return -EINVAL; if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM) return -EINVAL; if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)@@ -2020,6 +2154,9 @@ int __kvm_set_memory_region(struct kvm *kvm, if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages) return -EINVAL; } else { /* Modify an existing slot. */ + /* Private memslots are immutable, they can only be deleted. */I'm 99% certain I suggested this, but if we're going to make these memslots immutable, then we should straight up disallow dirty logging, otherwise we'll end up with a bizarre uAPI.But in my mind dirty logging will be needed in the very short time, when live migration gets supported?Ya, but if/when live migration support is added, private memslots will no longer be immutable as userspace will want to enable dirty logging only when a VM is being migrated, i.e. something will need to change. Given that it looks like we have clear line of sight to SEV+UPM guests, my preference would be to allow toggling dirty logging from the get-go. It doesn't necessarily have to be in the first patch, e.g. KVM could initially reject KVM_MEM_LOG_DIRTY_PAGES + KVM_MEM_PRIVATE and then add support separately to make the series easier to review, test, and bisect. static int check_memory_region_flags(struct kvm *kvm, const struct kvm_userspace_memory_region2 *mem) { u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES; if (kvm_arch_has_private_mem(kvm) && ~(mem->flags & KVM_MEM_LOG_DIRTY_PAGES)) valid_flags |= KVM_MEM_PRIVATE;
Adding this limitation is OK to me. It's not too hard to remove it when live migration gets added.
... }quoted
quoted
quoted
+ if (mem->flags & KVM_MEM_PRIVATE) + return -EINVAL; if ((mem->userspace_addr != old->userspace_addr) || (npages != old->npages) || ((mem->flags ^ old->flags) & KVM_MEM_READONLY))@@ -2048,10 +2185,28 @@ int __kvm_set_memory_region(struct kvm *kvm, new->npages = npages; new->flags = mem->flags; new->userspace_addr = mem->userspace_addr; + if (mem->flags & KVM_MEM_PRIVATE) { + new->restricted_file = fget(mem->restricted_fd); + if (!new->restricted_file || + !file_is_restrictedmem(new->restricted_file)) { + r = -EINVAL; + goto out; + } + new->restricted_offset = mem->restricted_offset;I see you changed slot->restricted_offset type from loff_t to gfn_t and used pgoff_t when doing the restrictedmem_bind/unbind(). Using page index is reasonable KVM internally and sounds simpler than loff_t. But we also need initialize it to page index here as well as changes in another two cases. This is needed when restricted_offset != 0.Oof. I'm pretty sure I completely missed that loff_t is used for byte offsets, whereas pgoff_t is a frame index. Given that the restrictmem APIs take pgoff_t, I definitely think it makes sense to the index, but I'm very tempted to store pgoff_t instead of gfn_t, and name the field "index" to help connect the dots to the rest of kernel, where "pgoff_t index" is quite common. And looking at those bits again, we should wrap all of the restrictedmem fields with CONFIG_KVM_PRIVATE_MEM. It'll require minor tweaks to __kvm_set_memory_region(), but I think will yield cleaner code (and internal APIs) overall. And wrap the three fields in an anonymous struct? E.g. this is a little more versbose (restrictedmem instead restricted), but at first glance it doesn't seem to cause widespared line length issues. #ifdef CONFIG_KVM_PRIVATE_MEM struct { struct file *file; pgoff_t index; struct restrictedmem_notifier notifier; } restrictedmem; #endif
Looks better. Thanks, Chao
quoted
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 547b92215002..49e375e78f30 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h@@ -2364,8 +2364,7 @@ static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn, kvm_pfn_t *pfn, int *order) { - pgoff_t index = gfn - slot->base_gfn + - (slot->restricted_offset >> PAGE_SHIFT); + pgoff_t index = gfn - slot->base_gfn + slot->restricted_offset; struct page *page; int ret;diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 01db35ddd5b3..7439bdcb0d04 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c@@ -935,7 +935,7 @@ static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot, pgoff_t start, pgoff_t end, gfn_t *gfn_start, gfn_t *gfn_end) { - unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT; + unsigned long base_pgoff = slot->restricted_offset; if (start > base_pgoff) *gfn_start = slot->base_gfn + start - base_pgoff;@@ -2275,7 +2275,7 @@ int __kvm_set_memory_region(struct kvm *kvm, r = -EINVAL; goto out; } - new->restricted_offset = mem->restricted_offset; + new->restricted_offset = mem->restricted_offset >> PAGE_SHIFT; } r = kvm_set_memslot(kvm, old, new, change);Chaoquoted
quoted
+ } + + new->kvm = kvm;Set this above, just so that the code flows better.