Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
From: Andy Lutomirski <luto@kernel.org>
Date: 2022-06-14 21:00:03
Also in: kvm, linux-doc, linux-fsdevel, linux-mm, lkml, qemu-devel
On Tue, Jun 14, 2022 at 12:09 PM Sean Christopherson [off-list ref] wrote:
> On Tue, Jun 14, 2022, Andy Lutomirski wrote:
> > On Tue, Jun 14, 2022 at 12:32 AM Chao Peng [off-list ref] wrote:
> > > On Thu, Jun 09, 2022 at 08:29:06PM +0000, Sean Christopherson wrote:
> > > > On Wed, Jun 08, 2022, Vishal Annapurve wrote:
> > > >
> > > > One argument is that userspace can simply rely on cgroups to detect misbehaving guests, but (a) those types of OOMs will be a nightmare to debug and (b) an OOM kill from the host is typically considered a _host_ issue and will be treated as a missed SLO.
> > > >
> > > > An idea for handling this in the kernel without too much complexity would be to add F_SEAL_FAULT_ALLOCATIONS (terrible name) that would prevent page faults from allocating pages, i.e. holes can only be filled by an explicit fallocate(). Minor faults, e.g. due to NUMA balancing stupidity, and major faults due to swap would still work, but writes to previously unreserved/unallocated memory would get a SIGSEGV on something it has mapped. That would allow the userspace VMM to prevent unintentional allocations without having to coordinate unmapping/remapping across multiple processes.
> > >
> > > Since this is mainly for shared memory and the motivation is catching misbehaved access, can we use mprotect(PROT_NONE) for this? We can mark those ranges backed by private fd as PROT_NONE during the conversion so subsequent misbehaved accesses will be blocked instead of causing double allocation silently.
>
> PROT_NONE, a.k.a. mprotect(), has the same vma downsides as munmap().
>
> > This patch series is fairly close to implementing a rather more efficient solution. I'm not familiar enough with hypervisor userspace to really know if this would work, but:
> >
> > What if shared guest memory could also be file-backed, either in the same fd or with a second fd covering the shared portion of a memslot? This would allow changes to the backing store (punching holes, etc) to be done without mmap_lock or host-userspace TLB flushes? Depending on what the guest is doing with its shared memory, userspace might need the memory mapped or it might not.
>
> That's what I'm angling for with the F_SEAL_FAULT_ALLOCATIONS idea. The issue, unless I'm misreading code, is that punching a hole in the shared memory backing store doesn't prevent reallocating that hole on fault, i.e. a helper process that keeps a valid mapping of guest shared memory can silently fill the hole.
>
> What we're hoping to achieve is a way to prevent allocating memory without a very explicit action from userspace, e.g. fallocate().

Ah, I misunderstood. I thought your goal was to mmap it and prevent page faults from allocating.

It is indeed the case (and has been since before quite a few of us were born) that a hole in a sparse file is logically just a bunch of zeros. A way to make a file for which a hole is an actual hole seems like it would solve this problem nicely.

It could also be solved more specifically for KVM by making sure that the private/shared mode that userspace programs is strict enough to prevent accidental allocations -- if a GPA is definitively private, shared, neither, or (potentially, on TDX only) both, then a page that *isn't* shared will never be accidentally allocated by KVM. If the shared backing is not mmapped, it also won't be accidentally allocated by host userspace on a stray or careless write.

--Andy