Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
From: Sean Christopherson <seanjc@google.com>
Date: 2022-06-14 19:09:21
Also in:
kvm, linux-doc, linux-fsdevel, linux-mm, lkml, qemu-devel
On Tue, Jun 14, 2022, Andy Lutomirski wrote:
On Tue, Jun 14, 2022 at 12:32 AM Chao Peng [off-list ref] wrote:
On Thu, Jun 09, 2022 at 08:29:06PM +0000, Sean Christopherson wrote:
On Wed, Jun 08, 2022, Vishal Annapurve wrote:
One argument is that userspace can simply rely on cgroups to detect misbehaving guests, but (a) those types of OOMs will be a nightmare to debug and (b) an OOM kill from the host is typically considered a _host_ issue and will be treated as a missed SLO. An idea for handling this in the kernel without too much complexity would be to add F_SEAL_FAULT_ALLOCATIONS (terrible name) that would prevent page faults from allocating pages, i.e. holes can only be filled by an explicit fallocate(). Minor faults, e.g. due to NUMA balancing stupidity, and major faults due to swap would still work, but writes to previously unreserved/unallocated memory would get a SIGSEGV on something it has mapped. That would allow the userspace VMM to prevent unintentional allocations without having to coordinate unmapping/remapping across multiple processes.

Since this is mainly for shared memory and the motivation is catching misbehaving accesses, can we use mprotect(PROT_NONE) for this? We can mark the ranges backed by the private fd as PROT_NONE during the conversion, so subsequent misbehaving accesses will be blocked instead of silently causing double allocation.
PROT_NONE, a.k.a. mprotect(), has the same vma downsides as munmap().
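
One of those downsides, as I read it, is that changing protections on part of a mapping splits the VMA, much like unmapping the middle of it would. A minimal sketch of that effect, assuming Linux with /proc mounted; the helper names are made up for illustration and a private anonymous mapping stands in for guest shared memory:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count /proc/self/maps entries that overlap [start, end); -1 on error. */
static int vmas_overlapping(unsigned long start, unsigned long end)
{
	FILE *f = fopen("/proc/self/maps", "r");
	char line[512];
	int count = 0;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		unsigned long lo, hi;

		if (sscanf(line, "%lx-%lx", &lo, &hi) == 2 && lo < end && hi > start)
			count++;
	}
	fclose(f);
	return count;
}

/*
 * Map three pages, mark the middle one PROT_NONE, and return how many
 * additional VMAs now cover the range (hypothetical helper, -1 on error).
 */
int extra_vmas_after_mprotect(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long start, end;
	int before, after;

	if (p == MAP_FAILED)
		return -1;
	start = (unsigned long)p;
	end = start + 3 * page;

	before = vmas_overlapping(start, end);
	if (mprotect(p + page, page, PROT_NONE) != 0) {
		munmap(p, 3 * page);
		return -1;
	}
	after = vmas_overlapping(start, end);

	munmap(p, 3 * page);
	if (before < 0 || after < 0)
		return -1;
	return after - before;
}
```

On the kernels I'd expect this to run on, the single VMA becomes three (RW, PROT_NONE, RW), so the helper should report two extra VMAs. Doing this per converted range, in every process that maps the memory, is the coordination cost being discussed.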
This patch series is fairly close to implementing a rather more efficient solution. I'm not familiar enough with hypervisor userspace to really know if this would work, but: What if shared guest memory could also be file-backed, either in the same fd or with a second fd covering the shared portion of a memslot? This would allow changes to the backing store (punching holes, etc.) to be done without mmap_lock or host-userspace TLB flushes? Depending on what the guest is doing with its shared memory, userspace might need the memory mapped or it might not.
That's what I'm angling for with the F_SEAL_FAULT_ALLOCATIONS idea. The issue, unless I'm misreading the code, is that punching a hole in the shared memory backing store doesn't prevent reallocating that hole on fault, i.e. a helper process that keeps a valid mapping of guest shared memory can silently fill the hole. What we're hoping to achieve is a way to prevent allocating memory without a very explicit action from userspace, e.g. fallocate().
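
The hole-refill behavior can be observed from userspace today. A rough sketch, assuming a Linux memfd (shmem-backed) as a stand-in for the shared backing store; the function name is made up for illustration and st_blocks is used as a proxy for "pages allocated":

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Number of 512-byte blocks currently allocated to fd, or -1 on error. */
static long allocated_blocks(int fd)
{
	struct stat st;

	if (fstat(fd, &st) != 0)
		return -1;
	return (long)st.st_blocks;
}

/*
 * Returns 1 if a write through a still-valid mapping reallocated a
 * punched hole, 0 if it did not, -1 on setup error.
 */
int hole_refilled_by_fault(void)
{
	long page = sysconf(_SC_PAGESIZE);
	long after_punch, after_fault;
	int ret = -1;
	char *p;
	int fd;

	fd = memfd_create("guest-shared", MFD_CLOEXEC);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, page) != 0) {
		close(fd);
		return -1;
	}
	p = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return -1;
	}

	/* The "very explicit action": allocate the page via fallocate(). */
	if (fallocate(fd, 0, 0, page) != 0)
		goto out;

	/* Conversion to private: free the backing page, keep the file size. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, page) != 0)
		goto out;
	after_punch = allocated_blocks(fd);

	/* A helper that kept its mapping writes to the hole. */
	*(volatile char *)p = 1;
	after_fault = allocated_blocks(fd);

	if (after_punch >= 0 && after_fault >= 0)
		ret = after_fault > after_punch;
out:
	munmap(p, page);
	close(fd);
	return ret;
}
```

The write neither faults fatally nor fails; the page simply comes back. With something like F_SEAL_FAULT_ALLOCATIONS (which does not exist, it is only the proposal above) the store would instead get a SIGSEGV, and only the explicit fallocate() path would be able to allocate.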