Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory
From: Chao Peng <hidden>
Date: 2022-06-08 05:11:01
Also in:
kvm, linux-api, linux-fsdevel, linux-mm, lkml, qemu-devel
On Tue, Jun 07, 2022 at 05:55:46PM -0700, Marc Orr wrote:
On Tue, Jun 7, 2022 at 12:01 AM Chao Peng [off-list ref] wrote:
On Mon, Jun 06, 2022 at 01:09:50PM -0700, Vishal Annapurve wrote:
Private memory map/unmap and conversion
---------------------------------------
Userspace's map/unmap operations are done by fallocate() on the backing store fd.
  - map: default fallocate() with mode=0.
  - unmap: fallocate() with FALLOC_FL_PUNCH_HOLE.
The map/unmap will trigger above memfile_notifier_ops to let KVM map/unmap secondary MMU page tables.

.....
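As a rough illustration of the map/unmap operations described above, the following C sketch issues the two fallocate() calls on a file descriptor. It assumes an ordinary memfd as a stand-in for the private backing store; the fd type introduced by the series and its memfile_notifier callbacks are not reproduced here, and the helper names backing_map, backing_unmap and backing_blocks are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>          /* fallocate() */
#include <linux/falloc.h>   /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <sys/mman.h>       /* memfd_create(), used by the caller */
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* "map": allocate backing pages for [offset, offset + len) with mode=0. */
static int backing_map(int fd, off_t offset, off_t len)
{
    assert(len > 0);
    return fallocate(fd, 0, offset, len);
}

/*
 * "unmap": free the backing pages for the range.
 * fallocate(2) requires FALLOC_FL_KEEP_SIZE together with PUNCH_HOLE.
 */
static int backing_unmap(int fd, off_t offset, off_t len)
{
    assert(len > 0);
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}

/* Blocks (512-byte units) currently allocated to the file, or -1 on error. */
static long long backing_blocks(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return (long long)st.st_blocks;
}
```

A caller could create the stand-in fd with memfd_create("guest-ram", 0), call backing_map() for a range, and later call backing_unmap() on the same range; backing_blocks() gives a coarse view of how much is allocated at each step.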
QEMU: https://github.com/chao-p/qemu/tree/privmem-v6

An example QEMU command line for TDX test:
-object tdx-guest,id=tdx \
-object memory-backend-memfd-private,id=ram1,size=2G \
-machine q35,kvm-type=tdx,pic=no,kernel_irqchip=split,memory-encryption=tdx,memory-backend=ram1

There should be more discussion around double allocation scenarios when using the private fd approach. A malicious guest or buggy userspace VMM can cause physical memory to be allocated for both the shared (accessible from the host) and private fds backing the guest memory. The userspace VMM will need to unback the shared guest memory while handling the conversion from shared to private in order to prevent double allocation, even with malicious guests or bugs in the userspace VMM.

I don't know how a malicious guest can cause that. The initial design of this series is to put the private/shared memory into two different address spaces and give the userspace VMM the flexibility to convert between the two. It can choose to respect the guest conversion request or not.

For example, the guest could maliciously give a device driver a private page so that a host-side virtual device will blindly write the private page.
With this patch series, it is not even possible for the userspace VMM to allocate a private page by a direct write, because the private memory is basically unmapped from userspace. If it really wants to, it has to do something special, intentionally; that is basically the conversion, which we should allow.
It's possible for a userspace VMM to cause double allocation if it fails to call the unback operation during the conversion; this may or may not be a bug. Double allocation may not be wrong, even conceptually. At least TDX allows you to use half shared, half private memory in the guest, meaning both shared and private can be in effect. Unbacking the memory is just the current QEMU implementation choice.

Right. But the idea is that this patch series should accommodate all of the CVM architectures. Or at least that's what I know was envisioned last time we discussed this topic for SNP [*].
AFAICS, this series should work for both TDX and SNP, and other CVM architectures. I don't see where TDX can work but SNP cannot, or did I miss something here?
Regardless, it's important to ensure that the VM respects its memory budget. For example, within Google, we run VMs inside of containers. So if we double allocate we're going to OOM. This seems acceptable for an early version of CVMs. But ultimately, I think we need a more robust way to ensure that the VM operates within its memory container. Otherwise, the OOM is going to be hard to diagnose and distinguish from a real OOM.
Thanks for bringing this up. But I still think the userspace VMM can do that, and it is its responsibility to guarantee it if that is a hard requirement. By design, the userspace VMM is the decision-maker for page conversion and has all the necessary information to know which page is shared/private. It also has the necessary knobs to allocate/free the physical pages for guest memory.

Definitely, we should make the userspace VMM more robust.

Chao
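To make the point about the VMM holding the knobs concrete, here is a minimal C sketch of a conversion helper that frees the range in one backing fd before allocating it in the other, so a given range is not backed twice. It is only an assumption-laden illustration: both fds are ordinary memfds, struct guest_mem and all function names are hypothetical, and none of this is taken from the QEMU reference implementation.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>          /* fallocate() */
#include <linux/falloc.h>   /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <sys/mman.h>       /* memfd_create() */
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical per-VM state: one backing fd per address space. */
struct guest_mem {
    int shared_fd;   /* host-accessible backing */
    int private_fd;  /* stand-in for the private backing store */
};

/* Create both backing fds; returns 0 on success, -1 on failure. */
static int guest_mem_open(struct guest_mem *m)
{
    m->shared_fd = memfd_create("guest-shared", 0);
    if (m->shared_fd < 0)
        return -1;
    m->private_fd = memfd_create("guest-private", 0);
    if (m->private_fd < 0) {
        close(m->shared_fd);
        return -1;
    }
    return 0;
}

/* Free the range in from_fd first, then allocate it in to_fd. */
static int convert_range(int to_fd, int from_fd, off_t off, off_t len)
{
    assert(len > 0);
    if (fallocate(from_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  off, len) != 0)
        return -1;
    return fallocate(to_fd, 0, off, len);
}

static int convert_to_private(struct guest_mem *m, off_t off, off_t len)
{
    return convert_range(m->private_fd, m->shared_fd, off, len);
}

static int convert_to_shared(struct guest_mem *m, off_t off, off_t len)
{
    return convert_range(m->shared_fd, m->private_fd, off, len);
}
```

Freeing before allocating keeps the range within a fixed memory budget during the conversion, at the cost of leaving it unbacked if the second fallocate() fails; the opposite order trades that for a transient double allocation.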
[*] https://lore.kernel.org/all/20210820155918.7518-1-brijesh.singh@amd.com/
Chao
Options to unback shared guest memory seem to be:
1) madvise(.., MADV_DONTNEED/MADV_REMOVE) - This option won't stop the kernel from backing the shared memory on subsequent write accesses.
2) fallocate(..., FALLOC_FL_PUNCH_HOLE...) - For file-backed shared guest memory, this option is similar to madvise, since it would still allow shared memory to get backed on write accesses.
3) munmap - This would give away the contiguous virtual memory region reservation, leaving holes in the guest backing memory, which might make guest memory management difficult.
4) mprotect(... PROT_NONE) - This would keep the virtual memory address range backing the guest memory preserved.

ram_block_discard_range_fd from the reference implementation (https://github.com/chao-p/qemu/tree/privmem-v6) seems to be relying on fallocate/madvise.

Any thoughts/suggestions around better ways to unback the shared memory in order to avoid double allocation scenarios?

I agree with Vishal. I think this patch set is making great progress. But the double allocation scenario seems like a high-level design issue that warrants more discussion.
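For reference, options 1 and 4 from the list above can be sketched in C as follows. This is a minimal sketch assuming a private anonymous mapping as the shared guest memory, which may not match how a real VMM backs it; the function names unback_shared and reback_shared are hypothetical, and combining the two calls is an illustration rather than something proposed in this thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>   /* madvise(), mprotect(), MADV_DONTNEED, PROT_* */

/*
 * Drop the pages backing [addr, addr + len) and then make the range
 * inaccessible through this mapping. MADV_DONTNEED alone frees the pages,
 * but a later write through the mapping would simply back them again;
 * PROT_NONE makes such an access fault while the virtual address range
 * stays reserved.
 */
static int unback_shared(void *addr, size_t len)
{
    assert(len > 0);
    if (madvise(addr, len, MADV_DONTNEED) != 0)
        return -1;
    return mprotect(addr, len, PROT_NONE);
}

/* Make the range accessible again, e.g. on a private-to-shared conversion. */
static int reback_shared(void *addr, size_t len)
{
    assert(len > 0);
    return mprotect(addr, len, PROT_READ | PROT_WRITE);
}
```

With a private anonymous mapping, data written before unback_shared() is gone after reback_shared(): the range reads back as zero-filled pages that are allocated again on access.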