Thread (58 messages), 10 authors, 2022-08-25

Re: [PATCH v6 0/8] KVM: mm: fd-based approach for supporting KVM guest private memory

From: Chao Peng <hidden>
Date: 2022-06-14 07:31:40
Also in: kvm, linux-doc, linux-fsdevel, linux-mm, lkml, qemu-devel

On Thu, Jun 09, 2022 at 08:29:06PM +0000, Sean Christopherson wrote:
On Wed, Jun 08, 2022, Vishal Annapurve wrote:
...
With this patch series, it's actually not even possible for the userspace VMM
to allocate a private page by a direct write; it's basically unmapped from
there. If it really wants to, it should do something special, by intention,
that's basically the conversion, which we should allow.
A VM can pass GPA backed by private pages to userspace VMM and when
Userspace VMM accesses the backing hva there will be pages allocated
to back the shared fd causing 2 sets of pages backing the same guest
memory range.
Thanks for bringing this up. But in my mind I still think the userspace VMM
can do that, and it's its responsibility to guarantee it if that is a hard
requirement.
That was my initial reaction too, but there are unfortunate side effects to punting
this to userspace. 
By design, userspace VMM is the decision-maker for page
conversion and has all the necessary information to know which page is
shared/private. It also has the necessary knobs to allocate/free the
physical pages for guest memory. Definitely, we should make userspace
VMM more robust.
Making the userspace VMM more robust to avoid double allocation can get
complex: it will have to keep track of all in-use (by the userspace VMM)
shared fd memory to disallow conversion from shared to private and
will have to ensure that all guest supplied addresses belong to shared
GPA ranges.
IMO, the complexity argument isn't sufficient justification for introducing new
kernel functionality.  If multiple processes are accessing guest memory then there
already needs to be some amount of coordination, i.e. it can't be _that_ complex.

My concern with forcing userspace to fully handle unmapping shared memory is that
it may lead to additional performance overhead and/or noisy neighbor issues, even
if all guests are well-behaved.

Unmapping arbitrary ranges will fragment the virtual address space and consume
more memory for all the resulting VMAs.  The extra memory consumption isn't that big
of a deal, and it will be self-healing to some extent as VMAs will get merged when
the holes are filled back in (if the guest converts back to shared), but it's still
less than desirable.

More concerning is having to take mmap_lock for write for every conversion, which
is very problematic for configurations where a single userspace process maps memory
belonging to multiple VMs.  Unmapping and remapping on every conversion will create a
bottleneck, especially if a VM has sub-optimal behavior and is converting pages at
a high rate.

One argument is that userspace can simply rely on cgroups to detect misbehaving
guests, but (a) those types of OOMs will be a nightmare to debug and (b) an OOM
kill from the host is typically considered a _host_ issue and will be treated as
a missed SLO.

An idea for handling this in the kernel without too much complexity would be to
add F_SEAL_FAULT_ALLOCATIONS (terrible name) that would prevent page faults from
allocating pages, i.e. holes can only be filled by an explicit fallocate().  Minor
faults, e.g. due to NUMA balancing stupidity, and major faults due to swap would
still work, but writes to previously unreserved/unallocated memory would get a
SIGSEGV on something it has mapped.  That would allow the userspace VMM to prevent
unintentional allocations without having to coordinate unmapping/remapping across
multiple processes.
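
For illustration only, the "holes can only be filled by an explicit
fallocate()" half of this idea can be sketched with interfaces that exist
today. F_SEAL_FAULT_ALLOCATIONS itself is only a proposal here, so it shows up
purely as a comment. The helper names and the "guest-shared" memfd name are
made up for this sketch, and a real VMM would likely differ:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a sealable memfd backing shared guest memory. */
static int shared_memfd_create(off_t size)
{
	int fd = memfd_create("guest-shared", MFD_CLOEXEC | MFD_ALLOW_SEALING);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, size) < 0)
		goto err;
	/*
	 * F_SEAL_SHRINK and F_SEAL_GROW exist today.  The proposed seal does
	 * not, so it is NOT passed here; with it, this call would presumably
	 * also include F_SEAL_FAULT_ALLOCATIONS.
	 */
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW) < 0)
		goto err;
	return fd;
err:
	close(fd);
	return -1;
}

/* Explicitly back [offset, offset + len) with pages (private -> shared). */
static int shared_fill_hole(int fd, off_t offset, off_t len)
{
	return fallocate(fd, 0, offset, len);
}

/* Free the pages again, leaving a hole (shared -> private). */
static int shared_punch_hole(int fd, off_t offset, off_t len)
{
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 offset, len);
}
```

Without the proposed seal, a stray write through a mapping of this fd would
still silently fill a punched hole, which is exactly the case the seal is
meant to turn into a SIGSEGV.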
Since this is mainly for shared memory and the motivation is catching
misbehaved accesses, can we use mprotect(PROT_NONE) for this? We can mark
those ranges backed by the private fd as PROT_NONE during the conversion so
subsequent misbehaved accesses will be blocked instead of silently causing
double allocation.

Chao