Thread (58 messages) 58 messages, 10 authors, 2022-08-25

Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory

From: Michael Roth <hidden>
Date: 2022-06-24 13:02:12
Also in: kvm, linux-doc, linux-fsdevel, linux-mm, lkml, qemu-devel

On Fri, Jun 24, 2022 at 04:54:26PM +0800, Chao Peng wrote:
On Thu, Jun 23, 2022 at 05:59:49PM -0500, Michael Roth wrote:
quoted
On Fri, May 20, 2022 at 06:31:02PM +0000, Sean Christopherson wrote:
quoted
On Fri, May 20, 2022, Andy Lutomirski wrote:
quoted
The alternative would be to have some kind of separate table or bitmap (part
of the memslot?) that tells KVM whether a GPA should map to the fd.

What do you all think?
My original proposal was to have expolicit shared vs. private memslots, and punch
holes in KVM's memslots on conversion, but due to the way KVM (and userspace)
handle memslot updates, conversions would be painfully slow.  That's how we ended
up with the current propsoal.

But a dedicated KVM ioctl() to add/remove shared ranges would be easy to implement
and wouldn't necessarily even need to interact with the memslots.  It could be a
consumer of memslots, e.g. if we wanted to disallow registering regions without an
associated memslot, but I think we'd want to avoid even that because things will
get messy during memslot updates, e.g. if dirty logging is toggled or a shared
memory region is temporarily removed then we wouldn't want to destroy the tracking.

I don't think we'd want to use a bitmap, e.g. for a well-behaved guest, XArray
should be far more efficient.

One benefit to explicitly tracking this in KVM is that it might be useful for
software-only protected VMs, e.g. KVM could mark a region in the XArray as "pending"
based on guest hypercalls to share/unshare memory, and then complete the transaction
when userspace invokes the ioctl() to complete the share/unshare.
Another upside to implementing a KVM ioctl is basically the reverse of the
discussion around avoiding double-allocations: *supporting* double-allocations.

One thing I noticed while testing SNP+UPM support is a fairly dramatic
slow-down with how it handles OVMF, which does some really nasty stuff
with DMA where it takes 1 or 2 pages and flips them between
shared/private on every transaction. Obviously that's not ideal and
should be fixed directly at some point, but it's something that exists in the
wild and might not be the only such instance where we need to deal with that
sort of usage pattern. 

With the current implementation, one option I had to address this was to
disable hole-punching in QEMU when doing shared->private conversions:

Boot time from 1GB guest:
                               SNP:   32s
                           SNP+UPM: 1m43s
  SNP+UPM (disable shared discard): 1m08s

Of course, we don't have the option of disabling discard/hole-punching
for private memory to see if we get similar gains there, since that also
doubles as the interface for doing private->shared conversions.
Private should be the same, minus time consumed for private memory, the
data should be close to SNP case. You can't try that in current version
due to we rely on the existence of the private page to tell a page is
private.
quoted
A separate
KVM ioctl to decouple these 2 things would allow for that, and allow for a
way for userspace to implement things like batched/lazy-discard of
previously-converted pages to deal with cases like these.
The planned ioctl includes two responsibilities:
  - Mark the range as private/shared
  - Zap the existing SLPT mapping for the range

Whether doing the hole-punching or not on the fd is unrelated to this
ioctl, userspace has freedom to do that or not. Since we don't reply on
the fact that private memoy should have been allocated, we can support
lazy faulting and don't need explicit fallocate(). That means, whether
the memory is discarded or not in the memory backing store is not
required by KVM, but be a userspace option.
Nice, that sounds promising.
quoted
Another motivator for these separate ioctl is that, since we're considering
'out-of-band' interactions with private memfd where userspace might
erroneously/inadvertently do things like double allocations, another thing it
might do is pre-allocating pages in the private memfd prior to associating
the memfd with a private memslot. Since the notifiers aren't registered until
that point, any associated callbacks that would normally need to be done as
part of those fallocate() notification would be missed unless we do something
like 'replay' all the notifications once the private memslot is registered and
associating with a memfile notifier. But that seems a bit ugly, and I'm not
sure how well that would work. This also seems to hint at this additional
'conversion' state being something that should be owned and managed directly
by KVM rather than hooking into the allocations.
Right, once we move the private/shared state into KVM then we don't rely
on those callbacks so the 'replay' thing is unneeded. fallocate()
notification is useless for sure, invalidate() is likely still needed,
just like the invalidate for mmu_notifier to bump the mmu_seq and do the
zap.
Ok, yah, makes sense that we'd still up needing the invalidation hooks.
quoted
It would also nicely solve the question of how to handle in-place
encryption, since unlike userspace, KVM is perfectly capable of copying
data from shared->private prior to conversion / guest start, and
disallowing such things afterward. Would just need an extra flag basically.
Agree it's possible to do additional copy during the conversion but I'm
not so confident this is urgent and the right API. Currently TDX does
not have this need. Maybe as the first step just add the conversion
itself. Adding additional feature like this can always be possible
whenever we are clear.
That seems fair. In the meantime we can adopt the approach proposed by
Sean and Vishal[1] and handle it directly in the relevant SNP KVM ioctls.

If we end up keeping that approach we'll probably want to make sure these
KVM-driven 'implicit' conversions are documented in the KVM/SNP API so that
userspace can account for it in it's view of what's private/shared. In this
case at least it's pretty obvious, just thinking of when other archs and
VMMs utilizing this more.

Thanks!

-Mike

[1] https://lore.kernel.org/kvm/20220524205646.1798325-4-vannapurve@google.com/T/#m1e9bb782b1bea66c36ae7c4c9f4f0c35c2d7e338 (local)
Thanks,
Chao
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help