Thread (164 messages) 164 messages, 20 authors, 2023-05-25

Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

From: Kirill A . Shutemov <hidden>
Date: 2022-08-18 13:24:55
Also in: kvm, linux-doc, linux-fsdevel, linux-kselftest, linux-mm, lkml, qemu-devel

On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
On Wed, 6 Jul 2022, Chao Peng wrote:
quoted
This is the v7 of this series which tries to implement the fd-based KVM
guest private memory.
Here at last are my reluctant thoughts on this patchset.

fd-based approach for supporting KVM guest private memory: fine.

Use or abuse of memfd and shmem.c: mistaken.

memfd_create() was an excellent way to put together the initial prototype.

But since then, TDX in particular has forced an effort into preventing
(by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.

Are any of the shmem.c mods useful to existing users of shmem.c? No.
Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.

What use do you have for a filesystem here?  Almost none.
IIUC, what you want is an fd through which QEMU can allocate kernel
memory, selectively free that memory, and communicate fd+offset+length
to KVM.  And perhaps an interface to initialize a little of that memory
from a template (presumably copied from a real file on disk somewhere).

You don't need shmem.c or a filesystem for that!

If your memory could be swapped, that would be enough of a good reason
to make use of shmem.c: but it cannot be swapped; and although there
are some references in the mailthreads to it perhaps being swappable
in future, I get the impression that will not happen soon if ever.

If your memory could be migrated, that would be some reason to use
filesystem page cache (because page migration happens to understand
that type of memory): but it cannot be migrated.
Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping
theoretically possible, but I'm not aware of any plans as of now.

[1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
Some of these impressions may come from earlier iterations of the
patchset (v7 looks better in several ways than v5).  I am probably
underestimating the extent to which you have taken on board other
usages beyond TDX and SEV private memory, and rightly want to serve
them all with similar interfaces: perhaps there is enough justification
for shmem there, but I don't see it.  There was mention of userfaultfd
in one link: does that provide the justification for using shmem?

I'm afraid of the special demands you may make of memory allocation
later on - surprised that huge pages are not mentioned already;
gigantic contiguous extents? secretmem removed from direct map?
The design allows for extension to hugetlbfs if needed. Combination of
MFD_INACCESSIBLE | MFD_HUGETLB should route this way. There should be zero
implications for shmem. It is going to be separate struct memfile_backing_store.

I'm not sure secretmem is a fit here as we want to extend MFD_INACCESSIBLE
to be movable if platform supports it and secretmem is not migratable by
design (without direct mapping fragmentations).
Here's what I would prefer, and imagine much easier for you to maintain;
but I'm no system designer, and may be misunderstanding throughout.

QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
the fallocate syscall interface itself) to allocate and free the memory,
ioctl for initializing some of it too.  KVM in control of whether that
fd can be read or written or mmap'ed or whatever, no need to prevent it
in shmem.c, no need for flags, seals, notifications to and fro because
KVM is already in control and knows the history.  If shmem actually has
value, call into it underneath - somewhat like SysV SHM, and /dev/zero
mmap, and i915/gem make use of it underneath.  If shmem has nothing to
add, just allocate and free kernel memory directly, recorded in your
own xarray.
I guess shim layer on top of shmem *can* work. I don't see immediately why
it would not. But I'm not sure it is right direction. We risk creating yet
another parallel VM with own rules/locking/accounting that opaque to
core-mm.

Note that on machines that run TDX guests such memory would likely be the
bulk of memory use. Treating it as a fringe case may bite us one day.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help