Re: [PATCH v1 11/26] x86/sev: Invalidate pages from the direct map when adding them to the RMP table
From: Michael Roth <hidden>
Date: 2024-01-16 16:51:46
Also in:
kvm, linux-crypto, linux-mm, lkml
On Tue, Jan 16, 2024 at 10:19:09AM -0600, Michael Roth wrote:
I did some performance tests which do seem to indicate that
pre-splitting the directmap to 4K can substantially improve certain
SNP guest workloads. This test involves running a single 1TB SNP guest
with 128 vCPUs running "stress --vm 128 --vm-bytes 5G --vm-keep" to
rapidly fault in all of its memory via lazy acceptance, and then
measuring the rate that gmem pages are being allocated on the host by
monitoring "FileHugePages" from /proc/meminfo to get some rough gauge
of how quickly a guest can fault in its initial working set prior to
reaching steady state. The data is a bit noisy but seems to indicate
significant improvement by taking the directmap updates out of the
lazy acceptance path, and I would only expect that to become more
significant as you scale up the number of guests / vCPUs.
# Average fault-in rate across 3 runs, measured in GB/s
                     unpinned | pinned to NUMA node 0
DirectMap4K            12.9   |   12.1
  stddev                2.2   |    1.3
DirectMap2M+split       8.0   |    8.9
  stddev                1.3   |    0.8
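
For anyone wanting to reproduce the measurement, here is a rough sketch of
how the FileHugePages sampling described above could be scripted. This is
not the script used for the numbers in the table; the function names, the
1-second sampling interval, and the choice of GiB (2^30 bytes) as the unit
are all assumptions made for illustration.

```python
import re
import time


def file_huge_pages_kb(meminfo_text):
    """Return the FileHugePages value (in kB) from /proc/meminfo contents."""
    match = re.search(r"^FileHugePages:\s+(\d+)\s+kB$", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("FileHugePages not found in meminfo text")
    return int(match.group(1))


def fault_in_rate_gib_per_s(kb_start, kb_end, seconds):
    """Convert a FileHugePages delta (kB) over an interval into GiB/s.

    Assumption: GiB (2**30 bytes) is used here; the table above only says GB/s.
    """
    return (kb_end - kb_start) / (1024 * 1024) / seconds


def sample_rate(interval_s=1.0, path="/proc/meminfo"):
    """Read FileHugePages twice, interval_s apart, and return the rate."""
    with open(path) as f:
        start = file_huge_pages_kb(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        end = file_huge_pages_kb(f.read())
    return fault_in_rate_gib_per_s(start, end, interval_s)
```

Calling sample_rate() in a loop on the host while the guest runs the
stress workload gives a series of rates that can then be averaged per run.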
The downside of course is potential impact for non-SNP workloads
resulting from splitting the directmap. Mike Rapoport's numbers make
me feel a little better about it, but I don't think they apply directly
to the notion of splitting the entire directmap. Even the LWN article
summarizes:
"The conclusion from all of this, Rapoport continued, was that
direct-map fragmentation just does not matter — for data access, at
least. Using huge-page mappings does still appear to make a difference
for memory containing the kernel code, so allocator changes should
focus on code allocations — improving the layout of allocations for
loadable modules, for example, or allowing vmalloc() to allocate huge
pages for code. But, for kernel-data allocations, direct-map
fragmentation simply appears to not be worth worrying about."
So at the very least, if we went down this path, it would be worth
investigating the following areas in addition to general perf testing:
1) Only splitting directmap regions corresponding to kernel-allocatable
*data* (hopefully that's even feasible...)
2) Potentially deferring the split until an SNP guest is actually
run, so there isn't any impact just from having SNP enabled (though
you still take a hit from RMP checks in that case so maybe it's not
worthwhile, but that itself has been noted as a concern for users
so it would be nice to not make things even worse).

There's another potential area of investigation I forgot to mention that
doesn't involve pre-splitting the directmap. It makes use of the fact
that the kernel should never be accessing a 2MB mapping that overlaps
with private guest memory if the backing PFN for the guest memory is a
2MB page. Since there's no chance for overlap (well, maybe via a 1GB
directmap entry, but forcing those down to 2M is not as dramatic a
change), there's no need to actually split the directmap entry in these
cases, since they won't result in unexpected RMP faults.

So if pre-splitting the directmap ends up having too many downsides,
then there may still be some potential for optimizing the current
approach to a fair degree.

-Mike
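
To spell out the split-avoidance reasoning above, here is a minimal sketch
of the decision as a pure function. It is not kernel code: needs_split()
and its parameters are hypothetical names, and it only models the sizes
involved (in 4K pages), not the actual directmap or RMP data structures.

```python
PAGES_PER_2M = 512          # 4K pages covered by one 2MB mapping
PAGES_PER_1G = 512 * 512    # 4K pages covered by one 1GB mapping


def needs_split(guest_is_2m, directmap_entry_pages):
    """Return True if the directmap entry must be split.

    guest_is_2m: True if the private guest memory is backed by a 2MB page,
        False if it is backed by a single 4K page.
    directmap_entry_pages: size, in 4K pages, of the directmap entry that
        currently maps the guest page (1, PAGES_PER_2M, or PAGES_PER_1G).

    A split is only needed when the directmap entry also covers memory
    outside the private range, i.e. when it is larger than the guest page.
    """
    guest_pages = PAGES_PER_2M if guest_is_2m else 1
    return directmap_entry_pages > guest_pages


# 4K private page under a 2MB directmap entry: the entry overlaps
# non-private memory, so it has to be split.
print(needs_split(False, PAGES_PER_2M))   # True
# 2MB private page under a 2MB directmap entry: no overlap, no split.
print(needs_split(True, PAGES_PER_2M))    # False
# 2MB private page under a 1GB entry: still needs a split, but only to 2M.
print(needs_split(True, PAGES_PER_1G))    # True
```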