Thread (160 messages) 160 messages, 12 authors, 8h ago

Re: [PATCH v8 23/46] KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION

From: Ackerley Tng <hidden>
Date: 2026-06-30 00:00:05
Also in: kvm, linux-coco, linux-doc, linux-kselftest, linux-mm, lkml

Yan Zhao [off-list ref] writes:
On Fri, Jun 26, 2026 at 08:28:32AM -0700, Ackerley Tng wrote:
quoted
Yan Zhao [off-list ref] writes:
quoted
On Thu, Jun 25, 2026 at 05:07:23PM -0700, Ackerley Tng wrote:
quoted
Yan Zhao [off-list ref] writes:
quoted
On Wed, Jun 24, 2026 at 04:00:32PM -0700, Ackerley Tng wrote:
quoted
Sean Christopherson [off-list ref] writes:
quoted
On Tue, Jun 23, 2026, Yan Zhao wrote:
quoted
On Tue, Jun 23, 2026 at 01:16:14PM +0800, Yan Zhao wrote:
quoted
On Mon, Jun 22, 2026 at 06:22:45PM -0700, Sean Christopherson wrote:
quoted
On Mon, Jun 22, 2026, Yan Zhao wrote:
quoted
On Thu, Jun 18, 2026 at 05:32:00PM -0700, Ackerley Tng via B4 Relay wrote:
quoted
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ffe9d0db58c59..56d10333c61a7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3198,8 +3198,12 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
 		return -EIO;

-	if (!src_page)
-		return -EOPNOTSUPP;
+	if (!src_page) {
+		if (!gmem_in_place_conversion)
When userspace turns on gmem_in_place_conversion while creating guest_memfd
without the MMAP flag, the absence of src_page should still be treated as an
error.
Why MMAP?
Hmm, I was showing a scenario that in-place conversion couldn't occur.
I didn't mean that with the MMAP flag, mmap() and user write must occur.
quoted
Shouldn't this be a general "if (!src_page && !up-to-date)"?  Just
because userspace _can_ mmap() the memory doesn't mean userspace _has_ mmap()'d
and written memory.  And when write() lands, MMAP wouldn't be necessary to
initialize the memory.
Do you mean using up-to-date flag as below?
Yes?  I didn't actually look at the implementation details.
quoted
quoted
if (!src_page) {
	src_page = pfn_to_page(pfn);
	if (!folio_test_uptodate(page_folio(src_page)))
		return -EOPNOTSUPP;
}
Yan is right that with the earlier patch "Zero page while getting pfn",
folio_test_uptodate() here will always return true.

Actually, this is an alternative fix for the issue Sashiko pointed out
on v7 where userspace can do a populate() (either TDX or SNP) without
first allocating the page, with src_address == NULL, and leak
uninitialized memory into the guest.

Advantage of using the uptodate check in populate: if the host never
allocates the page, populate doesn't incur zeroing before writing the
page anyway in populate().

Disadvantage: Both TDX and SNP will have to implement this uptodate
check. guest_memfd can't check centrally because for SNP, for a
PAGE_TYPE_ZERO, !src_page should be allowed with a !uptodate page since
firmware will zero and there's no leakage of uninitialized host memory?
Another disadvantage: the uptodate flag is per-folio. What if the folio
is only partially initialized by the userspace especially after huge page is
supported?
Good point on huge pages!

The uptodate flag on the folio in guest_memfd means "this folio has been
written to". As of now (before patch at [1]), this happens when

+ folio is zeroed on first use by userspace
+ folio is zeroed on first use of the guest
+ folio is populated

When huge pages are supported, the folio can't partially be initialized?

On allocation, if any part is shared, we split the page. The parts are
separate folios that have their own uptodate flags.

On splitting, if the huge page is uptodate, the split pages will also be
uptodate. If the huge page is not uptodate, the split pages won't be
uptodate, but that's ok since they will be marked uptodate on first use.

On merging, the non-uptodate parts have to be zeroed and then marked
If that's true, it would be good.
quoted
uptodate. Any parts that are in use would have been marked uptodate
already, so there's no overwriting data that is in use. I'll need to
think more about when it's safe to zero.

I'm still on the fence between the two options

1. Using uptodate check in populate to reject src_pages that have never
   been written to or
2. Always zero before populate
2 does not work?
The flow is
1. mmap gmem_fd, make GFN shared, and write initial content.
2. convert GFN to private
3. invoke ioctl to trigger populate.
This flow is correct, is what users of in-place conversion should do.

"Always" is the wrong word, I should have said "zero if not uptodate
before populate", as in, with patch at [1].

By doing the zeroing in __kvm_gmem_get_pfn instead, by the time populate
gets the pfn, the page would be zeroed, either because userspace faulted
it in, and the zeroing happened in kvm_gmem_fault_user_mapping(), or if
userspace never faulted it in, the zeroing would happen because
populate() allocated the page.
I see.
quoted
quoted
quoted
but whether the uptodate flag is per-folio or not doesn't affect these
two options in terms of fixing the leak of uninitialized host memory,
right?
yes, provided "On merging, the non-uptodate parts have to be zeroed and then
marked uptodate".
Thank you so much for bringing this up, I hadn't considered this
before. I'll do that when I get to guest_memfd hugepage restructuring.
quoted
quoted
quoted
quoted
quoted
quoted
Another concern with this fix is that:
commit "KVM: guest_memfd: Zero page while getting pfn" [1] always marks the
folio uptodate before reaching post_populate().

[1] https://lore.kernel.org/all/20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com/ (local)
quoted
One concern is that TDX now does not much care about the up-to-date flag since
TDX doesn't rely on the flag to clear pages on conversions.
I'm not sure if the flag can be reliably checked in this case. e.g.,
now the whole folio is marked up-to-date even if only part of it is faulted by
user access.
Ensuring that the up-to-date flag works correctly with huge page support seems
to have more effort than introducing a dedicated flag for TDX.
quoted
quoted
Additionally, to properly enable in-place copying for the TDX initial memory
region, userspace must not only specify source_addr to NULL, but also follow
a specific sequence (where steps 1/2/3/7 are required only for in-place copy):
1. create guest_memfd with MMAP flag
2. mmap the guest_memfd.
3. convert the initial memory range to shared.
4. copy initial content to the source page.
5. convert the initial memory range to private
6. invoke ioctl KVM_TDX_INIT_MEM_REGION.
7. do not unmap the source backend.

So, would it be reasonable to introduce a dedicated flag that allows userspace
to explicitly opt into the in-place copy functionality? e.g.,
Why?  It's userspace's responsibility to get the above right.  If userspace fails
to provide a src_page when it doesn't want in-place copy, that's a userspace bug.
Yan, is your concern that userspace forgot to update the code and
forgets to provide a src_page, and if we keep the "Zero page while
Yes. Previously, it would be rejected after GUP fails.
I see, didn't realize previously it would be rejected because GUP
fails. GUP failed because it wasn't faulted into the host?
GUP fails if 0 is not a valid user address.
But GUP would not fail if 0 is a valid address. e.g., in below scenario:

#include <sys/mman.h>
#include <stdio.h>
int main(void)
{
        void *p=mmap((void*)0,4096,PROT_READ|PROT_WRITE, MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS,-1,0);
        if (p==MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        *(char*)0='Y';
        printf("addr0=%p val=%c\n",p,*(char*)0);
        return 0;
}

quoted
That's kind of orthogonal, I don't think GUP fail leading to rejecting
populate was meant to help userspace catch these issues. GUP would also
fail if the user did mmap(), write to it, unmap using
madvise(MADV_DONTNEED), then forget and pass 0 as src_address.
The original uAPI did not explicitly define 0 as an invalid uaddr. Whether 0 was
rejected depended on whether the user mmap()'d address 0. If 0 was a valid
mapping, populate() could proceed.

commit 2a62345b3052 ("KVM: guest_memfd: GUP source pages prior to populating
guest memory") changed the behavior though. It would return -EOPNOTSUPP for a 0
uaddr.
I see, I only looked at this after commit 2a62345b3052.
quoted
But if a user configures 0 uaddr as valid, writes to it, and then passes 0 as
source_addr(not from gmem), I'm not sure if it's good for the kernel to silently
treat 0 uaddr as an identifier for in-place copy from the private PFN in gmem.
I'd say the original uAPI perhaps just didn't document 0 as an
unsupported uaddr. Given that commit 2a62345b3052 already merged, uAPI
was perhaps accidentally changed and no customer complained, I think we
can move forward with 0 as an invalid src_address? I wouldn't think
anyone relies on 0 intentionally being a valid address.

I could document that, if it helps?
What about just documenting that 0 is an unsupported uaddr which will be
re-purposed as an indicator to use the target pfn as the source, regardless of
whether gmem_in_place_conversion is true? i.e.,

if (!src_page)
	src_page = pfn_to_page(pfn);

I don't get why the two scenarios should be treated differently:
1. gmem_in_place_conversion==true, shared memory is not from gmem
2. gmem_in_place_conversion==false, shared memory is not from gmem

In both case, a 0 uaddr could be mapped to a valid page not from gmem.
This is true, but this check isn't about whether the page is from gmem.
So why not update the uAPI to handle both cases consistently? :)
Wait, but before this series, if region.src_address = 0, src_page = NULL
and that's not supported so it returns -EOPNOTSUPP.

If that's dropped, then suddenly if region.src_address = 0 and
!gmem_in_place_conversion, tdx_gmem_post_populate() will now load the
memory (zeroed) after [1] into the guest? I don't think we want to
change that behavior.

I could document that 0 is an unsupported uaddr only for TDX, and only
when gmem_in_place_conversion = false.

Since it is unsupported only when gmem_in_place_conversion = false, the
check two lines marked with <<==== can't go away?

	if (!src_page) {
		if (!gmem_in_place_conversion)  <<====
			return -EOPNOTSUPP;     <<====

		src_page = pfn_to_page(pfn);
	}

Also, for SNP, src_address == 0 is permitted (and desired, I believe, to
avoid a pointless kernel memcpy) if the type of population is
KVM_SEV_SNP_PAGE_TYPE_ZERO.
quoted
quoted
quoted
quoted
quoted
getting pfn" patch, ends up with the guest silently having a zero page?
I think that would be found quite early in userspace VMM testing...
[...snip...]
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help