Thread (154 messages) 154 messages, 12 authors, 1d ago

Re: [PATCH v8 23/46] KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION

From: Ackerley Tng <hidden>
Date: 2026-06-26 15:28:35
Also in: kvm, linux-coco, linux-doc, linux-kselftest, linux-mm, lkml

Yan Zhao [off-list ref] writes:
On Thu, Jun 25, 2026 at 05:07:23PM -0700, Ackerley Tng wrote:
quoted
Yan Zhao [off-list ref] writes:
quoted
On Wed, Jun 24, 2026 at 04:00:32PM -0700, Ackerley Tng wrote:
quoted
Sean Christopherson [off-list ref] writes:
quoted
On Tue, Jun 23, 2026, Yan Zhao wrote:
quoted
On Tue, Jun 23, 2026 at 01:16:14PM +0800, Yan Zhao wrote:
quoted
On Mon, Jun 22, 2026 at 06:22:45PM -0700, Sean Christopherson wrote:
quoted
On Mon, Jun 22, 2026, Yan Zhao wrote:
quoted
On Thu, Jun 18, 2026 at 05:32:00PM -0700, Ackerley Tng via B4 Relay wrote:
quoted
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ffe9d0db58c59..56d10333c61a7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3198,8 +3198,12 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
 		return -EIO;

-	if (!src_page)
-		return -EOPNOTSUPP;
+	if (!src_page) {
+		if (!gmem_in_place_conversion)
When userspace turns on gmem_in_place_conversion while creating guest_memfd
without the MMAP flag, the absence of src_page should still be treated as an
error.
Why MMAP?
Hmm, I was showing a scenario that in-place conversion couldn't occur.
I didn't mean that with the MMAP flag, mmap() and user write must occur.
quoted
Shouldn't this be a general "if (!src_page && !up-to-date)"?  Just
because userspace _can_ mmap() the memory doesn't mean userspace _has_ mmap()'d
and written memory.  And when write() lands, MMAP wouldn't be necessary to
initialize the memory.
Do you mean using up-to-date flag as below?
Yes?  I didn't actually look at the implementation details.
quoted
quoted
if (!src_page) {
	src_page = pfn_to_page(pfn);
	if (!folio_test_uptodate(page_folio(src_page)))
		return -EOPNOTSUPP;
}
Yan is right that with the earlier patch "Zero page while getting pfn",
folio_test_uptodate() here will always return true.

Actually, this is an alternative fix for the issue Sashiko pointed out
on v7 where userspace can do a populate() (either TDX or SNP) without
first allocating the page, with src_address == NULL, and leak
uninitialized memory into the guest.

Advantage of using the uptodate check in populate: if the host never
allocates the page, populate doesn't incur zeroing before writing the
page anyway in populate().

Disadvantage: Both TDX and SNP will have to implement this uptodate
check. guest_memfd can't check centrally because for SNP, for a
PAGE_TYPE_ZERO, !src_page should be allowed with a !uptodate page since
firmware will zero and there's no leakage of uninitialized host memory?
Another disadvantage: the uptodate flag is per-folio. What if the folio
is only partially initialized by the userspace especially after huge page is
supported?
Good point on huge pages!

The uptodate flag on the folio in guest_memfd means "this folio has been
written to". As of now (before patch at [1]), this happens when

+ folio is zeroed on first use by userspace
+ folio is zeroed on first use of the guest
+ folio is populated

When huge pages are supported, the folio can't partially be initialized?

On allocation, if any part is shared, we split the page. The parts are
separate folios that have their own uptodate flags.

On splitting, if the huge page is uptodate, the split pages will also be
uptodate. If the huge page is not uptodate, the split pages won't be
uptodate, but that's ok since they will be marked uptodate on first use.

On merging, the non-uptodate parts have to be zeroed and then marked
If that's true, it would be good.
quoted
uptodate. Any parts that are in use would have been marked uptodate
already, so there's no overwriting data that is in use. I'll need to
think more about when it's safe to zero.

I'm still on the fence between the two options

1. Using uptodate check in populate to reject src_pages that have never
   been written to or
2. Always zero before populate
2 does not work?
The flow is
1. mmap gmem_fd, make GFN shared, and write initial content.
2. convert GFN to private
3. invoke ioctl to trigger populate.
This flow is correct, is what users of in-place conversion should do.

"Always" is the wrong word, I should have said "zero if not uptodate
before populate", as in, with patch at [1].

By doing the zeroing in __kvm_gmem_get_pfn instead, by the time populate
gets the pfn, the page would be zeroed, either because userspace faulted
it in, and the zeroing happened in kvm_gmem_fault_user_mapping(), or if
userspace never faulted it in, the zeroing would happen because
populate() allocated the page.
quoted
but whether the uptodate flag is per-folio or not doesn't affect these
two options in terms of fixing the leak of uninitialized host memory,
right?
yes, provided "On merging, the non-uptodate parts have to be zeroed and then
marked uptodate".
Thank you so much for bringing this up, I hadn't considered this
before. I'll do that when I get to guest_memfd hugepage restructuring.
quoted
quoted
quoted
quoted
quoted
Another concern with this fix is that:
commit "KVM: guest_memfd: Zero page while getting pfn" [1] always marks the
folio uptodate before reaching post_populate().

[1] https://lore.kernel.org/all/20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com/ (local)
quoted
One concern is that TDX now does not much care about the up-to-date flag since
TDX doesn't rely on the flag to clear pages on conversions.
I'm not sure if the flag can be reliably checked in this case. e.g.,
now the whole folio is marked up-to-date even if only part of it is faulted by
user access.
Ensuring that the up-to-date flag works correctly with huge page support seems
to have more effort than introducing a dedicated flag for TDX.
quoted
quoted
Additionally, to properly enable in-place copying for the TDX initial memory
region, userspace must not only specify source_addr to NULL, but also follow
a specific sequence (where steps 1/2/3/7 are required only for in-place copy):
1. create guest_memfd with MMAP flag
2. mmap the guest_memfd.
3. convert the initial memory range to shared.
4. copy initial content to the source page.
5. convert the initial memory range to private
6. invoke ioctl KVM_TDX_INIT_MEM_REGION.
7. do not unmap the source backend.

So, would it be reasonable to introduce a dedicated flag that allows userspace
to explicitly opt into the in-place copy functionality? e.g.,
Why?  It's userspace's responsibility to get the above right.  If userspace fails
to provide a src_page when it doesn't want in-place copy, that's a userspace bug.
Yan, is your concern that userspace forgot to update the code and
forgets to provide a src_page, and if we keep the "Zero page while
Yes. Previously, it would be rejected after GUP fails.
I see, didn't realize previously it would be rejected because GUP
fails. GUP failed because it wasn't faulted into the host?
GUP fails if 0 is not a valid user address.
But GUP would not fail if 0 is a valid address. e.g., in below scenario:

#include <sys/mman.h>
#include <stdio.h>
int main(void)
{
        void *p=mmap((void*)0,4096,PROT_READ|PROT_WRITE, MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS,-1,0);
        if (p==MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        *(char*)0='Y';
        printf("addr0=%p val=%c\n",p,*(char*)0);
        return 0;
}

quoted
That's kind of orthogonal, I don't think GUP fail leading to rejecting
populate was meant to help userspace catch these issues. GUP would also
fail if the user did mmap(), write to it, unmap using
madvise(MADV_DONTNEED), then forget and pass 0 as src_address.
The original uAPI did not explicitly define 0 as an invalid uaddr. Whether 0 was
rejected depended on whether the user mmap()'d address 0. If 0 was a valid
mapping, populate() could proceed.

commit 2a62345b3052 ("KVM: guest_memfd: GUP source pages prior to populating
guest memory") changed the behavior though. It would return -EOPNOTSUPP for a 0
uaddr.
I see, I only looked at this after commit 2a62345b3052.
But if a user configures 0 uaddr as valid, writes to it, and then passes 0 as
source_addr(not from gmem), I'm not sure if it's good for the kernel to silently
treat 0 uaddr as an identifier for in-place copy from the private PFN in gmem.
I'd say the original uAPI perhaps just didn't document 0 as an
unsupported uaddr. Given that commit 2a62345b3052 already merged, uAPI
was perhaps accidentally changed and no customer complained, I think we
can move forward with 0 as an invalid src_address? I wouldn't think
anyone relies on 0 intentionally being a valid address.

I could document that, if it helps?
quoted
quoted
quoted
getting pfn" patch, ends up with the guest silently having a zero page?
I think that would be found quite early in userspace VMM testing...
I actually encountered this during testing this patch.
I update most code path to follow this sequence. However, still some corner ones
for TDVF HOB, which are less obvious and harder to update.
The TD just booted up and hang silently.
I think this is just the life of a close-to-hardware software engineer
:P no errors, got stuck somewhere, root cause is some unitialized
thing.
quoted
quoted
quoted
quoted
quoted
I mean if userspace specifies a NULL source_addr by mistake, it's better for
kernel to detect this mistake, similar to how it validates whether source_addr
is PAGE_ALIGNED.
The alignment case is different.  If userspace provides an unaligned value, KVM
*can't* do what userspace is asking because hardware and thus KVM only supports
converting on page boundaries.

For a NULL source, KVM can still do what userspace is asking.  Rejecting userspace's
request would then be making assumptions about what userspace wants.
Also, +1 on this, what if userspace, knowing that pages are zeroed on
allocation, actually wants to rely on that to get a zero page in the guest?
What if 0 uaddr is a valid address? :)
quoted
quoted
quoted
quoted
Since userspace already needs to perform additional steps to enable in-place
copy, specifying a dedicated flag to indicate that the NULL source_addr is
intentional seems like a reasonable burden.
I don't see how it adds any value.  I wouldn't be at all surprised if most VMMs
just wen up with code that does:

	if (in-place) {
		src = NULL;
		flags |= KVM_TDX_IN_PLACE_COPY_INITIAL_MEMORY_REGION;
	}
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help