Re: [PATCH v8 23/46] KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
From: Ackerley Tng <hidden>
Date: 2026-06-26 15:28:35
Also in:
kvm, linux-coco, linux-doc, linux-kselftest, linux-mm, lkml
Yan Zhao [off-list ref] writes:
On Thu, Jun 25, 2026 at 05:07:23PM -0700, Ackerley Tng wrote:quoted
Yan Zhao [off-list ref] writes:quoted
On Wed, Jun 24, 2026 at 04:00:32PM -0700, Ackerley Tng wrote:quoted
Sean Christopherson [off-list ref] writes:quoted
On Tue, Jun 23, 2026, Yan Zhao wrote:quoted
On Tue, Jun 23, 2026 at 01:16:14PM +0800, Yan Zhao wrote:quoted
On Mon, Jun 22, 2026 at 06:22:45PM -0700, Sean Christopherson wrote:quoted
On Mon, Jun 22, 2026, Yan Zhao wrote:quoted
On Thu, Jun 18, 2026 at 05:32:00PM -0700, Ackerley Tng via B4 Relay wrote:quoted
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c index ffe9d0db58c59..56d10333c61a7 100644 --- a/arch/x86/kvm/vmx/tdx.c +++ b/arch/x86/kvm/vmx/tdx.c@@ -3198,8 +3198,12 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm)) return -EIO; - if (!src_page) - return -EOPNOTSUPP; + if (!src_page) { + if (!gmem_in_place_conversion)When userspace turns on gmem_in_place_conversion while creating guest_memfd without the MMAP flag, the absence of src_page should still be treated as an error.Why MMAP?Hmm, I was showing a scenario that in-place conversion couldn't occur. I didn't mean that with the MMAP flag, mmap() and user write must occur.quoted
Shouldn't this be a general "if (!src_page && !up-to-date)"? Just because userspace _can_ mmap() the memory doesn't mean userspace _has_ mmap()'d and written memory. And when write() lands, MMAP wouldn't be necessary to initialize the memory.Do you mean using up-to-date flag as below?Yes? I didn't actually look at the implementation details.quoted
quoted
if (!src_page) { src_page = pfn_to_page(pfn); if (!folio_test_uptodate(page_folio(src_page))) return -EOPNOTSUPP; }Yan is right that with the earlier patch "Zero page while getting pfn", folio_test_uptodate() here will always return true. Actually, this is an alternative fix for the issue Sashiko pointed out on v7 where userspace can do a populate() (either TDX or SNP) without first allocating the page, with src_address == NULL, and leak uninitialized memory into the guest. Advantage of using the uptodate check in populate: if the host never allocates the page, populate doesn't incur zeroing before writing the page anyway in populate(). Disadvantage: Both TDX and SNP will have to implement this uptodate check. guest_memfd can't check centrally because for SNP, for a PAGE_TYPE_ZERO, !src_page should be allowed with a !uptodate page since firmware will zero and there's no leakage of uninitialized host memory?Another disadvantage: the uptodate flag is per-folio. What if the folio is only partially initialized by the userspace especially after huge page is supported?Good point on huge pages! The uptodate flag on the folio in guest_memfd means "this folio has been written to". As of now (before patch at [1]), this happens when + folio is zeroed on first use by userspace + folio is zeroed on first use of the guest + folio is populated When huge pages are supported, the folio can't partially be initialized? On allocation, if any part is shared, we split the page. The parts are separate folios that have their own uptodate flags. On splitting, if the huge page is uptodate, the split pages will also be uptodate. If the huge page is not uptodate, the split pages won't be uptodate, but that's ok since they will be marked uptodate on first use. On merging, the non-uptodate parts have to be zeroed and then markedIf that's true, it would be good.quoted
uptodate. Any parts that are in use would have been marked uptodate already, so there's no overwriting data that is in use. I'll need to think more about when it's safe to zero. I'm still on the fence between the two options 1. Using uptodate check in populate to reject src_pages that have never been written to or 2. Always zero before populate2 does not work? The flow is 1. mmap gmem_fd, make GFN shared, and write initial content. 2. convert GFN to private 3. invoke ioctl to trigger populate.
This flow is correct, is what users of in-place conversion should do. "Always" is the wrong word, I should have said "zero if not uptodate before populate", as in, with patch at [1]. By doing the zeroing in __kvm_gmem_get_pfn instead, by the time populate gets the pfn, the page would be zeroed, either because userspace faulted it in, and the zeroing happened in kvm_gmem_fault_user_mapping(), or if userspace never faulted it in, the zeroing would happen because populate() allocated the page.
quoted
but whether the uptodate flag is per-folio or not doesn't affect these two options in terms of fixing the leak of uninitialized host memory, right?yes, provided "On merging, the non-uptodate parts have to be zeroed and then marked uptodate".
Thank you so much for bringing this up, I hadn't considered this before. I'll do that when I get to guest_memfd hugepage restructuring.
quoted
quoted
quoted
quoted
quoted
Another concern with this fix is that: commit "KVM: guest_memfd: Zero page while getting pfn" [1] always marks the folio uptodate before reaching post_populate(). [1] https://lore.kernel.org/all/20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com/ (local)quoted
One concern is that TDX now does not much care about the up-to-date flag since TDX doesn't rely on the flag to clear pages on conversions. I'm not sure if the flag can be reliably checked in this case. e.g., now the whole folio is marked up-to-date even if only part of it is faulted by user access. Ensuring that the up-to-date flag works correctly with huge page support seems to have more effort than introducing a dedicated flag for TDX.quoted
quoted
Additionally, to properly enable in-place copying for the TDX initial memory region, userspace must not only specify source_addr to NULL, but also follow a specific sequence (where steps 1/2/3/7 are required only for in-place copy): 1. create guest_memfd with MMAP flag 2. mmap the guest_memfd. 3. convert the initial memory range to shared. 4. copy initial content to the source page. 5. convert the initial memory range to private 6. invoke ioctl KVM_TDX_INIT_MEM_REGION. 7. do not unmap the source backend. So, would it be reasonable to introduce a dedicated flag that allows userspace to explicitly opt into the in-place copy functionality? e.g.,Why? It's userspace's responsibility to get the above right. If userspace fails to provide a src_page when it doesn't want in-place copy, that's a userspace bug.Yan, is your concern that userspace forgot to update the code and forgets to provide a src_page, and if we keep the "Zero page whileYes. Previously, it would be rejected after GUP fails.I see, didn't realize previously it would be rejected because GUP fails. GUP failed because it wasn't faulted into the host?GUP fails if 0 is not a valid user address. But GUP would not fail if 0 is a valid address. e.g., in below scenario: #include <sys/mman.h> #include <stdio.h> int main(void) { void *p=mmap((void*)0,4096,PROT_READ|PROT_WRITE, MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS,-1,0); if (p==MAP_FAILED) { perror("mmap"); return 1; } *(char*)0='Y'; printf("addr0=%p val=%c\n",p,*(char*)0); return 0; }quoted
That's kind of orthogonal, I don't think GUP fail leading to rejecting populate was meant to help userspace catch these issues. GUP would also fail if the user did mmap(), write to it, unmap using madvise(MADV_DONTNEED), then forget and pass 0 as src_address.The original uAPI did not explicitly define 0 as an invalid uaddr. Whether 0 was rejected depended on whether the user mmap()'d address 0. If 0 was a valid mapping, populate() could proceed. commit 2a62345b3052 ("KVM: guest_memfd: GUP source pages prior to populating guest memory") changed the behavior though. It would return -EOPNOTSUPP for a 0 uaddr.
I see, I only looked at this after commit 2a62345b3052.
But if a user configures 0 uaddr as valid, writes to it, and then passes 0 as source_addr(not from gmem), I'm not sure if it's good for the kernel to silently treat 0 uaddr as an identifier for in-place copy from the private PFN in gmem.
I'd say the original uAPI perhaps just didn't document 0 as an unsupported uaddr. Given that commit 2a62345b3052 already merged, uAPI was perhaps accidentally changed and no customer complained, I think we can move forward with 0 as an invalid src_address? I wouldn't think anyone relies on 0 intentionally being a valid address. I could document that, if it helps?
quoted
quoted
quoted
getting pfn" patch, ends up with the guest silently having a zero page? I think that would be found quite early in userspace VMM testing...I actually encountered this during testing this patch. I update most code path to follow this sequence. However, still some corner ones for TDVF HOB, which are less obvious and harder to update. The TD just booted up and hang silently.I think this is just the life of a close-to-hardware software engineer :P no errors, got stuck somewhere, root cause is some unitialized thing.quoted
quoted
quoted
quoted
quoted
I mean if userspace specifies a NULL source_addr by mistake, it's better for kernel to detect this mistake, similar to how it validates whether source_addr is PAGE_ALIGNED.The alignment case is different. If userspace provides an unaligned value, KVM *can't* do what userspace is asking because hardware and thus KVM only supports converting on page boundaries. For a NULL source, KVM can still do what userspace is asking. Rejecting userspace's request would then be making assumptions about what userspace wants.Also, +1 on this, what if userspace, knowing that pages are zeroed on allocation, actually wants to rely on that to get a zero page in the guest?What if 0 uaddr is a valid address? :)quoted
quoted
quoted
quoted
Since userspace already needs to perform additional steps to enable in-place copy, specifying a dedicated flag to indicate that the NULL source_addr is intentional seems like a reasonable burden.I don't see how it adds any value. I wouldn't be at all surprised if most VMMs just wen up with code that does: if (in-place) { src = NULL; flags |= KVM_TDX_IN_PLACE_COPY_INITIAL_MEMORY_REGION; }