Thread (153 messages) 153 messages, 23 authors, 2023-05-23

Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory

From: Chao Peng <hidden>
Date: 2022-12-21 13:43:41
Also in: kvm, linux-arch, linux-doc, linux-fsdevel, linux-mm, lkml, qemu-devel

On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote:
On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
quoted
On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
quoted
On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
quoted
quoted
[...]
quoted
+
+	/*
+	 * These pages are currently unmovable so don't place them into
movable
+	 * pageblocks (e.g. CMA and ZONE_MOVABLE).
+	 */
+	mapping = memfd->f_mapping;
+	mapping_set_unevictable(mapping);
+	mapping_set_gfp_mask(mapping,
+			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from
non-
movable zones, but doesn't necessarily prevent page from being migrated.  My
first glance is you need to implement either a_ops->migrate_folio() or just
get_page() after faulting in the page to prevent.
The current api restrictedmem_get_page() already does this, after the
caller calling it, it holds a reference to the page. The caller then
decides when to call put_page() appropriately.
I tried to dig some history. Perhaps I am missing something, but it seems Kirill
said in v9 that this code doesn't prevent page migration, and we need to
increase page refcount in restrictedmem_get_page():

https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47qpjl@box.shutemov.name/ (local)

But looking at this series it seems restrictedmem_get_page() in this v10 is
identical to the one in v9 (except v10 uses 'folio' instead of 'page')?
restrictedmem_get_page() increases page refcount several versions ago so
no change in v10 is needed. You probably missed my reply:

https://lore.kernel.org/linux-mm/20221129135844.GA902164@chaop.bj.intel.com/ (local)
But for non-restricted-mem case, it is correct for KVM to decrease page's
refcount after setting up mapping in the secondary mmu, otherwise the page will
be pinned by KVM for normal VM (since KVM uses GUP to get the page).
That's true. Actually even true for restrictedmem case, most likely we
will still need the kvm_release_pfn_clean() for KVM generic code. On one
side, other restrictedmem users like pKVM may not require page pinning
at all. On the other side, see below.
So what we are expecting is: for KVM if the page comes from restricted mem, then
KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should.
I argue that this page pinning (or page migration prevention) is not
tied to where the page comes from, instead related to how the page will
be used. Whether the page is restrictedmem backed or GUP() backed, once
it's used by current version of TDX then the page pinning is needed. So
such page migration prevention is really TDX thing, even not KVM generic
thing (that's why I think we don't need change the existing logic of
kvm_release_pfn_clean()). Wouldn't better to let TDX code (or who
requires that) to increase/decrease the refcount when it populates/drops
the secure EPT entries? This is exactly what the current TDX code does:

get_page():
https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1217

put_page():
https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1334

Thanks,
Chao
quoted
The current solution is clear: unless we have better approach, we will
let restrictedmem user (KVM in this case) to hold the refcount to
prevent page migration.
OK.  Will leave to others :)
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help