Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Jason Gunthorpe <jgg@nvidia.com>
Date: 2025-08-26 16:20:26
Also in:
linux-api, linux-fsdevel, linux-mm, lkml
On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
+ /*
+ * Most of the space should be taken by preserved folios. So take its
+ * size, plus a page for other properties.
+ */
+ fdt = memfd_luo_create_fdt(PAGE_ALIGN(preserved_size) + PAGE_SIZE);
+ if (!fdt) {
+ err = -ENOMEM;
+ goto err_unpin;
+ }
This doesn't seem to have any versioning scheme, it really should..
+ err = fdt_property_placeholder(fdt, "folios", preserved_size,
+ (void **)&preserved_folios);
+ if (err) {
+ pr_err("Failed to reserve folios property in FDT: %s\n",
+ fdt_strerror(err));
+ err = -ENOMEM;
+ goto err_free_fdt;
+ }
Yuk. This really wants some luo helper
'luo alloc array'
'luo restore array'
'luo free array'
Which would get a linearized list of pages in the vmap to hold the
array and then allocate some structure to record the page list and
return back the u64 of the phys_addr of the top of the structure to
store in whatever.
Getting fdt to allocate the array inside the fdt is just not going to
work for anything of size.
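To make the shape of those helpers concrete, here is a rough userspace sketch, not kernel code. Every name in it (mock_luo_alloc_array, mock_luo_array_elem, mock_luo_free_array, struct mock_array_desc) is made up for illustration. It assumes 4096-byte pages, uses aligned_alloc() in place of real page allocation, looks elements up per page instead of going through a vmap, and returns a pointer value where the real thing would return a phys_addr.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MOCK_PAGE_SIZE 4096u
#define MOCK_ELEMS_PER_PAGE (MOCK_PAGE_SIZE / sizeof(uint64_t))

/* Hypothetical descriptor recording which pages back the array. */
struct mock_array_desc {
	uint64_t nr_elems;
	uint64_t nr_pages;
	uint64_t pages[];	/* address of each backing page */
};

/* Free the backing pages, then the descriptor itself. */
void mock_luo_free_array(uint64_t handle)
{
	struct mock_array_desc *desc = (struct mock_array_desc *)(uintptr_t)handle;
	uint64_t i;

	if (!desc)
		return;
	for (i = 0; i < desc->nr_pages; i++)
		free((void *)(uintptr_t)desc->pages[i]);
	free(desc);
}

/* Allocate a zeroed array of u64 spread over whole pages; 0 on failure. */
uint64_t mock_luo_alloc_array(uint64_t nr_elems)
{
	uint64_t nr_pages = (nr_elems + MOCK_ELEMS_PER_PAGE - 1) / MOCK_ELEMS_PER_PAGE;
	struct mock_array_desc *desc;
	uint64_t i;

	desc = calloc(1, sizeof(*desc) + nr_pages * sizeof(desc->pages[0]));
	if (!desc)
		return 0;
	desc->nr_elems = nr_elems;
	for (i = 0; i < nr_pages; i++) {
		void *page = aligned_alloc(MOCK_PAGE_SIZE, MOCK_PAGE_SIZE);

		if (!page) {
			/* nr_pages counts only the pages allocated so far. */
			mock_luo_free_array((uint64_t)(uintptr_t)desc);
			return 0;
		}
		memset(page, 0, MOCK_PAGE_SIZE);
		desc->pages[i] = (uint64_t)(uintptr_t)page;
		desc->nr_pages = i + 1;
	}
	/* This single u64 is what the caller would serialize. */
	return (uint64_t)(uintptr_t)desc;
}

/* "Restore" side: turn the handle plus an index back into an element. */
uint64_t *mock_luo_array_elem(uint64_t handle, uint64_t idx)
{
	struct mock_array_desc *desc = (struct mock_array_desc *)(uintptr_t)handle;
	uint64_t *page;

	assert(desc && idx < desc->nr_elems);
	page = (uint64_t *)(uintptr_t)desc->pages[idx / MOCK_ELEMS_PER_PAGE];
	return &page[idx % MOCK_ELEMS_PER_PAGE];
}
```

The point is only that the caller ends up storing one u64, and the array size is no longer bounded by what fits in an FDT property.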
+ for (; i < nr_pfolios; i++) {
+ const struct memfd_luo_preserved_folio *pfolio = &pfolios[i];
+ phys_addr_t phys;
+ u64 index;
+ int flags;
+
+ if (!pfolio->foliodesc)
+ continue;
+
+ phys = PFN_PHYS(PRESERVED_FOLIO_PFN(pfolio->foliodesc));
+ folio = kho_restore_folio(phys);
+ if (!folio) {
+ pr_err("Unable to restore folio at physical address: %llx\n",
+ phys);
+ goto put_file;
+ }
+ index = pfolio->index;
+ flags = PRESERVED_FOLIO_FLAGS(pfolio->foliodesc);
+
+ /* Set up the folio for insertion. */
+ /*
+ * TODO: Should find a way to unify this and
+ * shmem_alloc_and_add_folio().
+ */
+ __folio_set_locked(folio);
+ __folio_set_swapbacked(folio);
+ ret = mem_cgroup_charge(folio, NULL, mapping_gfp_mask(mapping));
+ if (ret) {
+ pr_err("shmem: failed to charge folio index %d: %d\n",
+ i, ret);
+ goto unlock_folio;
+ }
[..]
+ folio_add_lru(folio);
+ folio_unlock(folio);
+ folio_put(folio);
+ }
Probably some consolidation will be needed to make this less
duplicated..
But overall I think just using the memfd_luo_preserved_folio as the
serialization is entirely fine, I don't think this needs anything more
complicated.
What it does need is an alternative to the FDT with versioning.
Which seems to me to be entirely fine as:
struct memfd_luo_v0 {
__aligned_u64 size;
__aligned_u64 pos;
__aligned_u64 folios;
};
struct memfd_luo_v0 memfd_luo_v0 = {.size = size, .pos = file->f_pos, .folios = folios};
luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
Which also shows the actual data needing to be serialized comes from
more than one struct and has to be marshaled in code, somehow, to a
single struct.
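As a rough illustration of that marshaling step, here is a userspace sketch. struct mock_file and struct mock_inode are stand-ins for the real kernel structures, mock_memfd_marshal_v0() is a made-up name, and uint64_t stands in for __aligned_u64. Taking the size from the inode is my assumption about where it would come from.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins; the real fields live in struct inode / struct file. */
struct mock_inode {
	uint64_t i_size;
};

struct mock_file {
	uint64_t f_pos;
	struct mock_inode *f_inode;
};

/* Same layout as the v0 struct above, with uint64_t for __aligned_u64. */
struct memfd_luo_v0 {
	uint64_t size;
	uint64_t pos;
	uint64_t folios;
};

/*
 * Gather the pieces from their separate homes into the one flat struct
 * that gets stored. folios_handle is whatever u64 identifies the
 * preserved folio array.
 */
struct memfd_luo_v0 mock_memfd_marshal_v0(const struct mock_file *file,
					  uint64_t folios_handle)
{
	struct memfd_luo_v0 out;

	assert(file && file->f_inode);
	out.size = file->f_inode->i_size;	/* from the inode */
	out.pos = file->f_pos;			/* from the file */
	out.folios = folios_handle;		/* from the folio array */
	return out;
}
```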
Then I imagine a fairly simple forwards/backwards story. If something
new is needed that is non-optional, let's say you compress the folios
list to optimize holes:
struct memfd_luo_v1 {
__aligned_u64 size;
__aligned_u64 pos;
__aligned_u64 folios_list_with_holes;
};
Obviously a v0 kernel cannot parse this, but in this case a v1 aware
kernel could optionally duplicate and write out the v0 format as well:
luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
luo_store_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1);
Then the rule is fairly simple, when the successor kernel goes to
deserialize it asks luo for the versions it supports:
if (luo_restore_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1))
	restore_v1(&memfd_luo_v1);
else if (luo_restore_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0))
	restore_v0(&memfd_luo_v0);
else
	luo_failure("Do not understand this");
luo core just manages this list of versioned data per serialized
object. There is only one version per object.
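To show the newest-first lookup end to end, here is a small userspace mock. None of it is a real luo API: mock_store_object() and mock_restore_object() only imitate the calls sketched above, the fixed-size table and the exact id/version/length match are assumptions, and uint64_t stands in for __aligned_u64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MOCK_MAX_ENTRIES 16
#define MOCK_MAX_PAYLOAD 64

/* One stored blob, keyed by (object id, version). Hypothetical layout. */
struct mock_entry {
	uint64_t id;
	uint32_t version;
	size_t len;
	unsigned char data[MOCK_MAX_PAYLOAD];
};

struct mock_luo {
	struct mock_entry entries[MOCK_MAX_ENTRIES];
	size_t nr;
};

struct memfd_v0 { uint64_t size, pos, folios; };
struct memfd_v1 { uint64_t size, pos, folios_list_with_holes; };

/* Record one versioned blob; 0 on success, -1 if it does not fit. */
int mock_store_object(struct mock_luo *luo, const void *obj, size_t len,
		      uint64_t id, uint32_t version)
{
	struct mock_entry *e;

	assert(luo && obj);
	if (luo->nr == MOCK_MAX_ENTRIES || len > MOCK_MAX_PAYLOAD)
		return -1;
	e = &luo->entries[luo->nr++];
	e->id = id;
	e->version = version;
	e->len = len;
	memcpy(e->data, obj, len);
	return 0;
}

/* Copy the blob out if id, version and length all match; 1 if found. */
int mock_restore_object(const struct mock_luo *luo, void *obj, size_t len,
			uint64_t id, uint32_t version)
{
	size_t i;

	for (i = 0; i < luo->nr; i++) {
		const struct mock_entry *e = &luo->entries[i];

		if (e->id == id && e->version == version && e->len == len) {
			memcpy(obj, e->data, len);
			return 1;
		}
	}
	return 0;
}

/* Try the newest understood version first; return it, or -1 if none. */
int mock_memfd_restore(const struct mock_luo *luo, uint64_t id,
		       uint64_t *size, uint64_t *pos)
{
	struct memfd_v1 v1;
	struct memfd_v0 v0;

	if (mock_restore_object(luo, &v1, sizeof(v1), id, 1)) {
		*size = v1.size;
		*pos = v1.pos;
		return 1;
	}
	if (mock_restore_object(luo, &v0, sizeof(v0), id, 0)) {
		*size = v0.size;
		*pos = v0.pos;
		return 0;
	}
	return -1;
}
```

With only a v0 blob stored, mock_memfd_restore() returns 0. Once a v1 blob is stored for the same id it returns 1, and an unknown id gives -1.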
Jason