Re: [PATCH v3 00/22] Add support for shared PTEs across processes

From: Kalesh Singh <hidden>
Date: 2026-02-25 22:54:02
Also in: linux-arch, linux-mm, lkml

On Mon, Feb 23, 2026 at 11:59 AM [off-list ref] wrote:


On 2/23/26 9:43 AM, Kalesh Singh wrote:

quoted

On Sat, Feb 21, 2026 at 4:40 AM Pedro Falcato [off-list ref] wrote:

quoted

On Fri, Feb 20, 2026 at 01:35:58PM -0800, Kalesh Singh wrote:

quoted

On Tue, Aug 19, 2025 at 6:57 PM Anthony Yznaga
[off-list ref] wrote:

quoted

Memory pages shared between processes require page table entries
(PTEs) for each process. Each of these PTEs consumes some of
the memory and as long as the number of mappings being maintained
is small enough, this space consumed by page tables is not
objectionable. When very few memory pages are shared between
processes, the number of PTEs to maintain is mostly constrained by
the number of pages of memory on the system. As the number of shared
pages and the number of times pages are shared goes up, the amount of
memory consumed by page tables starts to become significant. This
issue does not apply to threads. Any number of threads can share the
same pages inside a process while sharing the same PTEs. Extending
this same model to sharing pages across processes can eliminate this
issue for sharing across processes as well.

<snip>

Hi Anthony,

Thanks for continuing to push this forward, and apologies for joining
this discussion late. I am likely missing some context from the
various previous iterations of this feature, but I'd like to throw
another use case into the mix to be considered around the design of
the sharing API.

We are exploring a similar optimization for Android to reduce page
table overhead. In Android, we preload many ELF mappings in the Zygote
process to help application launch times. Since the Zygote model is
fork-but-no-exec, all applications inherit these mappings, which can
result in upwards of 200 MB of redundant page table overhead per
device.
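
For a sense of scale, here is a rough back-of-the-envelope sketch of how redundant last-level page tables add up. It assumes 4 KiB pages and 8-byte PTEs, ignores upper-level tables, and treats every mapped page as populated, so it gives an upper bound. The function names and the example sizes are hypothetical and are not measurements from Android.

```c
#include <stdint.h>

/*
 * Bytes of last-level page table needed to map mapped_bytes in one
 * process. Assumptions (not from the thread): 4 KiB pages, 8-byte PTEs,
 * every page populated, upper-level tables ignored.
 */
uint64_t pte_bytes_per_process(uint64_t mapped_bytes)
{
    const uint64_t page_size = 4096;
    const uint64_t pte_size = 8;
    uint64_t pages = (mapped_bytes + page_size - 1) / page_size;

    return pages * pte_size;
}

/*
 * PTE bytes that would be saved if nprocs processes shared one copy of
 * the tables instead of each keeping its own: (nprocs - 1) copies.
 */
uint64_t redundant_pte_bytes(uint64_t mapped_bytes, uint64_t nprocs)
{
    if (nprocs == 0)
        return 0;
    return pte_bytes_per_process(mapped_bytes) * (nprocs - 1);
}
```

With these assumptions, 1 GiB of fully populated mappings costs 2 MiB of PTEs per process, so a hypothetical 101 processes inheriting the same mappings would carry 200 MiB of redundant PTEs.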

This can be solved by simply not using the Zygote model :p Or perhaps
MADV_DONTNEED/straight up unmapping libraries you don't need in the child's
side.

I think that's a separate topic, but that model is used on billions of
client devices :) The common runtime for apps and other core system
code is preloaded to significantly reduce app startup latencies.

quoted

I believe that managing a pseudo-filesystem (msharefs) and mapping via
ioctl during process creation could introduce overhead that impacts
app startup latency. Ideally, child apps shouldn't be aware of this
sharing or need to manage the pseudo-filesystem on their end. To
achieve this "transparent" sharing, I would prefer Khalid's previous
API from his 2022 RFC [1]. By attaching the shared mm directly to the
file's address_space and exposing a MAP_SHARED_PT flag, child apps
could transparently inherit the shared page tables during fork().

So, we've discussed this before. I initially liked this idea a lot more.
However, there are a couple of problems here:

1) mshare (as in the mshare feature) isn't really aiming for transparency here.
There is e.g. a specific need to set up an mshare region, with a few files/anon
there, and then later mprotect/munmap parts of the region - and have it apply
on every process that has it mapped. This is why we're aiming for different
system calls (not ioctls anymore), doing munmap(mshare_reg, 4096) is ambiguous
as to whether you want to unmap the mshare VMA, or a VMA inside the mshare mm.

Since we are interested in sharing text here, how does this play with
stuff like symbolization for call stacks? I believe this is another
reason why we might want to avoid mapping the pseudo mshare file
wrapper?

I haven't explored shared text, yet. There may be dragons there.

quoted

2) Sharing the page table at all (even worse so, Transparently(tm)) is a huge
pain. TLB shootdown becomes much harder, and rmap as-is isn't suited to deal
with this case. The way things are going with mshare, the container mm will
have one single entry in rmap, and then actually doing the shootdown is a
huuuuge pain (which, fwiw, will probably need a per-mshare TLB workaround),
because you need to find out and shoot down _every_ mm that has these tables

I agree the TLB shootdowns would be a pain. Perhaps, if there was a
concept of a shared ASID/PCID in the hardware, that would make things
less so ...

That would certainly help. sparc64 has a secondary context, but that
doesn't do us any good here. :-)

quoted

mapped. And then, naturally, since you're sharing page tables, doing A/D bit
collection on these becomes extremely useless - and that will naturally pose
problems to the reclaim process if you abuse it.

I think in the use case I described, it would mostly be sharing
MAP_PRIVATE stuff, and the access bit should still apply for global
reclaim. However, I agree it becomes difficult to reason about, especially if
you throw memcgs into the mix.

mshare won't support mapping objects in it with MAP_PRIVATE. Sharing
PTEs to memory that can be COW'd is problematic. If it's something that
can be adapted to use MAP_SHARED then maybe things can work.

I can see how mapping .text and .rodata as MAP_SHARED could
technically work, assuming the sharing process strictly mseals them to
guarantee they remain immutable. However, RELRO (.data.rel.ro) is a
different story. It must initially be mapped MAP_PRIVATE and writable
so that the dynamic linker can resolve relocations. Because these
modified pages cannot be written back to the backing file, they become
private anonymous pages.

If there was a way to allow an initially RW MAP_PRIVATE mapping to
resolve its relocations, be write-protected, mseal'd, and then have
its page tables shared, that would solve the RELRO issue. In the
Android Zygote model, this works perfectly because relocations resolve
before forking, meaning the resolutions are identical for all
children.
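
To make the sequence above concrete, here is a minimal userspace sketch of the resolve, write-protect, then fork flow. It is only an illustration under stated assumptions: an anonymous MAP_PRIVATE mapping stands in for a file-backed RELRO segment, a single store stands in for relocation processing, and the mseal() step is omitted (noted in a comment). The helper names are hypothetical. Nothing here shares page tables; it only shows that the child observes the parent's pre-fork contents.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Map a private, writable region, store one value into it (a stand-in
 * for the dynamic linker resolving relocations), then write-protect it.
 * A real loader would also mseal() the range at this point; that step
 * is left out of this sketch. Returns NULL on failure.
 */
long *make_relro_like(size_t len, long value)
{
    long *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return NULL;

    p[0] = value;                       /* "relocation" write */

    if (mprotect(p, len, PROT_READ) != 0) {
        munmap(p, len);
        return NULL;
    }
    return p;
}

/*
 * Fork without exec (Zygote-style) and check in the child that the
 * inherited read-only page holds the value written before the fork.
 * Returns 1 if the child saw it, 0 if not, -1 on error.
 */
int child_sees_value(const long *p, long expected)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(p[0] == expected ? 0 : 1);
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Because the contents are fixed before fork() and never written afterwards, every child ends up with identical pages here, which is what makes this range a candidate for shared page tables in the first place.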

But how would we express this in the msharefs model? Transitioning a
post-CoW MAP_PRIVATE VMA into a shared page table structure seems
fundamentally at odds with a strictly MAP_SHARED file-backed
pseudo-filesystem approach.

As for memcgs, the current idea is to have an owner associated with an
mshare region. Currently this is the process that creates the region.
Mappings in an mshare region will be evaluated against the mem cgroup
the owner is a part of.

I'd also like to think a bit about what happens to other standard
memory metrics. Should accounting that actively walks the page tables
(like /proc/pid/smaps) still work correctly and see the mappings?

What happens with the counter-based metrics tied directly to the
mm_struct rss stats?

Since msharefs manages its own detached mm_struct, should we include
those RSS counts across all sharing processes when reporting in
/proc/*/status? Or will we need to introduce a new UAPI to
independently expose and understand the RSS of each msharefs
mm_struct?
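
For reference, this is roughly how tooling consumes the counter-based numbers today: a small sketch that pulls a "Key:  N kB" field out of a /proc/<pid>/status style stream. The helper name is hypothetical. It only parses the existing text format and says nothing about how msharefs-owned pages would or should show up there.

```c
#include <stdio.h>
#include <string.h>

/*
 * Scan a /proc/<pid>/status style stream for a line of the form
 * "<key>:   <number> kB" and return the number, or -1 if the key is
 * missing or its value is not numeric.
 */
long status_field_kb(FILE *f, const char *key)
{
    char line[256];
    size_t klen = strlen(key);

    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long kb;

            if (sscanf(line + klen + 1, "%ld", &kb) == 1)
                return kb;
            return -1;
        }
    }
    return -1;
}
```

A caller would fopen("/proc/self/status", "r") and query fields such as VmRSS, RssAnon, RssFile and RssShmem, rewinding between lookups. Those are the per-mm counters whose meaning would need to be defined for mappings that live in a separate mshare mm_struct.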

Thanks,
Kalesh

Thanks,
Anthony

quoted

Thanks,
Kalesh

quoted

3) other misc problems that make it hard to work transparently (VMA alignment,
levels which you may or may not want to share, you need to revisit most page
table walkers in the kernel to get a completely transparent feature, etc)

quoted

Regarding David's and Matthew's discussion on VMA-modifying functions,
I would lean towards preferring the standard VMA-manipulating APIs
over custom ioctls, to preserve transparency for user-space.
Perhaps whether or not these modifications persist across all sharing
processes needs to be configurable? It seems that for database
workloads, having the updates reflected everywhere would be the
desired behavior. In the use case described for Android, we don't want
apps to be able to modify these shared ELF mappings. To handle this,
it's likely we would do something like mseal() the VMAs in the dynamic
loader before forking.

mshare_mseal!

quoted

Perhaps we could decouple the core sharing logic from the sharing API
itself? Since the sharing interface seems one of the main areas where
we don't have a good consensus yet, perhaps we could land the core
sharing logic first. Keeping the core infrastructure generic would

I think the core infrastructure is relatively generic (at least the
small core mm modifications to get this to even work) already, but
perhaps Anthony can comment on that.

--
Pedro
