Re: [PATCH v3] mm/filemap: Allow arch to request folio size for exec memory

From: Zi Yan <ziy@nvidia.com>
Date: 2025-03-28 13:33:06
Also in: linux-fsdevel, linux-mm, lkml

On 28 Mar 2025, at 9:09, Ryan Roberts wrote:

On 27/03/2025 20:07, Zi Yan wrote:

On 27 Mar 2025, at 12:44, Matthew Wilcox wrote:

On Thu, Mar 27, 2025 at 04:06:58PM +0000, Ryan Roberts wrote:

So let's special-case the read(ahead) logic for executable mappings. The
trade-off is performance improvement (due to more efficient storage of
the translations in iTLB) vs potential read amplification (due to
reading too much data around the fault which won't be used), and the
latter is independent of base page size. I've chosen 64K folio size for
arm64 which benefits both the 4K and 16K base page size configs and
shouldn't lead to any read amplification in practice since the old
read-around path was (usually) reading blocks of 128K. I don't
anticipate any write amplification because text is always RO.

Is there not also the potential for wasted memory due to ELF alignment?
Kalesh talked about it in the MM BOF at the same time that Ted and I
were discussing it in the FS BOF.  Some coordination required (like
maybe Kalesh could have mentioned it to me rather than assuming I'd be
there?)

+#define arch_exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)

I don't think the "arch" really adds much value here.

#define exec_folio_order()	get_order(SZ_64K)
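
For the base page sizes mentioned in this thread, the two spellings above should yield the same order. The following is a minimal userspace sketch to check that; sketch_ilog2() and sketch_get_order() are simplified stand-ins written for this illustration, not the kernel's ilog2() and get_order(), and the page shift is passed as a parameter instead of coming from PAGE_SHIFT.

```c
/*
 * Userspace sketch only. sketch_ilog2() and sketch_get_order() are
 * simplified stand-ins (assumptions), not the kernel implementations.
 */
#define SZ_64K 0x10000UL

/* floor(log2(n)) for n > 0; returns 0 for n == 0 in this sketch. */
unsigned int sketch_ilog2(unsigned long n)
{
	unsigned int log = 0;

	while (n >>= 1)
		log++;
	return log;
}

/* Smallest order such that (1 << order) pages cover `size` bytes. */
unsigned int sketch_get_order(unsigned long size, unsigned int page_shift)
{
	unsigned long pages = (size + (1UL << page_shift) - 1) >> page_shift;
	unsigned int order = 0;

	while ((1UL << order) < pages)
		order++;
	return order;
}

/* The spelling from the patch: ilog2(SZ_64K >> PAGE_SHIFT). */
unsigned int order_via_ilog2(unsigned int page_shift)
{
	return sketch_ilog2(SZ_64K >> page_shift);
}

/* The suggested spelling: get_order(SZ_64K). */
unsigned int order_via_get_order(unsigned int page_shift)
{
	return sketch_get_order(SZ_64K, page_shift);
}
```

With a page shift of 12, 14 or 16 (4K, 16K or 64K base pages), both helpers return 4, 2 and 0 respectively in this sketch.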

How about AMD’s PTE coalescing, which does PTE compression at the
16KB or 32KB level? A 64KB folio covers four 16KB or two 32KB units, so
at least it will not hurt AMD PTE coalescing. Starting with 64KB across
all arches might make the performance impact simpler to see. Just a
comment, no objection. :)

exec_folio_order() is defined per-architecture and SZ_64K is the arm64 preferred
size. At the moment x86 is not opted in, but they could choose to opt in with
32K (or whatever else makes sense) if the HW supports coalescing.
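
A per-architecture opt-in of this kind is commonly expressed with an #ifndef fallback. The sketch below only illustrates that pattern: the SKETCH_ARCH_* selectors, sketch_order(), the fixed 4K PAGE_SHIFT and the -1 "not opted in" fallback are all assumptions made for this example, not taken from the patch.

```c
/*
 * Sketch only. Everything except the names exec_folio_order(), SZ_64K
 * and PAGE_SHIFT is hypothetical, and the fallback value is an assumption.
 */
#define SZ_64K      0x10000UL
#define SZ_32K      0x8000UL
#define PAGE_SHIFT  12	/* assume 4K base pages for this sketch */

/* Simplified stand-in for get_order(); size is a power of two here. */
int sketch_order(unsigned long size)
{
	unsigned long pages = size >> PAGE_SHIFT;
	int order = 0;

	while ((1UL << order) < pages)
		order++;
	return order;
}

/* Hypothetical selector standing in for the per-arch header. */
#define SKETCH_ARCH_ARM64 1

#if defined(SKETCH_ARCH_ARM64)
/* arm64 prefers 64K folios for executable mappings (per the thread). */
#define exec_folio_order()	sketch_order(SZ_64K)
#elif defined(SKETCH_ARCH_X86_COALESCING)
/* What an x86 opt-in with 32K could look like; not part of the patch. */
#define exec_folio_order()	sketch_order(SZ_32K)
#endif

/* Generic fallback for architectures that have not opted in (assumed). */
#ifndef exec_folio_order
#define exec_folio_order()	(-1)
#endif
```

With the arm64 selector defined and 4K pages, exec_folio_order() evaluates to 4 here; an architecture that defines nothing falls through to the assumed fallback.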

Oh, I missed that part. I thought, since the arch_ prefix is not there, it
was the same for all arches.

I'm not sure if you thought this was global and are arguing against that, or if
you are arguing for it to be global because it will more easily show us
performance regressions earlier if x86 is doing this too?

I thought it was global. It might be OK to make it global and let each arch
optimize it as it rolls out. Opt-in might mean "never" until someone looks
into it, but if it is global and it changes performance, people will notice
and look into it.

--
Best Regards,
Yan, Zi
