Re: [PATCH v3] mm/filemap: Allow arch to request folio size for exec memory
From: Zi Yan <ziy@nvidia.com>
Date: 2025-03-28 13:33:06
Also in:
linux-fsdevel, linux-mm, lkml
On 28 Mar 2025, at 9:09, Ryan Roberts wrote:
> On 27/03/2025 20:07, Zi Yan wrote:
>> On 27 Mar 2025, at 12:44, Matthew Wilcox wrote:
>>> On Thu, Mar 27, 2025 at 04:06:58PM +0000, Ryan Roberts wrote:
>>>> So let's special-case the read(ahead) logic for executable mappings. The trade-off is performance improvement (due to more efficient storage of the translations in iTLB) vs potential read amplification (due to reading too much data around the fault which won't be used), and the latter is independent of base page size. I've chosen 64K folio size for arm64 which benefits both the 4K and 16K base page size configs and shouldn't lead to any read amplification in practice since the old read-around path was (usually) reading blocks of 128K. I don't anticipate any write amplification because text is always RO.
>>>
>>> Is there not also the potential for wasted memory due to ELF alignment? Kalesh talked about it in the MM BOF at the same time that Ted and I were discussing it in the FS BOF. Some coordination required (like maybe Kalesh could have mentioned it to me rather than assuming I'd be there?)
>>>
>>>> +#define arch_exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
>>>
>>> I don't think the "arch" really adds much value here.
>>>
>>> #define exec_folio_order() get_order(SZ_64K)
>>
>> How about AMD’s PTE coalescing, which does PTE compression at 16KB or 32KB level? It covers 4 16KB and 2 32KB, at least it will not hurt AMD PTE coalescing. Starting with 64KB across all arch might be simpler to see the performance impact. Just a comment, no objection. :)
>
> exec_folio_order() is defined per-architecture and SZ_64K is the arm64 preferred size. At the moment x86 is not opted in, but they could choose to opt in with 32K (or whatever else makes sense) if the HW supports coalescing.
Oh, I missed that part. I thought, since arch_ is not there, it was the same for all architectures.
> I'm not sure if you thought this was global and are arguing against that, or if you are arguing for it to be global because it will more easily show us performance regressions earlier if x86 is doing this too?
I thought it was global. It might be OK to set it globally and let each arch optimize it as it rolls out. Opt-in might mean "never" until someone looks into it, but if it is global and it changes performance, people will notice and look into it.

--
Best Regards,
Yan, Zi
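
For readers following the macro discussion, here is a minimal user-space sketch of the arithmetic behind the two spellings in the thread, ilog2(SZ_64K >> PAGE_SHIFT) and get_order(SZ_64K). It is not kernel code: ilog2_ul(), get_order_for(), order_via_ilog2() and check_orders_agree() are hypothetical stand-ins written for illustration, and the page shift is passed as a parameter rather than taken from the build-time PAGE_SHIFT. Under those assumptions, both forms give the same folio order for 4K, 16K and 64K base pages.

```c
#include <assert.h>

#define SZ_64K 0x10000UL

/* Hypothetical stand-in for the kernel's ilog2(): floor(log2(n)), n > 0. */
static int ilog2_ul(unsigned long n)
{
	int log = -1;

	while (n) {
		n >>= 1;
		log++;
	}
	return log;
}

/*
 * Hypothetical stand-in for get_order(): smallest order such that
 * (1 << order) pages of size (1 << page_shift) cover `size` bytes.
 */
static int get_order_for(unsigned long size, int page_shift)
{
	unsigned long page_size = 1UL << page_shift;
	unsigned long pages = (size + page_size - 1) >> page_shift;
	int order = 0;

	while ((1UL << order) < pages)
		order++;
	return order;
}

/* The form from the patch: ilog2(SZ_64K >> PAGE_SHIFT). */
static int order_via_ilog2(int page_shift)
{
	return ilog2_ul(SZ_64K >> page_shift);
}

/* Self-check: both forms agree for 4K, 16K and 64K base pages. */
static void check_orders_agree(void)
{
	static const int shifts[] = { 12, 14, 16 };
	unsigned int i;

	for (i = 0; i < sizeof(shifts) / sizeof(shifts[0]); i++)
		assert(order_via_ilog2(shifts[i]) ==
		       get_order_for(SZ_64K, shifts[i]));
}
```

With 4K pages (shift 12) a 64K folio is order 4, with 16K pages (shift 14) it is order 2, and with 64K pages (shift 16) it is order 0, so the choice between the two spellings is about readability rather than the resulting value. The two forms would diverge only if the requested size were smaller than the base page size, which does not happen for the three page sizes checked here.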