Re: [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64
From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Date: 2026-06-10 06:29:53
Also in: linux-mm, lkml
Subsystem: linux for powerpc (32-bit and 64-bit), memory management - swap, the rest
Maintainers: Madhavan Srinivasan, Michael Ellerman, Andrew Morton, Chris Li, Kairui Song, Linus Torvalds
YoungJun Park writes:
On Tue, Jun 09, 2026 at 06:49:30PM +0530, Ritesh Harjani (IBM) wrote:
On PowerPC Book3S64, MMU is selected at runtime, so macros like PMD_SHIFT are effectively runtime variables in the Book3S64 code. THP swap code uses these macros for e.g. to size some of its array data structures based on PMD_ORDER. This patch series makes that usage dependent on the runtime variable.

Sayali did some performance runs of this on Book3S64 with Radix and it gives 40-50% performance improvement. We also plan to run it with Hash, will soon update the results.

Note that this patch series is based out of linux-next (next-20260608).

Ritesh Harjani (IBM) (4):
  include/linux/swap.h: Remove unused leftovers
  mm, swap: make SWAPFILE_CLUSTER runtime
  mm, swap: make SWAP_NR_ORDERS runtime
  powerpc: Kconfig: Enable THP_SWAP on Book3S64

 arch/powerpc/platforms/Kconfig.cputype |   1 +
 include/linux/swap.h                   |  17 +---
 mm/swap.h                              |   5 +-
 mm/swap_table.h                        |   6 +-
 mm/swapfile.c                          | 132 ++++++++++++++++++-------
 5 files changed, 106 insertions(+), 55 deletions(-)

--
2.39.5

Hello!
Thanks for taking a look at this.
Instead of making SWAP_NR_ORDERS fully runtime, could we set it to the max PMD_ORDER possible on PowerPC Book3S64 as a compile-time constant in the swap.h ifdef block? (My assumption is that the max PMD_ORDER is not too big.)

I think the general runtime version adds cost, and it impacts all other archs: percpu_swap_cluster needs a runtime alloc, the si/offset and nonfull/frag arrays become separate pointers, and some accesses get one more indirection. And for nr_orders=1, the allocation itself is just waste.

With a compile-time possible max constant, the only downside is some acceptable amount of wasted bytes per CPU / per device on Book3S64 (the unused entries in the swap offset cache and the nonfull/frag lists), with no perf impact. The perf improvement comes from THP swap itself, right? Other arches see no impact at all.
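The trade-off described above can be sketched in plain userspace C. This is only an illustration: the struct and function names (cluster_static, cluster_runtime, and so on) and the value of MAX_NR_ORDERS are hypothetical, loosely mirroring the si/offset arrays mentioned for percpu_swap_cluster, and are not the kernel code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical compile-time upper bound; the value is arbitrary, for illustration. */
#define MAX_NR_ORDERS 9

/* Variant A: arrays embedded in the struct, sized by the compile-time maximum. */
struct cluster_static {
	void *si[MAX_NR_ORDERS];
	unsigned long offset[MAX_NR_ORDERS];
};

/* Variant B: arrays allocated once the runtime order count is known. */
struct cluster_runtime {
	int nr_orders;
	void **si;
	unsigned long *offset;
};

static unsigned long get_offset_static(const struct cluster_static *c, int order)
{
	assert(order >= 0 && order < MAX_NR_ORDERS);
	return c->offset[order];	/* element lives at a fixed offset inside *c */
}

static unsigned long get_offset_runtime(const struct cluster_runtime *c, int order)
{
	assert(order >= 0 && order < c->nr_orders);
	return c->offset[order];	/* load the array pointer first, then the element */
}

/* Allocate both arrays; returns 0 on success, -1 on allocation failure. */
static int cluster_runtime_init(struct cluster_runtime *c, int nr_orders)
{
	c->nr_orders = nr_orders;
	c->si = calloc((size_t)nr_orders, sizeof(*c->si));
	c->offset = calloc((size_t)nr_orders, sizeof(*c->offset));
	if (!c->si || !c->offset) {
		free(c->si);
		free(c->offset);
		c->si = NULL;
		c->offset = NULL;
		return -1;
	}
	return 0;
}

static void cluster_runtime_destroy(struct cluster_runtime *c)
{
	free(c->si);
	free(c->offset);
	c->si = NULL;
	c->offset = NULL;
}
```

Variant A trades a few unused array slots for having no allocation, no failure path, and no extra pointer load; Variant B sizes the arrays exactly but needs all three.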
I looked into the memory waste comparison between static vs. runtime alloc. The wastage for the per-CPU allocated data structures (with Radix MMU) will be 0 in comparison, because we use kcalloc_node(), which will use the kmalloc-64 slab, so slab padding would add some memory waste anyway. So it is as good as using static arrays with some max PMD_ORDER for the percpu_swap_cluster. For the other lists you mentioned, it only adds a one-time negligible cost, which doesn't justify making SWAP_NR_ORDERS runtime.
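The slab-padding point can be illustrated with a rough userspace sketch. Small kmalloc requests are served from fixed-size caches, so a request is rounded up to the next cache size. The size-class table below is an assumption (the real set depends on the kernel configuration), and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed small-object size classes; the real set depends on kernel config. */
static const size_t size_classes[] = { 8, 16, 32, 64, 96, 128, 192, 256 };

/* Return the smallest assumed size class that fits `bytes`, or 0 if none does. */
static size_t size_class_for(size_t bytes)
{
	for (size_t i = 0; i < sizeof(size_classes) / sizeof(size_classes[0]); i++) {
		if (bytes <= size_classes[i])
			return size_classes[i];
	}
	return 0;
}

/* Bytes lost to padding when `nr` elements of `elem_size` bytes are allocated. */
static size_t padding_for_array(size_t nr, size_t elem_size)
{
	size_t want = nr * elem_size;
	size_t got = size_class_for(want);

	assert(got != 0);	/* request must fit one of the assumed classes */
	return got - want;
}
```

For example, with 8-byte elements and an assumed count of 6 entries, the request is 48 bytes and lands in the 64-byte class, which is the same class that would hold 8 entries. Under these assumptions a runtime-sized array does not save memory over a slightly larger static one.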
Patch 2 looks fine as is. SWAPFILE_CLUSTER backs much bigger per-cluster arrays, so runtime sizing makes sense there, and it looks like it has no impact on other arches or the current code.
Yup, that makes sense. So, unless someone else raises an objection, I will give this a try instead of patch 3 in this series and will get back with v2.
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index e67e64ac6e8c..57abd8b2c9a1 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -204,6 +204,9 @@ extern unsigned long __pmd_frag_size_shift;
 #define MAX_PTRS_PER_PGD	(1 << (H_PGD_INDEX_SIZE > RADIX_PGD_INDEX_SIZE ? \
 				      H_PGD_INDEX_SIZE : RADIX_PGD_INDEX_SIZE))
 
+#define ARCH_MAX_PMD_ORDER	((H_PTE_INDEX_SIZE > RADIX_PTE_INDEX_SIZE) ? \
+				  H_PTE_INDEX_SIZE : RADIX_PTE_INDEX_SIZE)
+
 /* PMD_SHIFT determines what a second-level page table entry can map */
 #define PMD_SHIFT	(PAGE_SHIFT + PTE_INDEX_SIZE)
 #define PMD_SIZE	(1UL << PMD_SHIFT)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46c25523d7b8..5f1451f8f266 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -224,10 +224,14 @@ enum {
 #define SWAP_ENTRY_INVALID	0
 #ifdef CONFIG_THP_SWAP
+#ifdef ARCH_MAX_PMD_ORDER
+#define SWAP_NR_ORDERS		(ARCH_MAX_PMD_ORDER + 1)
+#else
 #define SWAP_NR_ORDERS		(PMD_ORDER + 1)
+#endif /* ARCH_MAX_PMD_ORDER */
 #else
 #define SWAP_NR_ORDERS		1
-#endif
+#endif /* CONFIG_THP_SWAP */

-ritesh