Thread: 11 messages, 5 authors, 21d ago

Re: [RFC 0/4] mm, swap: Enable THP SWAP for PowerPC Book3S64

From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Date: 2026-06-10 06:29:53
Also in: linux-mm, lkml
Subsystem: linux for powerpc (32-bit and 64-bit), memory management - swap, the rest · Maintainers: Madhavan Srinivasan, Michael Ellerman, Andrew Morton, Chris Li, Kairui Song, Linus Torvalds

YoungJun Park [off-list ref] writes:
On Tue, Jun 09, 2026 at 06:49:30PM +0530, Ritesh Harjani (IBM) wrote:
On PowerPC Book3S64, the MMU is selected at runtime, so macros like PMD_SHIFT
are effectively runtime variables in the Book3S64 code. THP swap code uses these
macros, e.g. to size some of its array data structures based on PMD_ORDER.
This patch series makes that usage dependent on the runtime variable.

Sayali did some performance runs of this on Book3S64 with Radix and it gives a
40-50% performance improvement. We also plan to run it with Hash and will
update the results soon.

Note that this patch series is based on linux-next (next-20260608).

Ritesh Harjani (IBM) (4):
  include/linux/swap.h: Remove unused leftovers
  mm, swap: make SWAPFILE_CLUSTER runtime
  mm, swap: make SWAP_NR_ORDERS runtime
  powerpc: Kconfig: Enable THP_SWAP on Book3S64

 arch/powerpc/platforms/Kconfig.cputype |   1 +
 include/linux/swap.h                   |  17 +---
 mm/swap.h                              |   5 +-
 mm/swap_table.h                        |   6 +-
 mm/swapfile.c                          | 132 ++++++++++++++++++-------
 5 files changed, 106 insertions(+), 55 deletions(-)

--
2.39.5
Hello!
Thanks for taking a look at this.
Instead of making SWAP_NR_ORDERS fully runtime, could we set it to the max
PMD_ORDER possible on PowerPC Book3S64 as a compile-time constant in the
swap.h ifdef block? (My assumption is that the max PMD_ORDER is not too big.)

I think the general runtime version adds cost, and it impacts all other archs:
percpu_swap_cluster needs a runtime alloc, the si/offset and nonfull/frag
arrays become separate pointers, and some accesses get one more indirection.
And for nr_orders=1, the allocation itself is just waste.

With a compile-time max constant, the only downside is an acceptable amount of
wasted bytes per CPU / per device on Book3S64 (the unused entries in the swap
offset cache and the nonfull/frag lists), with no perf impact. The perf
improvement comes from THP swap itself, right? Other arches see no
impact at all.
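
To make the trade-off concrete, here is a minimal userspace sketch of the two layouts being compared. The names (`cluster_static`, `cluster_runtime`, `MAX_NR_ORDERS`) and the value 10 are hypothetical stand-ins chosen for illustration; they are not the kernel's actual definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical compile-time upper bound on the number of swap orders. */
#define MAX_NR_ORDERS 10

/* Static layout: the array lives inside the struct, sized for the worst case. */
struct cluster_static {
	unsigned long offset[MAX_NR_ORDERS];
};

/* Runtime layout: the array is allocated separately and reached via a pointer. */
struct cluster_runtime {
	unsigned long *offset;
	unsigned int nr_orders;
};

/* One load: the slot sits at a fixed displacement from the struct. */
static unsigned long static_get(const struct cluster_static *c, unsigned int order)
{
	assert(order < MAX_NR_ORDERS);
	return c->offset[order];
}

/* Two dependent loads: first the pointer, then the slot. */
static unsigned long runtime_get(const struct cluster_runtime *c, unsigned int order)
{
	assert(order < c->nr_orders);
	return c->offset[order];
}

/* Bytes left unused in the static layout when only nr_orders slots are live. */
static size_t static_unused_bytes(unsigned int nr_orders)
{
	assert(nr_orders <= MAX_NR_ORDERS);
	return (MAX_NR_ORDERS - nr_orders) * sizeof(unsigned long);
}

/* Allocate the runtime array; returns 0 on success, -1 on failure. */
static int runtime_init(struct cluster_runtime *c, unsigned int nr_orders)
{
	c->offset = calloc(nr_orders, sizeof(*c->offset));
	if (!c->offset)
		return -1;
	c->nr_orders = nr_orders;
	return 0;
}
```

The static variant trades a few unused slots for a simpler access path; the runtime variant needs an allocation (and its failure handling) even when only one order is in use.
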
I looked into the memory waste comparison between static vs. runtime
alloc. The extra wastage for the per-cpu data structures (with Radix
MMU) will be 0, because we use kcalloc_node(), which will use the kmalloc-64
slab, so slab padding would add some memory waste anyway. So it is as
good as using static arrays sized for some max PMD_ORDER for the
percpu_swap_cluster.
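
The slab-padding argument can be sketched in userspace as below. The size-class table is an assumption (the real kmalloc cache sizes depend on kernel configuration), the 8-byte slot size assumes a 64-bit `unsigned long`, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed kmalloc size classes; the real set depends on kernel configuration. */
static const size_t size_classes[] = { 8, 16, 32, 64, 96, 128, 192, 256 };

/* Return the smallest assumed size class that fits `bytes`, or 0 if none does. */
static size_t size_class_for(size_t bytes)
{
	for (size_t i = 0; i < sizeof(size_classes) / sizeof(size_classes[0]); i++)
		if (bytes <= size_classes[i])
			return size_classes[i];
	return 0;
}

/* Padding added when an array of nr_orders 8-byte slots is requested. */
static size_t slab_padding(unsigned int nr_orders)
{
	size_t bytes = (size_t)nr_orders * 8;
	size_t cls = size_class_for(bytes);

	assert(cls != 0);
	return cls - bytes;
}
```

Under these assumptions, a request for 6 slots (48 bytes) lands in the 64-byte class, the same object size a static array of up to 8 slots would occupy.
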

For the other lists you mentioned, it only adds a one-time, negligible
cost anyway, which doesn't justify making SWAP_NR_ORDERS runtime.
patch 2 looks fine as is. SWAPFILE_CLUSTER backs much bigger per-cluster
arrays, so runtime sizing makes sense there, and it looks like no impact to
other arches or the current code.
Yup. That makes sense.


So, unless someone else raises any objection - I will give this a try
instead of patch-3 in this series and will get back with v2.

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index e67e64ac6e8c..57abd8b2c9a1 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -204,6 +204,9 @@ extern unsigned long __pmd_frag_size_shift;
 #define MAX_PTRS_PER_PGD       (1 << (H_PGD_INDEX_SIZE > RADIX_PGD_INDEX_SIZE ? \
                                       H_PGD_INDEX_SIZE : RADIX_PGD_INDEX_SIZE))

+#define ARCH_MAX_PMD_ORDER ((H_PTE_INDEX_SIZE > RADIX_PTE_INDEX_SIZE) ? \
+                               H_PTE_INDEX_SIZE : RADIX_PTE_INDEX_SIZE)
+
 /* PMD_SHIFT determines what a second-level page table entry can map */
 #define PMD_SHIFT      (PAGE_SHIFT + PTE_INDEX_SIZE)
 #define PMD_SIZE       (1UL << PMD_SHIFT)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46c25523d7b8..5f1451f8f266 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -224,10 +224,14 @@ enum {
 #define SWAP_ENTRY_INVALID     0

 #ifdef CONFIG_THP_SWAP
+#ifdef ARCH_MAX_PMD_ORDER
+#define SWAP_NR_ORDERS         (ARCH_MAX_PMD_ORDER + 1)
+#else
 #define SWAP_NR_ORDERS         (PMD_ORDER + 1)
+#endif /* ARCH_MAX_PMD_ORDER */
 #else
 #define SWAP_NR_ORDERS         1
-#endif
+#endif /* CONFIG_THP_SWAP */
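
The macro selection in the diff above can be tried out in userspace with the sketch below. The `_EX` names and the values 8 and 5 are placeholders for illustration only; the real index sizes come from the arch headers and depend on the page-size configuration.

```c
#include <assert.h>

/* Placeholder values for illustration; not taken from the arch headers. */
#define H_PTE_INDEX_SIZE_EX      8
#define RADIX_PTE_INDEX_SIZE_EX  5

/* Same shape as the proposed ARCH_MAX_PMD_ORDER: the larger of two constants. */
#define ARCH_MAX_PMD_ORDER_EX ((H_PTE_INDEX_SIZE_EX > RADIX_PTE_INDEX_SIZE_EX) ? \
				H_PTE_INDEX_SIZE_EX : RADIX_PTE_INDEX_SIZE_EX)

#define SWAP_NR_ORDERS_EX (ARCH_MAX_PMD_ORDER_EX + 1)

/* The result is an integer constant expression, checkable at compile time. */
static_assert(SWAP_NR_ORDERS_EX > 0, "need at least one order");

/* Because it is a constant, it can size a static array. */
static unsigned long offset_cache[SWAP_NR_ORDERS_EX];

/* Number of slots in the statically sized array. */
static unsigned int nr_slots(void)
{
	return sizeof(offset_cache) / sizeof(offset_cache[0]);
}

/* True if an order chosen at runtime fits within the static array. */
static int order_fits(unsigned int runtime_pmd_order)
{
	return runtime_pmd_order < nr_slots();
}
```

Whichever MMU is picked at boot, its PMD order stays within the array as long as the compile-time maximum covers both candidates.
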


-ritesh