Re: [PATCH 1/2] mm/khugepaged: do synchronous writeback for MADV_COLLAPSE

From: Lorenzo Stoakes <hidden>
Date: 2025-11-10 16:31:13
Also in: linux-mm, lkml

On Mon, Nov 10, 2025 at 01:22:16PM +0000, Lorenzo Stoakes wrote:

On Mon, Nov 10, 2025 at 06:37:58PM +0530, Garg, Shivank wrote:



On 11/10/2025 5:31 PM, Lorenzo Stoakes wrote:


On Mon, Nov 10, 2025 at 11:32:53AM +0000, Shivank Garg wrote:


When MADV_COLLAPSE is called on file-backed mappings (e.g., executable


---
Applies cleanly on:
6.18-rc5
mm-stable:e9a6fb0bc

Please base on mm-unstable. mm-stable is usually out of date until very close to
merge window.

I'm observing issues when testing with kselftest on the mm-unstable and mm-new
branches that prevent proper testing of my patches:

On mm-unstable (without my patches):

# # running ./transhuge-stress -d 20
# # --------------------------------
# # TAP version 13
# # 1..1
# # transhuge-stress: allocate 220271 transhuge pages, using 440543 MiB virtual memory and 1720 MiB of ram


[  367.225667] RIP: 0010:swap_cache_get_folio+0x2d/0xc0
[  367.230635] Code: 00 00 48 89 f9 49 89 f9 48 89 fe 48 c1 e1 06 49 c1 e9 3a 48 c1 e9 0f 48 c1 e1 05 4a 8b 04 cd c0 2e 5b 99 48 8b 78 60 48 01 cf <48> 8b 47 08 48 85 c0 74 20 48 89 f2 81 e2 ff 01 00 00 48 8d 04 d0
[  367.249378] RSP: 0000:ffffcde32943fba8 EFLAGS: 00010282
[  367.254605] RAX: ffff8bd1668fdc00 RBX: 00007ffc15df5000 RCX: 00003fffffffffe0
[  367.261736] RDX: ffffffff995cb530 RSI: 0003ffffffffffff RDI: ffffcbd1560dffe0
[  367.268862] RBP: 0003ffffffffffff R08: ffffcde32943fc47 R09: 0000000000000000
[  367.275994] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  367.283129] R13: 0000000000000000 R14: ffff8bd1668fdc00 R15: 0000000000100cca
[  367.290260] FS:  00007ff600af5b80(0000) GS:ffff8c4e9ec7e000(0000) knlGS:0000000000000000
[  367.298344] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  367.304083] CR2: ffffcbd1560dffe8 CR3: 00000001280e9001 CR4: 0000000000770ef0
[  367.311216] PKRU: 55555554
[  367.313929] Call Trace:
[  367.316375]  <TASK>
[  367.318479]  __read_swap_cache_async+0x8e/0x1b0
[  367.323014]  swap_vma_readahead+0x23d/0x430
[  367.327198]  swapin_readahead+0xb0/0xc0
[  367.331039]  do_swap_page+0x5bc/0x1260
[  367.334789]  ? rseq_ip_fixup+0x6f/0x190
[  367.338631]  ? __pfx_default_wake_function+0x10/0x10
[  367.343596]  __handle_mm_fault+0x49a/0x760
[  367.347696]  handle_mm_fault+0x188/0x300
[  367.351620]  do_user_addr_fault+0x15b/0x6c0
[  367.355807]  exc_page_fault+0x60/0x100
[  367.359562]  asm_exc_page_fault+0x22/0x30
[  367.363574] RIP: 0033:0x7ff60091ba99
[  367.367153] Code: f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 40 c4 01 00 f3 0f 1e fa 80 3d b5 f5 0e 00 00 74 13 31 c0 0f 05 48 3d 00 f0 ff ff 77 4f <c3> 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75
[  367.385897] RSP: 002b:00007ffc15df1118 EFLAGS: 00010203
[  367.391124] RAX: 0000000000000001 RBX: 000055941fb672a0 RCX: 00007ff60091ba91
[  367.398256] RDX: 0000000000000001 RSI: 000055941fb813e0 RDI: 0000000000000000
[  367.405387] RBP: 00007ffc15df21e0 R08: 0000000000000000 R09: 0000000000000007
[  367.412513] R10: 000055941fb97cb0 R11: 0000000000000246 R12: 000055941fb813e0
[  367.419646] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  367.426781]  </TASK>
[  367.428970] Modules linked in: xfrm_user xfrm_algo xt_addrtype xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables overlay bridge stp llc cfg80211 rfkill binfmt_misc ipmi_ssif amd_atl intel_rapl_msr intel_rapl_common wmi_bmof amd64_edac edac_mce_amd mgag200 rapl drm_client_lib i2c_algo_bit drm_shmem_helper drm_kms_helper acpi_cpufreq i2c_piix4 ptdma k10temp i2c_smbus wmi acpi_power_meter ipmi_si acpi_ipmi ipmi_devintf ipmi_msghandler sg dm_multipath drm fuse dm_mod nfnetlink ext4 crc16 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 kvm_amd sd_mod ahci nvme libahci kvm libata nvme_core tg3 ccp megaraid_sas irqbypass
[  367.497528] CR2: ffffcbd1560dffe8
[  367.500846] ---[ end trace 0000000000000000 ]---

Yikes, oopsies!

I'll try running the tests locally on my Threadripper. I ran tests against your
patches previously and they seemed fine, which is strange. Maybe it has been
fixed since, but let me try. Or maybe it's because swap is not enabled locally
for me?

Likely this actually...

I have tried on a swap-enabled setup and see no issue with mm-unstable.
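
For anyone comparing setups, one quick way to confirm whether swap is active on
a test machine is to read /proc/swaps, which lists one line per active swap area
after a header line. This is a minimal sketch only; the function name
`swap_enabled` and its optional file argument are illustrative and do not come
from this thread.

```shell
#!/bin/sh
# Return 0 if at least one swap area is listed, 1 otherwise.
# The optional argument (hypothetical, handy for testing) overrides /proc/swaps.
swap_enabled() {
    swaps_file="${1:-/proc/swaps}"
    # Line 1 is the column header; any further non-empty line is an active swap area.
    [ "$(tail -n +2 "$swaps_file" | grep -c .)" -gt 0 ]
}
```

Calling `swap_enabled && echo "swap is on"` before running the selftests makes it
obvious whether the swap-in path can be exercised at all on that machine.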

So this is odd. I know you have limited time (_totally sympathise_), but if you
get a moment, is it at all possible to bisect against tip mm-unstable/mm-new?

Obviously we want to make sure buggy swap code doesn't get merged to mainline!
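
One way such a bisect could be automated, sketched under assumptions:
`git bisect run` treats exit status 0 as good, 1 to 127 (except 125) as bad, and
125 as "skip this commit". The function below classifies a captured kernel log
using the two signatures reported in this thread. The name `classify_log` and the
log file argument are hypothetical, and building, booting, and running the
selftest on each candidate kernel is left to the surrounding script.

```shell
#!/bin/sh
# Classify a saved kernel log as the final step of a `git bisect run` script.
# Prints good, bad, or skip, and returns the matching bisect exit status.
classify_log() {
    log="$1"
    # No log captured (e.g. the kernel failed to boot): ask bisect to skip.
    if [ ! -s "$log" ]; then
        echo skip
        return 125
    fi
    # Signatures taken from the reports in this thread.
    if grep -q -e 'get_swap_device: Bad swap offset entry' \
               -e 'RIP: 0010:swap_cache_get_folio' "$log"; then
        echo bad
        return 1
    fi
    echo good
    return 0
}
```

A wrapper script ending in `classify_log` could then be passed to `git bisect run`
after `git bisect start` with a bad and a good revision. The known-good revision
is not stated in this thread, so it would have to be found first.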




-----------------
On mm-new (without my patches):

[  394.144770] get_swap_device: Bad swap offset entry 3ffffffffffff

dmesg | grep "get_swap_device: Bad swap offset entry" | wc -l
359


Additionally, kexec triggers an oops and crash during swapoff:


         Deactivating swap swap.img.swap - /swap.img...
[  704.854238] BUG: unable to handle page fault for address: ffffcbe2de8dffe8
[  704.861524] #PF: supervisor read access in kernel mode
[  704.866666] #PF: error_code(0x0000) - not-present page
[  704.873253] PGD 0 P4D 0
[  704.875790] Oops: Oops: 0000 [#1] SMP NOPTI
[  704.881354] CPU: 122 UID: 0 PID: 107680 Comm: swapoff Kdump: loaded Not tainted 6.18.0-rc5+ #11 NONE
[  704.891283] Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.16.2 07/09/2024
[  704.898930] RIP: 0010:swap_cache_get_folio+0x2d/0xc0
[  704.903907] Code: 00 00 48 89 f9 49 89 f9 48 89 fe 48 c1 e1 06 49 c1 e9 3a 48 c1 e9 0f 48 c1 e1 05 4a 8b 04 cd c0 2e 7b 95 48 8b 78 60 48 01 cf <48> 8b 47 08 48 85 c0 74 20 48 89 f2 81 e2 ff 01 00 00 48 8d 04 d0
[  704.922720] RSP: 0018:ffffcf1227b1fc08 EFLAGS: 00010282
[  704.928035] RAX: ffff8be2cefb3c00 RBX: 0000555c65a5c000 RCX: 00003fffffffffe0
[  704.928036] RDX: 0003ffffffffffff RSI: 0003ffffffffffff RDI: ffffcbe2de8dffe0
[  704.928037] RBP: 0000000000000000 R08: ffff8be2de8e0520 R09: 0000000000000000
         Unmounting boot.mount - /boot...
[  704.928038] R10: 000000000000ffff R11: ffffcf12236f4000 R12: ffff8be2d5b8d968
[  704.928039] R13: 0003ffffffffffff R14: fffff3eec85eb000 R15: 0000555c65a51000
[  704.928039] FS:  00007f41fcab3800(0000) GS:ffff8c602b6fe000(0000) knlGS:0000000000000000
[  704.928040] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  704.928041] CR2: ffffcbe2de8dffe8 CR3: 00000074981af004 CR4: 0000000000770ef0
[  704.928041] PKRU: 55555554
[  704.928042] Call Trace:
[  704.928043]  <TASK>
[  704.928044]  unuse_pte_range+0x10b/0x290
[  704.928047]  unuse_pud_range.isra.0+0x149/0x190
[  704.928048]  unuse_vma+0x1a6/0x220
[  704.928050]  unuse_mm+0x9b/0x110
[  704.928052]  try_to_unuse+0xc5/0x260
[  704.928053]  __do_sys_swapoff+0x244/0x670
[  705.016662]  do_syscall_64+0x67/0xc50
[  705.016667]  ? do_user_addr_fault+0x15b/0x6c0
[  705.026100]  ? exc_page_fault+0x60/0x100
[  705.031498]  ? irqentry_exit_to_user_mode+0x20/0xe0
[  705.036377]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  705.042200] RIP: 0033:0x7f41fc9271bb
[  705.045780] Code: 0f 1e fa 48 83 fe 01 48 8b 15 59 bc 0d 00 19 c0 83 e0 f0 83 c0 26 64 89 02 b8 ff ff ff ff c3 f3 0f 1e fa b8 a8 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2d bc 0d 00 f7 d8 64 89 01 48
[  705.064807] RSP: 002b:00007ffd14b5b6e8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a8
[  705.064809] RAX: ffffffffffffffda RBX: 00007ffd14b5cf30 RCX: 00007f41fc9271bb
[  705.064810] RDX: 0000000000000001 RSI: 0000000000000c00 RDI: 000055d48f533a40
[  705.064810] RBP: 00007ffd14b5b750 R08: 00007f41fca03b20 R09: 0000000000000000
[  705.064811] R10: 0000000000000001 R11: 0000000000000202 R12: 0000000000000000
[  705.064811] R13: 0000000000000000 R14: 000055d4584f1479 R15: 000055d4584f2b20
[  705.064813]  </TASK>
[  705.064814] Modules linked in: xfrm_user xfrm_algo xt_addrtype xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables overlay bridge stp llc cfg80211 rfkill binfmt_misc ipmi_ssif amd_atl intel_rapl_msr intel_rapl_common wmi_bmof amd64_edac edac_mce_amd rapl mgag200 drm_client_lib i2c_algo_bit drm_shmem_helper drm_kms_helper acpi_cpufreq i2c_piix4 ptdma ipmi_si k10temp i2c_smbus acpi_power_meter wmi acpi_ipmi ipmi_msghandler sg dm_multipath fuse drm dm_mod nfnetlink ext4 crc16 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 sd_mod kvm_amd ahci libahci kvm nvme tg3 libata ccp irqbypass nvme_core megaraid_sas [last unloaded: ipmi_devintf]
[  705.180420] CR2: ffffcbe2de8dffe8
[  705.183852] ---[ end trace 0000000000000000 ]---


I haven't had cycles to dig into this yet and have been swamped with other things.

Fully understand, I'm _very_ familiar with this situation :)

I need more cores... ;)

Oh it's nice to have more :) I am bankrupt now, but it's nice to have more ;)

Cheers, Lorenzo
