Re: [PATCH v3] mm/page_alloc: bail out on fatal signal during reclaim/compaction retry attempt
From: Matthew Wilcox <willy@infradead.org>
Date: 2021-05-20 12:41:00
On Thu, May 20, 2021 at 12:42:57PM +0100, Aaron Tomlin wrote:
On Thu 2021-05-20 12:20 +0200, Vlastimil Babka wrote:
On 5/20/21 6:34 AM, Andrew Morton wrote:
What observed problems motivated this change? What were the observed runtime effects of this change?

Yep those details from the previous thread should be included here.

Fair enough. During analysis of a kernel crash dump (vmcore), I found that, in the context of __alloc_pages_slowpath(), the value stored in the no_progress_loops variable was 31,611,688, i.e. well above MAX_RECLAIM_RETRIES, and that a fatal signal was pending against current.
While this is true, it's not really answering Andrew's question. What we want as part of the commit message is something like: "A customer experienced a low memory situation and sent their task a fatal signal. Instead of dying promptly, it looped in the page allocator failing to make progress because ..."
#6 [ffff00002e78f7c0] do_try_to_free_pages+0xe4 at ffff00001028bd24
#7 [ffff00002e78f840] try_to_free_pages+0xe4 at ffff00001028c0f4
#8 [ffff00002e78f900] __alloc_pages_nodemask+0x500 at ffff0000102cd130
// w0 = *(sp + 148) /* no_progress_loops */
0xffff0000102cd1e0 <__alloc_pages_nodemask+0x5b0>: ldr w0, [sp,#148]
// w0 = w0 + 0x1
0xffff0000102cd1e4 <__alloc_pages_nodemask+0x5b4>: add w0, w0, #0x1
// *(sp + 148) = w0
0xffff0000102cd1e8 <__alloc_pages_nodemask+0x5b8>: str w0, [sp,#148]
// if (w0 > 0x10)
// goto __alloc_pages_nodemask+0x904
0xffff0000102cd1ec <__alloc_pages_nodemask+0x5bc>: cmp w0, #0x10
0xffff0000102cd1f0 <__alloc_pages_nodemask+0x5c0>: b.gt 0xffff0000102cd534
- The stack pointer was 0xffff00002e78f900
crash> p *(int *)(0xffff00002e78f900+148)
$1 = 31611688
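The address dereferenced above follows directly from the disassembly: the counter is stored 148 bytes above the stack pointer of frame #8. A minimal sketch of that arithmetic, assuming the stack pointer quoted above; the helper and constant names are made up for illustration and are not part of crash or the kernel.

```python
# Illustrative only: these names are hypothetical, not crash or kernel symbols.
SP = 0xffff00002e78f900          # stack pointer of frame #8, as quoted above
NO_PROGRESS_LOOPS_OFFSET = 148   # from "ldr w0, [sp,#148]"
LIMIT = 0x10                     # immediate operand of "cmp w0, #0x10"


def local_address(sp, offset):
    """Return the address of a stack-resident local stored at sp + offset."""
    return sp + offset


addr = local_address(SP, NO_PROGRESS_LOOPS_OFFSET)
observed = 31611688              # value printed by crash as $1

print(hex(addr))                 # address that crash dereferenced
print(observed > LIMIT)          # condition tested by "b.gt"
```

With the observed value, the comparison encoded by `cmp`/`b.gt` holds by a very wide margin.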
crash> ps 521171
PID PPID CPU TASK ST %MEM VSZ RSS COMM
> 521171 1 36 ffff8080e2128800 RU 0.0 34789440 18624 special
crash> p &((struct task_struct *)0xffff8080e2128800)->signal.shared_pending
$2 = (struct sigpending *) 0xffff80809a416e40
crash> p ((struct sigpending *)0xffff80809a416e40)->signal.sig[0]
$3 = 0x804100
crash> sig -s 0x804100
SIGKILL SIGTERM SIGXCPU
crash> p ((struct sigpending *)0xffff80809a416e40)->signal.sig[0] & 1U << (9 - 1)
$4 = 0x100
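The mask decoding done by `sig -s` can be reproduced by hand: in a sigset word, bit n-1 represents signal n. The sketch below assumes the generic Linux signal numbering (SIGKILL 9, SIGTERM 15, SIGXCPU 24); the function names and the small name table are illustrative and cover only the signals seen here.

```python
# Illustrative helpers; not part of crash. The table lists only the signals
# relevant to this dump, using the generic Linux numbering (an assumption).
SIGNAL_NAMES = {9: "SIGKILL", 15: "SIGTERM", 24: "SIGXCPU"}


def pending_signals(mask):
    """Return the signal numbers whose bits are set (bit n-1 <-> signal n)."""
    return [bit + 1 for bit in range(64) if mask & (1 << bit)]


def is_pending(mask, signo):
    """Same test as the crash expression: mask & 1U << (signo - 1)."""
    return bool(mask & (1 << (signo - 1)))


mask = 0x804100                  # shared_pending.signal.sig[0] from the dump
for signo in pending_signals(mask):
    print(signo, SIGNAL_NAMES.get(signo, "unknown"))

print(is_pending(mask, 9))       # is SIGKILL pending?
```

Bit 8 (0x100) corresponds to SIGKILL, which matches the `$4 = 0x100` result above.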
Unfortunately, this incident has not been reproduced to date.
Kind regards,
--
Aaron Tomlin