Re: [RFC] memory reserve for userspace oom-killer

From: Suren Baghdasaryan <hidden>
Date: 2021-05-05 02:59:35
Also in: linux-mm, lkml

On Tue, May 4, 2021 at 7:45 PM Shakeel Butt [off-list ref] wrote:

On Tue, May 4, 2021 at 6:26 PM Suren Baghdasaryan [off-list ref] wrote:

quoted

On Tue, May 4, 2021 at 5:37 PM Shakeel Butt [off-list ref] wrote:

quoted

On Wed, Apr 21, 2021 at 7:29 AM Michal Hocko [off-list ref] wrote:

quoted

[...]

quoted

What if the pool is depleted?

This would mean that either the estimate of mempool size is bad or
oom-killer is buggy and leaking memory.

I am open to any design directions for mempool or some other way where
we can provide a notion of memory guarantee to oom-killer.

OK, thanks for the clarification. There will certainly be hard problems to
sort out[1] but the overall idea makes sense to me and it sounds like a
much better approach than an OOM-specific solution.


[1] - how the pool is going to be replenished without hitting all
potential reclaim problems (and thus dependencies on all other tasks,
directly or indirectly), yet without relying on any background workers
to do that on the task's behalf without proper accounting, etc.
--

I am currently contemplating between two paths here:

First, the mempool, exposed through either prctl or a new syscall.
Users would need to trace their userspace oom-killer (or whatever
their use case is) to find an appropriate mempool size they would need
and periodically refill the mempools if allowed by the state of the
machine. The challenge here is finding a good value for the mempool
size and coordinating the refilling of mempools.
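
A toy model of the first option may help frame the sizing question. This
is only a sketch under assumptions: the ReservePool class and its methods
are hypothetical names, pages are plain counters, and nothing here
corresponds to an existing prctl or syscall.

```python
class ReservePool:
    """Toy model of a fixed-size memory reserve (hypothetical, not a kernel API)."""

    def __init__(self, size_pages):
        self.size_pages = size_pages   # target size chosen by the user
        self.free_pages = size_pages   # pages currently held in reserve

    def alloc(self, pages):
        """Take pages from the reserve; False means the pool is depleted."""
        if pages > self.free_pages:
            return False
        self.free_pages -= pages
        return True

    def refill(self, machine_free_pages):
        """Top up from whatever the machine can spare; returns pages moved."""
        moved = min(self.size_pages - self.free_pages, machine_free_pages)
        self.free_pages += moved
        return moved
```

An alloc() that returns False is the "pool is depleted" case discussed
earlier in the thread: either the size estimate was too small or the
refill did not happen in time.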

Second is a much simplified mix of Roman's and Peter's suggestions: a
very simple watchdog with a kill-list of processes. If userspace does
not pet the watchdog within a specified time, the watchdog kills all
the processes in the kill-list. The challenge here is to
maintain/update the kill-list.
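
The watchdog semantics described above can be modeled in a few lines.
This is a userspace sketch, not a proposed kernel interface: the
KillListWatchdog class, its pet()/check() methods and the kill_fn
callback are all hypothetical names chosen for illustration.

```python
import time


class KillListWatchdog:
    """Toy model of the proposed watchdog; not a kernel interface."""

    def __init__(self, timeout_s, kill_list, kill_fn, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.kill_list = list(kill_list)  # hypothetical: PIDs registered by userspace
        self.kill_fn = kill_fn            # hypothetical: callback that kills one PID
        self.clock = clock
        self.last_pet = clock()

    def pet(self):
        """The userspace oom-killer signals that it is still making progress."""
        self.last_pet = self.clock()

    def check(self):
        """Kill everything on the kill-list if the deadline was missed.

        Returns the list of PIDs that were killed (empty if none).
        """
        if self.clock() - self.last_pet < self.timeout_s:
            return []
        killed = list(self.kill_list)
        for pid in killed:
            self.kill_fn(pid)
        self.kill_list.clear()
        return killed
```

Note that check() has no idea why the pet was missed, which is the
ambiguity behind the third downside raised in the reply.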

IIUC this solution is designed to identify cases when oomd/lmkd got
stuck while allocating memory due to memory shortages and therefore
can't feed the watchdog. In such a case the kernel goes ahead and
kills some processes to free up memory and unblock the blocked
process. Effectively this would limit the time such a process gets
stuck by the duration of the watchdog timeout. If my understanding of
this proposal is correct,

Your understanding is indeed correct.

quoted

then I see the following downsides:
1. oomd/lmkd are still not prevented from being stuck; the watchdog
just limits the duration of this blocked state. Delaying kills when
memory pressure is high, even for a short duration, is very undesirable.

Yes I agree.

quoted

I think
having mempool reserves could address this issue better if it can
always guarantee memory availability (not sure if it's possible in
practice).

I think "mempool ... always guarantee memory availability" is
something I should quantify with some experiments.

quoted

2. What would be the performance overhead of this watchdog? To limit the
duration of a process being blocked to a small enough value we would
have to have quite a small timeout, which means oomd/lmkd would have
to wake up quite often to feed the watchdog. Frequent wakeups on a
battery-powered system are not a good idea.

This is indeed the downside, i.e. the tradeoff between an acceptable
stall and frequent wakeups.
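
To put rough numbers on that tradeoff, here is a back-of-the-envelope
sketch. It assumes the daemon pets the watchdog a fixed number of times
per timeout period (twice by default, as a safety margin); that factor
and the function name are assumptions, not something proposed in this
thread.

```python
def wakeups_per_hour(timeout_s, pets_per_timeout=2):
    """Wakeups needed per hour to keep the watchdog fed.

    timeout_s bounds the worst-case stall; pets_per_timeout is an
    assumed safety margin (how many pets fit into one timeout period).
    """
    return 3600 * pets_per_timeout / timeout_s


for timeout_s in (1.0, 10.0, 60.0):
    print(f"timeout {timeout_s:5.1f}s -> "
          f"{wakeups_per_hour(timeout_s):7.0f} wakeups/hour")
```

Halving the acceptable stall doubles the wakeup rate, so a timeout
short enough to matter under memory pressure gets expensive quickly on
a battery-powered device.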

quoted

3. What if oomd/lmkd gets stuck for some memory-unrelated reason and
can't feed the watchdog? In such a scenario the kernel would assume
that it is stuck due to memory shortages and would go on a killing
spree.

This is correct but IMHO a killing spree is not worse than oomd/lmkd
getting stuck for some other reason.

quoted

If there is a sure way to identify when a process gets stuck
due to memory shortages then this could work better.

Hmm, are you suggesting looking at the stack traces of the userspace
oom-killer, or at some metrics related to the oom-killer? That would
complicate the code.

Well, I don't know of a sure and easy way to identify the reasons for
process blockage but maybe there is one I don't know of? My point is
that we would need some additional indications of memory being the
culprit for the process blockage before resorting to kill.

quoted

4. Additional complexity of keeping the list of potential victims in
the kernel. Maybe we can simply reuse oom_score to choose the best
victims?

Your point about additional complexity is correct. Regarding oom_score,
I think you meant oom_score_adj. I would avoid putting more
policies/complexity in the kernel, but I got your point that the
simplest watchdog might not be helpful at all.
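
For reference, the ranking that reusing oom_score would give can be
inspected from userspace today. The sketch below reads the existing
/proc/<pid>/oom_score files (the kernel-reported score, which already
factors in oom_score_adj) and sorts processes by it. The function name
and the proc_root parameter are illustrative assumptions; this is not
how an in-kernel watchdog would pick victims.

```python
import os


def rank_by_oom_score(proc_root="/proc"):
    """Return [(pid, oom_score)] sorted from most to least likely victim."""
    ranked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "oom_score")) as f:
                ranked.append((int(entry), int(f.read().strip())))
        except (OSError, ValueError):
            continue  # process exited or the file was unreadable
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


if __name__ == "__main__":
    for pid, score in rank_by_oom_score()[:5]:
        print(pid, score)
```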

quoted

Thanks,
Suren.

quoted

I would prefer the direction which oomd and lmkd are open to adopt.

Any suggestions?
