Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

From: 程垲涛 Chengkaitao Cheng <hidden>
Date: 2022-12-01 07:49:14
Also in: cgroups, linux-fsdevel, linux-mm, lkml

At 2022-12-01 07:29:11, "Roman Gushchin" [off-list ref] wrote:

On Wed, Nov 30, 2022 at 03:01:58PM +0800, chengkaitao wrote:

From: chengkaitao <redacted>

We created a new interface <memory.oom.protect> for memory. If the OOM
killer is invoked under a parent memory cgroup, and the memory usage of a
child cgroup is within its effective oom.protect boundary, the cgroup's
tasks won't be OOM killed unless there are no unprotected tasks in other
children cgroups. It draws on the logic of <memory.min/low> in the
inheritance relationship.
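
To make the selection rule above concrete, here is a minimal sketch in Python. It is a simplified model, not the code from the patch: the `Child` type, its `oom_protect` field (standing in for the effective oom.protect value after inheritance), and `oom_candidates` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Child:
    name: str
    usage: int        # bytes currently charged to the child cgroup
    oom_protect: int  # hypothetical effective oom.protect value, in bytes


def oom_candidates(children):
    """Return the child cgroups whose tasks may be chosen as OOM victims."""
    # A child is protected while its usage stays within its boundary.
    unprotected = [c for c in children if c.usage > c.oom_protect]
    # Protected children are only considered when nothing else is left.
    return unprotected if unprotected else list(children)
```

With one child inside its boundary and one outside it, only the second is returned; if every child is inside its boundary, all of them become candidates again.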

It has the following advantages:
1. We have the ability to protect more important processes when a
memcg's OOM killer is invoked. The oom.protect only takes effect in the
local memcg, and does not affect the OOM killer of the host.
2. Historically, we could often use oom_score_adj to control a group of
processes. It requires that all processes in the cgroup have a
common parent process, and we have to set the common parent process's
oom_score_adj before it forks all child processes, so it is
very difficult to apply in other situations. Now oom.protect has no
such restrictions, and we can protect a cgroup of processes more easily. The
cgroup can keep some memory, even if the OOM killer has to be called.
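
For reference, the oom_score_adj workflow described in point 2 can be sketched as follows. This assumes a Linux host exposing /proc/self/oom_score_adj; the helper names are hypothetical. The value is inherited by children created with fork, which is why the common parent has to be adjusted before it forks.

```python
OOM_SCORE_ADJ_MIN = -1000  # kernel-defined lower bound
OOM_SCORE_ADJ_MAX = 1000   # kernel-defined upper bound


def oom_score_adj_payload(value):
    """Validate an adjustment and return the text to write to procfs."""
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError(f"oom_score_adj out of range: {value}")
    return str(value)


def set_oom_score_adj(value, path="/proc/self/oom_score_adj"):
    """Set the adjustment for the calling process.

    Lowering the value normally needs CAP_SYS_RESOURCE. Children forked
    after this call inherit the new value; existing processes do not.
    """
    with open(path, "w") as f:
        f.write(oom_score_adj_payload(value))
```

A launcher would call `set_oom_score_adj(-500)` once and then fork its workers; processes that join the cgroup from elsewhere are not covered, which is the limitation the patch description points at.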

It reminds me of our attempts to provide a more sophisticated cgroup-aware oom
killer.

As you said, I also like simple strategies and concise code very much, so in
the design of oom.protect we reuse the evaluation method of oom_score and
draw on the logic of <memory.min/low> in the inheritance relationship.
memory.min/low have been proven over a long time. I did it this way to
reduce the burden on the kernel.

The problem is that the decision which process(es) to kill or preserve
is individual to a specific workload (and can be even time-dependent
for a given workload).

It is correct to kill a process with a high workload, but it may not be the
most appropriate choice. I think the specific process to kill needs to be
decided by the user. I think that was the original intention of the
score_adj design.

So it's really hard to come up with an in-kernel
mechanism which is at the same time flexible enough to work for the majority
of users and reliable enough to serve as the last oom resort measure (which
is the basic goal of the kernel oom killer).

Our goal is to find a method that is less intrusive to the existing
mechanisms of the kernel, and to provide a more reasonable supplement
or alternative that addresses the limitations of score_adj.

Previously the consensus was to keep the in-kernel oom killer dumb and reliable
and implement complex policies in userspace (e.g. systemd-oomd etc).

Is there a reason why such an approach can't work in your case?

I think that as kernel developers, we should try our best to provide
users with simpler and more powerful interfaces. It is clear that the
current oom score mechanism has many limitations. Users need to
run a lot of timed polling loops to accomplish work similar
to the oom score mechanism, or develop a new mechanism just to
bypass the imperfect oom score mechanism. This is inefficient,
forced behavior.
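
As an illustration of the kind of timed polling loop a userspace policy ends up with, here is a minimal sketch. It assumes cgroup v2, where each cgroup directory has a memory.events file made of "key value" lines, including an oom_kill counter; the function names and the path passed in are hypothetical.

```python
import time


def parse_memory_events(text):
    """Parse the flat "key value" lines of a cgroup v2 memory.events file."""
    events = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            events[key] = int(value)
    return events


def poll_oom_kills(events_path, interval_s=1.0, rounds=3):
    """Yield the number of new oom_kill events seen in each polling round."""
    last = None
    for _ in range(rounds):
        with open(events_path) as f:
            current = parse_memory_events(f.read()).get("oom_kill", 0)
        if last is not None:
            yield current - last
        last = current
        time.sleep(interval_s)
```

A daemon would run this per cgroup and apply its own policy on top, which is the repeated work in userspace that the paragraph above objects to.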

Thanks for your comment!
