Thread: 79 messages, 8 authors, 2017-10-02

Re: [v8 0/4] cgroup-aware OOM killer

From: Tejun Heo <tj@kernel.org>
Date: 2017-09-22 21:05:26
Also in: linux-mm, lkml

Hello,

On Fri, Sep 22, 2017 at 01:39:55PM -0700, David Rientjes wrote:
> Current heuristic based on processes is coupled with per-process
> /proc/pid/oom_score_adj.  The proposed
> heuristic has no ability to be influenced by userspace, and it needs one.
> The proposed heuristic based on memory cgroups coupled with Roman's
> per-memcg memory.oom_priority is appropriate and needed.  It is not

So, this is where we disagree.  I don't think it's a good design.

> "sophisticated intelligence," it merely allows userspace to protect vital
> memory cgroups when opting into the new features (cgroups compared based
> on size and memory.oom_group) that we very much want.

which can't achieve that goal very well for wide variety of users.

> > We even change the whole scheduling behaviors and try really hard to
> > not get locked into specific implementation details which exclude
> > future improvements.  Guaranteeing OOM killing selection would be
> > crazy.  Why would we prevent ourselves from doing things better in the
> > future?  We aren't talking about the semantics of read(2) here.  This
> > is a kernel emergency mechanism to avoid deadlock at the last moment.
>
> We merely want to prefer other memory cgroups are oom killed on system oom
> conditions before important ones, regardless if the important one is using
> more memory than the others because of the new heuristic this patchset
> introduces.  This is exactly the same as /proc/pid/oom_score_adj for the
> current heuristic.

You were arguing that we should lock into a specific heuristics and
guarantee the same behavior.  We shouldn't.

When we introduce a user visible interface, we're making a lot of
promises.  My point is that we need to be really careful when making
those promises.

> If you have this low priority maintenance job charging memory to the high
> priority hierarchy, you're already misconfigured unless you adjust
> /proc/pid/oom_score_adj because it will oom kill any larger process than
> itself in today's kernels anyway.
>
> A better configuration would be attach this hypothetical low priority
> maintenance job to its own sibling cgroup with its own memory limit to
> avoid exactly that problem: it going berserk and charging too much memory
> to the high priority container that results in one of its processes
> getting oom killed.

And how do you guarantee that across delegation boundaries?  The
points you raise on why the priority should be applied level-by-level
are exactly the same points why this doesn't really work.  OOM killing
priority isn't something which can be distributed across cgroup
hierarchy level-by-level.  The resulting decision tree doesn't make
any sense.

I'm not against adding something which works but strict level-by-level
comparison isn't the solution.

Thanks.

-- 
tejun