Re: [v8 0/4] cgroup-aware OOM killer
From: Tejun Heo <tj@kernel.org>
Date: 2017-09-22 21:05:26
Also in: linux-mm, lkml
Hello,
On Fri, Sep 22, 2017 at 01:39:55PM -0700, David Rientjes wrote:
> Current heuristic based on processes is coupled with per-process /proc/pid/oom_score_adj. The proposed heuristic has no ability to be influenced by userspace, and it needs one. The proposed heuristic based on memory cgroups coupled with Roman's per-memcg memory.oom_priority is appropriate and needed. It is not
So, this is where we disagree. I don't think it's a good design.
> "sophisticated intelligence," it merely allows userspace to protect vital memory cgroups when opting into the new features (cgroups compared based on size and memory.oom_group) that we very much want.
which can't achieve that goal very well for a wide variety of users.
> > We even change the whole scheduling behaviors and try really hard to not get locked into specific implementation details which exclude future improvements. Guaranteeing OOM killing selection would be crazy. Why would we prevent ourselves from doing things better in the future? We aren't talking about the semantics of read(2) here. This is a kernel emergency mechanism to avoid deadlock at the last moment.
> We merely want to prefer other memory cgroups are oom killed on system oom conditions before important ones, regardless if the important one is using more memory than the others because of the new heuristic this patchset introduces. This is exactly the same as /proc/pid/oom_score_adj for the current heuristic.
You were arguing that we should lock into a specific heuristic and guarantee the same behavior. We shouldn't. When we introduce a user-visible interface, we're making a lot of promises. My point is that we need to be really careful when making those promises.
> If you have this low priority maintenance job charging memory to the high priority hierarchy, you're already misconfigured unless you adjust /proc/pid/oom_score_adj because it will oom kill any larger process than itself in today's kernels anyway. A better configuration would be attach this hypothetical low priority maintenance job to its own sibling cgroup with its own memory limit to avoid exactly that problem: it going berserk and charging too much memory to the high priority container that results in one of its processes getting oom killed.
And how do you guarantee that across delegation boundaries? The points you raise on why the priority should be applied level-by-level are exactly the same points why this doesn't really work. OOM killing priority isn't something which can be distributed across the cgroup hierarchy level-by-level. The resulting decision tree doesn't make any sense.
I'm not against adding something which works, but strict level-by-level comparison isn't the solution.
Thanks.
--
tejun