Thread: 79 messages, 8 authors, 2017-10-02

Re: [v8 0/4] cgroup-aware OOM killer

From: Michal Hocko <mhocko@kernel.org>
Date: 2017-09-25 12:24:05
Also in: linux-mm, lkml

I would really appreciate some feedback from Tejun, Johannes here.

On Wed 20-09-17 14:53:41, Roman Gushchin wrote:
On Mon, Sep 18, 2017 at 08:14:05AM +0200, Michal Hocko wrote:
On Fri 15-09-17 08:23:01, Roman Gushchin wrote:
On Fri, Sep 15, 2017 at 12:58:26PM +0200, Michal Hocko wrote:
[...]
But then you just enforce a structural restriction on your configuration
because
        root
        /  \
       A    D
      / \
     B   C

is a different thing than
        root
       / | \
      B  C  D
I actually don't have a strong argument against an approach to select
largest leaf or kill-all-set memcg. I think, in practice there will be
not much difference.
I've tried to implement this approach, and it's really arguable.
Although your example looks reasonable, the opposite example is also valid:
you might want to compare whole hierarchies, and it's a quite typical usecase.

Assume, you have several containerized workloads on a machine (probably,
each will be contained in a memcg with memory.max set), with some hierarchy
of cgroups inside. Then in case of global memory shortage we want to reclaim
some memory from the biggest workload, and the selection should not depend
on group_oom settings. It would be really strange if setting group_oom
increased the chances of being killed.

In other words, let's imagine processes as leaf nodes in memcg tree. We decided
to select the biggest memcg and kill one or more processes inside (depending
on group_oom setting), but the memcg selection doesn't depend on it.
We do not compare processes from different cgroups, nor cgroups with
processes. The same should apply to cgroups: why do we want to compare cgroups
from different sub-trees?
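
For illustration only, the selection described above could be sketched
roughly as follows. This is not the patchset's code: the Memcg class, its
field names and the sizes are all made up, and group_oom is modelled as a
plain boolean that only changes the kill action, not the memcg selection.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Memcg:
    # Hypothetical model of a memcg; not the kernel's data structure.
    name: str
    procs: List[int] = field(default_factory=list)  # per-process usage, arbitrary units
    children: List["Memcg"] = field(default_factory=list)
    group_oom: bool = False

    def usage(self) -> int:
        # Hierarchical usage: own processes plus everything below.
        return sum(self.procs) + sum(c.usage() for c in self.children)


def select_memcg(root: Memcg) -> Memcg:
    # Level-by-level descent: only siblings are compared with each other.
    node = root
    while node.children:
        node = max(node.children, key=lambda c: c.usage())
    return node


def oom_kill(root: Memcg) -> Tuple[str, List[int]]:
    victim = select_memcg(root)
    if not victim.procs:
        return victim.name, []
    # group_oom changes what gets killed, not which memcg is chosen.
    killed = list(victim.procs) if victim.group_oom else [max(victim.procs)]
    return victim.name, killed


def build(group_oom: bool) -> Memcg:
    w1a = Memcg("w1a", procs=[30, 10], group_oom=group_oom)
    w1b = Memcg("w1b", procs=[20])
    w1 = Memcg("workload1", children=[w1a, w1b])
    w2 = Memcg("workload2", procs=[50])
    return Memcg("root", children=[w1, w2])


print(oom_kill(build(group_oom=False)))
print(oom_kill(build(group_oom=True)))
```

In both runs the same memcg (w1a) is chosen, because workload1 as a whole
is bigger than workload2; only the list of killed processes differs.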

While size-based comparison can be implemented with this approach,
the priority-based one is really weird (as David mentioned).
If priorities have no hierarchical meaning at all, we lack the very important
ability to enforce hierarchy oom_priority. Otherwise we have to invent some
complex rules of oom_priority propagation (e.g. if someone raises
the oom_priority in a parent, should it be applied to children immediately, etc).
I would really forget about the priority at this stage. This needs
much more thought, and I consider David's usecase too specialized
to use as a template for general-purpose oom prioritization.
I might be wrong here of course...
The oom_group knob's meaning also becomes more complex. It affects both
the victim selection and the OOM action. _ANY_ mechanism which allows affecting
OOM victim selection (either priorities or a bpf-based approach) should
not have a global system-wide meaning; it breaks everything.

I do understand your point, but the same is true for other stuff, right?
E.g. cpu time distribution (and io, etc) depends on hierarchy configuration.
It's a limitation, but it's ok, as the user should create a hierarchy which
reflects some logical relations between processes and groups of processes.
Otherwise we're heading into configuration hell.
And that is _exactly_ my concern. We surely do not want to tell people that
they have to consider their cgroup tree structure to control the global
oom behavior. You simply do not have that constraint with leaf-only
semantics, and if kill-all intermediate nodes are used then there is an
explicit opt-in for the hierarchy considerations.
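
A rough sketch of what the leaf-only semantics with a kill-all opt-in could
look like, using the two trees from earlier in the thread. The Node class,
the kill_all flag and the sizes are invented for the example, and this is
not how any posted patch implements it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    # Hypothetical memcg model for illustration only.
    name: str
    own: int = 0  # memory charged directly to this memcg, arbitrary units
    children: List["Node"] = field(default_factory=list)
    kill_all: bool = False

    def size(self) -> int:
        return self.own + sum(c.size() for c in self.children)


def candidates(node: Node, is_root: bool = True) -> List[Node]:
    # A memcg competes as one unit if it is a leaf or has kill_all set.
    # A kill_all memcg hides its descendants from the comparison.
    if not is_root and (node.kill_all or not node.children):
        return [node]
    found: List[Node] = []
    for child in node.children:
        found.extend(candidates(child, is_root=False))
    return found


def victim(root: Node) -> str:
    return max(candidates(root), key=lambda n: n.size()).name


def nested(kill_all_on_a: bool = False) -> Node:
    a = Node("A", children=[Node("B", 30), Node("C", 30)], kill_all=kill_all_on_a)
    return Node("root", children=[a, Node("D", 40)])


flat = Node("root", children=[Node("B", 30), Node("C", 30), Node("D", 40)])

print(victim(nested()), victim(flat), victim(nested(kill_all_on_a=True)))
```

Without kill_all the nested and the flat tree pick the same victim (D), so
the tree shape does not matter. Only after the explicit opt-in on A does A
compete as a whole and win with its combined size.
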
In any case, OOM is a last resort mechanism. The goal is to reclaim some memory
without crashing the system or leaving it in a totally broken state.
Any really complex mm in userspace should be applied _before_ OOM happens.
So, I don't think we have to support all possible configurations here,
if we're able to achieve the main goal (kill some processes and do not leave
broken systems/containers).
True, but we want to have the semantics reasonably understandable. And it
is quite hard to explain that the oom killer hasn't selected the largest
memcg just because it happened to be in a deeper hierarchy which has
been configured to cover a different resource.
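
To make that concrete, here is a small worked example under made-up
numbers. The nested-dict tree, the cgroup names and both helper functions
are hypothetical; they only model the two comparison rules being discussed.

```python
# Leaves are ints (usage in arbitrary units); inner memcgs are dicts.
tree = {
    "svc": {"tier": {"db": 40}},      # deeper hierarchy, largest single memcg
    "batch": {"j1": 25, "j2": 25},    # shallower, but larger in total
}


def total(node):
    # Hierarchical usage of a subtree.
    return node if isinstance(node, int) else sum(total(c) for c in node.values())


def pick_level_by_level(tree):
    # Compare siblings only, descending into the biggest subtree each time.
    path, node = [], tree
    while isinstance(node, dict):
        name = max(node, key=lambda k: total(node[k]))
        path.append(name)
        node = node[name]
    return "/".join(path), node


def pick_largest_leaf(tree, prefix=""):
    # Compare all leaf memcgs globally, ignoring the tree shape.
    best = None
    for name, child in tree.items():
        path = prefix + name
        if isinstance(child, int):
            cand = (path, child)
        else:
            cand = pick_largest_leaf(child, path + "/")
        if cand is not None and (best is None or cand[1] > best[1]):
            best = cand
    return best


print(pick_level_by_level(tree))
print(pick_largest_leaf(tree))
```

The level-by-level rule ends up in batch/j1 (25 units) because batch is
bigger than svc in total, while the largest leaf memcg is svc/tier/db
(40 units).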

I am sorry to repeat myself and I will not argue if there is a
prevalent agreement that level-by-level comparison is considered
desirable and documented behavior but, by all means, do not define these
semantics based on priority requirements and/or implementation details.
-- 
Michal Hocko
SUSE Labs
