Thread (42 messages) 42 messages, 8 authors, 2020-09-09

Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control

From: Michal Hocko <mhocko@suse.com>
Date: 2020-09-09 15:14:45
Also in: cgroups, linux-fsdevel, linux-mm, lkml

[Sorry, this slipped through cracks]

On Mon 24-08-20 12:58:50, Johannes Weiner wrote:
On Fri, Aug 21, 2020 at 09:37:16PM +0200, Peter Zijlstra wrote:
[...]
quoted
Arguably seeing the rate drop to near 0 is a very good point to consider
running cgroup-OOM.
Agreed. In the past, that's actually what we did: In cgroup1, you
could disable the kernel OOM killer, and when reclaim failed at the
limit, the allocating task would be put on a waitqueue until woken up
by a freeing event. Conceptually this is clean & straight-forward.

However,

1. Putting allocation contexts with unknown locks to indefinite sleep
   caused deadlocks, for obvious reasons. Userspace OOM killing tends
   to take a lot of task-specific locks when scanning through /proc
   files for kill candidates, and can easily get stuck.

   Using bounded over indefinite waits is simply acknowledging that
   the deadlock potential when connecting arbitrary task stacks in the
   system through free->alloc ordering is equally difficult to plan
   out as alloc->free ordering.

   The non-cgroup OOM killer actually has the same deadlock potential,
   where the allocating/killing task can hold resources that the OOM
   victim requires to exit. The OOM reaper hides it, the static
   emergency reserves hide it - but to truly solve this problem, you
   would have to have full knowledge of memory & lock ordering
   dependencies of those tasks. And then can still end up with
   scenarios where the only answer is panic().
Yes. Even killing all eligible tasks is not guaranteed to help the
situation because a) resources might be not bound to a process life time
(e.g. tmpfs) or ineligible task might be holding resources that block
others to do the proper cleanup. OOM reaper is here to make sure we
reclaim some of the address space of the victim and we go over all
eligible tasks rather than getting stuck at the first victim for ever.
 
2. I don't recall ever seeing situations in cgroup1 where the precise
   matching of allocation rate to freeing rate has allowed cgroups to
   run sustainably after reclaim has failed. The practical benefit of
   a complicated feedback loop over something crude & robust once
   we're in an OOM situation is not apparent to me.
Yes, this is usually go OOM and kill something. Running on a very edge
of the (memcg) oom doesn't tend to be sustainable and I am not sure it
makes sense to optimize for.
   [ That's different from the IO-throttling *while still doing
     reclaim* that Dave brought up. *That* justifies the same effort
     we put into dirty throttling. I'm only talking about the
     situation where reclaim has already failed and we need to
     facilitate userspace OOM handling. ]

So that was the motivation for the bounded sleeps. They do not
guarantee containment, but they provide a reasonable amount of time
for the userspace OOM handler to intervene, without deadlocking.
Yes, memory.high is mostly a best effort containment. We do have the
hard limit to put a stop on runaways or if you are watching for PSI then
the high limit throttling would give you enough idea to take an action
from the userspace.
That all being said, the semantics of the new 'high' limit in cgroup2
have allowed us to move reclaim/limit enforcement out of the
allocation context and into the userspace return path.

See the call to mem_cgroup_handle_over_high() from
tracehook_notify_resume(), and the comments in try_charge() around
set_notify_resume().

This already solves the free->alloc ordering problem by allowing the
allocation to exceed the limit temporarily until at least all locks
are dropped, we know we can sleep etc., before performing enforcement.

That means we may not need the timed sleeps anymore for that purpose,
and could bring back directed waits for freeing-events again.

What do you think? Any hazards around indefinite sleeps in that resume
path? It's called before __rseq_handle_notify_resume and the
arch-specific resume callback (which appears to be a no-op currently).

Chris, Michal, what are your thoughts? It would certainly be simpler
conceptually on the memcg side.
I would need a more specific description. But as I've already said. It
doesn't seem that we are in a need to fix any practical problem here.
High limit implementation has changed quite a lot recently. I would
rather see it settled for a while and see how it behaves in wider
variety of workloads before changing the implementation again.

-- 
Michal Hocko
SUSE Labs
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help