Re: [PATCH RFC] memcg: close the race window between OOM detection and killing

From: Michal Hocko <hidden>
Date: 2015-06-05 14:35:37
Also in: linux-mm

On Fri 05-06-15 04:29:36, Tejun Heo wrote:

Hello, Michal.

On Thu, Jun 04, 2015 at 11:30:31AM +0200, Michal Hocko wrote:

quoted

Hmmm?  In -mm, if __alloc_pages_may_oom() fails trylock, it never calls
out_of_memory().

Sure but the oom_lock might be free already. out_of_memory doesn't wait
for the victim to finish. It just does schedule_timeout_killable.

That doesn't matter because the detection and TIF_MEMDIE assertion are
atomic w.r.t. oom_lock and TIF_MEMDIE essentially extends the locking
by preventing further OOM kills.  Am I missing something?

This is true but TIF_MEMDIE releasing is not atomic wrt. the allocation
path. So the oom victim could have released memory and dropped
TIF_MEMDIE but the allocation path hasn't noticed that because it has
already passed
        /*
         * Go through the zonelist yet one more time, keep very high watermark
         * here, this is only to catch a parallel oom killing, we must fail if
         * we're still under heavy pressure.
         */
        page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
                                        ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);

and goes on to kill another task because there is no TIF_MEMDIE
anymore.
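
The interleaving described above can be replayed in a small userspace toy
model. This is only a sketch under stated assumptions: every name in it
(toy_mm, toy_last_chance_alloc, toy_victim_exit, toy_out_of_memory,
toy_replay_race) is hypothetical and not a kernel API, and the single-threaded
replay fixes one ordering by hand instead of reproducing real concurrency.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model; none of these names are kernel APIs. */
struct toy_mm {
	int free_pages;      /* pages currently available */
	bool victim_memdie;  /* stands in for TIF_MEMDIE on the current victim */
	int kills;           /* OOM kills issued so far */
};

/* Stand-in for the last-chance get_page_from_freelist() attempt. */
bool toy_last_chance_alloc(struct toy_mm *mm, int pages)
{
	assert(pages > 0);
	if (mm->free_pages < pages)
		return false;
	mm->free_pages -= pages;
	return true;
}

/* The victim exits: it frees its memory and drops the flag. */
void toy_victim_exit(struct toy_mm *mm, int freed)
{
	mm->free_pages += freed;
	mm->victim_memdie = false;
}

/* Stand-in for out_of_memory(): kill only if no victim is pending. */
void toy_out_of_memory(struct toy_mm *mm)
{
	if (mm->victim_memdie)
		return;
	mm->victim_memdie = true;
	mm->kills++;
}

/* Replays the problematic ordering and returns the number of kills. */
int toy_replay_race(void)
{
	struct toy_mm mm = { .free_pages = 0, .victim_memdie = false, .kills = 0 };
	bool got_page;

	toy_out_of_memory(&mm);                    /* first victim selected */
	got_page = toy_last_chance_alloc(&mm, 1);  /* fails, nothing freed yet */
	toy_victim_exit(&mm, 8);                   /* memory freed, flag dropped */
	if (!got_page)
		toy_out_of_memory(&mm);            /* acts on the stale failure */
	return mm.kills;
}
```

In this model toy_replay_race() reports two kills even though eight pages are
free by the time the second kill is issued, because the allocation result was
sampled before the victim exited.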

quoted

The main difference here is that the alloc path does the whole thing
synchronously and thus the OOM detection and killing can be put in the
same critical section which isn't the case for the memcg OOM handling.

This is true but there is still a time window between the last
allocation attempt and out_of_memory when the OOM victim might have
exited and another task would be selected.

Please see above.

quoted

This is not the only reason. In-kernel memcg oom handling needs it
as well. See 3812c8c8f395 ("mm: memcg: do not trap chargers with
full callstack on OOM"). In fact it was the in-kernel case which has
triggered this change. We simply cannot wait for oom with the stack and
all the state the charge is called from.

Why should this be any different from OOM handling from page allocator
tho?

Yes the global OOM is prone to deadlock. This has been discussed a lot
and we still do not have a good answer for that. The primary problem
is that small allocations do not fail and retry indefinitely so an OOM
victim might be blocked on a lock held by a task which is the allocator.
This is less likely and harder to trigger with standard loads than in
memcg environment though.
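
The deadlock pattern in the paragraph above can be illustrated with a toy
state machine. This is a sketch under stated assumptions, not kernel code: the
names (toy_deadlock, toy_deadlock_step, toy_victim_exits) are hypothetical,
"lock" is a plain flag, and the allocator is modelled as a task that retries a
small allocation forever while holding a lock the victim needs in order to
exit.

```c
#include <stdbool.h>

/* Toy model; all names are hypothetical and unrelated to kernel APIs. */
struct toy_deadlock {
	bool allocator_holds_lock; /* allocator retries while holding the lock */
	bool victim_exited;        /* has the OOM victim managed to exit? */
	int free_pages;            /* pages available to the allocator */
};

/* One scheduling round for both tasks. */
void toy_deadlock_step(struct toy_deadlock *s)
{
	/* The victim needs the lock before it can exit and free memory. */
	if (!s->victim_exited && !s->allocator_holds_lock) {
		s->victim_exited = true;
		s->free_pages += 8;
	}

	/*
	 * The small allocation never fails: the allocator keeps the lock
	 * and retries until a page shows up, then releases the lock.
	 */
	if (s->allocator_holds_lock && s->free_pages > 0) {
		s->free_pages--;
		s->allocator_holds_lock = false;
	}
}

/* Runs a bounded number of rounds; reports whether the victim ever exits. */
bool toy_victim_exits(bool allocator_holds_lock, int rounds)
{
	struct toy_deadlock s = {
		.allocator_holds_lock = allocator_holds_lock,
		.victim_exited = false,
		.free_pages = 0,
	};
	int i;

	for (i = 0; i < rounds; i++)
		toy_deadlock_step(&s);
	return s.victim_exited;
}
```

With the lock held by the retrying allocator, neither task changes state in
any round, so the victim never exits; without the lock dependency the victim
exits in the first round.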

Deadlocks from infallible allocations getting interlocked are
different.  OOM killer can't really get around that by itself but I'm
not talking about those deadlocks but at the same time they're a lot
less likely.  It's about OOM victim trapped in a deadlock failing to
release memory because someone else is waiting for that memory to be
released while blocking the victim.

I thought those would be in the allocator context - which was the
example I've provided. What kind of context do you have in mind?

Sure, the two issues are related
but once you solve things getting blocked on single OOM victim, it
becomes a lot less of an issue.

quoted

There have been suggestions to add an OOM timeout and ignore the
previous OOM victim after the timeout expires and select a new
victim. This sounds attractive but this approach has its own problems
(http://marc.info/?l=linux-mm&m=141686814824684&w=2).

Here are the issues the message lists

Let's focus on discussing those points in reply to Johannes' email. AFAIU
your notes are very much in line with his.
-- 
Michal Hocko
SUSE Labs
