Re: [PATCH RFC] memcg: close the race window between OOM detection and killing
From: Michal Hocko <hidden>
Date: 2015-06-05 14:35:37
Also in:
linux-mm
On Fri 05-06-15 04:29:36, Tejun Heo wrote:
> Hello, Michal.
>
> On Thu, Jun 04, 2015 at 11:30:31AM +0200, Michal Hocko wrote:
> > > Hmmm? In -mm, if __alloc_pages_may_oom() fails trylock, it never
> > > calls out_of_memory().
> >
> > Sure but the oom_lock might be free already. out_of_memory doesn't
> > wait for the victim to finish. It just does schedule_timeout_killable.
>
> That doesn't matter because the detection and TIF_MEMDIE assertion are
> atomic w.r.t. oom_lock and TIF_MEMDIE essentially extends the locking
> by preventing further OOM kills. Am I missing something?

This is true but TIF_MEMDIE releasing is not atomic wrt. the allocation
path. So the oom victim could have released memory and dropped
TIF_MEMDIE but the allocation path hasn't noticed that because it has
already passed
	/*
	 * Go through the zonelist yet one more time, keep very high watermark
	 * here, this is only to catch a parallel oom killing, we must fail if
	 * we're still under heavy pressure.
	 */
	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
					ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
and goes on to kill another task because no task has TIF_MEMDIE set
anymore.

> > > The main difference here is that the alloc path does the whole thing
> > > synchronously and thus the OOM detection and killing can be put in
> > > the same critical section which isn't the case for the memcg OOM
> > > handling.
> >
> > This is true but there is still a time window between the last
> > allocation attempt and out_of_memory when the OOM victim might have
> > exited and another task would be selected.
>
> Please see above.

> > > > This is not the only reason. In-kernel memcg oom handling needs it
> > > > as well. See 3812c8c8f395 ("mm: memcg: do not trap chargers with
> > > > full callstack on OOM"). In fact it was the in-kernel case which
> > > > has triggered this change. We simply cannot wait for oom with the
> > > > stack and all the state the charge is called from.
> > >
> > > Why should this be any different from OOM handling from page
> > > allocator tho?
> >
> > Yes the global OOM is prone to deadlock. This has been discussed a lot
> > and we still do not have a good answer for that. The primary problem
> > is that small allocations do not fail and retry indefinitely so an OOM
> > victim might be blocked on a lock held by a task which is the
> > allocator. This is less likely and harder to trigger with standard
> > loads than in memcg environment though.
>
> Deadlocks from infallible allocations getting interlocked are
> different. OOM killer can't really get around that by itself but I'm
> not talking about those deadlocks but at the same time they're a lot
> less likely. It's about OOM victim trapped in a deadlock failing to
> release memory because someone else is waiting for that memory to be
> released while blocking the victim.

I thought those would be in the allocator context - which was the example I've provided. What kind of context do you have in mind?

> Sure, the two issues are related but once you solve things getting
> blocked on single OOM victim, it becomes a lot less of an issue.
>
> > There have been suggestions to add an OOM timeout and ignore the
> > previous OOM victim after the timeout expires and select a new
> > victim. This sounds attractive but this approach has its own problems
> > (http://marc.info/?l=linux-mm&m=141686814824684&w=2).
>
> Here are the issues the message lists

Let's focus on discussing those points in reply to Johannes' email.
AFAIU your notes are very much in line with his.
-- 
Michal Hocko
SUSE Labs