Re: memcg creates an unkillable task in 3.11-rc2
From: Michal Hocko <hidden>
Date: 2013-08-01 09:06:33
Also in:
lkml
Possibly related (same subject, not in this thread)
- 2013-11-12 · Re: memcg creates an unkillable task in 3.11-rc2 · Michal Hocko <hidden>
- 2013-09-27 · Re: memcg creates an unkillable task in 3.11-rc2 · Eric W. Biederman <hidden>
- 2013-09-26 · Re: memcg creates an unkillable task in 3.11-rc2 · Fabio Kung <hidden>
- 2013-09-09 · Re: memcg creates an unkillable task in 3.11-rc2 · Michal Hocko <hidden>
- 2013-09-06 · Re: memcg creates an unkillable task in 3.11-rc2 · Eric W. Biederman <hidden>
On Wed 31-07-13 15:09:16, Eric W. Biederman wrote:
Michal Hocko [off-list ref] writes:quoted
[I am CCing David here as well] On Tue 30-07-13 09:37:46, Eric W. Biederman wrote:quoted
Michal Hocko [off-list ref] writes:quoted
On Tue 30-07-13 01:19:31, Eric W. Biederman wrote: [...]quoted
Hmm. Looking farther I see what is going on. And it has nothing to do with the freezer. (I have commented out that code and reproduced it without the freezer to be doubly certain). On the exit path exit_robust_list is triggering a page fault to fault a page back in. Which since we have no memory causes the exit path to get stuck in mem_cgroup_handle_oom.Hmm, interesting. I assume the exit is caused by the SIGKILL, right? If yes, then why it hasn't coughed early in __mem_cgroup_try_chargeInteresting question. This isn't the primary thread but we do send SIGKILL to the secondary threads as well. We definitely need those checks on both paths making my change valid. Oh. Duh! This is after we act on SIGKILL so SIGKILL is no longer pending.Very well spotted Eric! What do you think about the following patch? I would have to check since when the exit path could trigger the fault but I guess this is worth stable backport.It doesn't have a prayer of working.
So it hasn't passed your test?
You leave open the race of a fatal signal being received before we go to sleep.
If a fatal signal is received before we're going to sleep then
schedule() should keep it on the runqueue, no?
static void __sched __schedule(void)
{
[...]
if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
if (unlikely(signal_pending_state(prev->state, prev))) {
prev->state = TASK_RUNNING;
} else {
so it should get a timeslice eventually, mem_cgroup_handle_oom sees
fatal_signal_pending and sets TIF_MEMDIE, bypass the charge, get to
signal handling, start exiting, fault in, get to charge and bail out in
__mem_cgroup_try_charge because it sees TIF_MEMDIE.
Or what am I missing?
You don't handle a task that has processed the fatal signal and is in PF_EXITING. Which is what I experienced. From earlier comments about my code not being early enough I thought I was going to see a patch in __mem_cgroup_try_change so that the bypass case will kick in also for tasks in PF_EXITING.
This shouldn't be necessary because TIF_MEMDIE was set for the killed task. I was playing with PF_EXITING there as well but TIF_MEMDIE sounds like a more appropriate solution.
You change actually addresses things later in the code path than mine does. I do like your summary of the problem. Ericquoted
--- From 411408558f2858328ea25e69567e9a53a8314032 Mon Sep 17 00:00:00 2001 From: Michal Hocko <redacted> Date: Wed, 31 Jul 2013 08:48:54 +0200 Subject: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM Eric has reported that he can see task(s) stuck in memcg OOM handler regularly. The only way out is to echo 0 > $GROUP/memory.oom_controll His usecase is: - Setup a hierarchy with memory and the freezer (disable kernel oom and have a process watch for oom). - In that memory cgroup add a process with one thread per cpu. - In one thread slowly allocate once per second I think it is 16M of ram and mlock and dirty it (just to force the pages into ram and stay there). - When oom is achieved loop: * attempt to freeze all of the tasks. * if frozen send every task SIGKILL, unfreeze, remove the directory in cgroupfs. Eric has then pinpointed the issue to be memcg specific. All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled. Those that have received fatal signal will bypass the charge and should continue on their way out. The tricky part is that that exit path might trigger a page fault (e.g. exit_robust_list) thus the memcg charge while its memcg is still under OOM because nobody has released any charges. Unlike with the in-kernel OOM handler the exiting task doesn't get TIF_MEMDIE set so it doesn't shortcut charges and falls to the memcg OOM again without any way out of it as there are no fatal signals pending anymore. This patch sets the TIF_MEMDIE flag pro actively in mem_cgroup_handle_oom if the memcg is disabled after the task is woken up with fatal signal pending. This means that any further charges will be bypassed early in __mem_cgroup_try_charge and the task will have chance to exit finally. Strictly speaking we might mark also a task which hasn't been killed by userspace OOM handler but this is not harmful as the task is going away anyway and under-oom group would like to see it go as soon as possible. Reported-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> Debugged-by: Eric W. Biederman [off-list ref] Signed-off-by: Michal Hocko <redacted> --- mm/memcontrol.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-)diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d12ca6f..d4103b0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c@@ -2235,8 +2235,19 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, mem_cgroup_unmark_under_oom(memcg); - if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) + if (test_thread_flag(TIF_MEMDIE)) return false; + + /* + * Userspace OOM killer might have killed this task but + * there is no way it could have set TIF_MEMDIE as well + * so we have to set it manually. + */ + if (fatal_signal_pending(current)) { + if (memcg->oom_kill_disable) + set_thread_flag(TIF_MEMDIE); + return false; + } /* Give chance to dying process */ schedule_timeout_uninterruptible(1); return true;-- 1.8.3.2-- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html
-- Michal Hocko SUSE Labs