Thread (3 messages) 3 messages, 2 authors, 2013-09-05

Re: memcg creates an unkillable task in 3.11-rc2

From: Michal Hocko <hidden>
Date: 2013-08-01 09:06:33
Also in: lkml

Possibly related (same subject, not in this thread)

On Wed 31-07-13 15:09:16, Eric W. Biederman wrote:
Michal Hocko [off-list ref] writes:
quoted
[I am CCing David here as well]

On Tue 30-07-13 09:37:46, Eric W. Biederman wrote:
quoted
Michal Hocko [off-list ref] writes:
quoted
On Tue 30-07-13 01:19:31, Eric W. Biederman wrote:
[...]
quoted
Hmm. Looking farther I see what is going on. And it has nothing to do
with the freezer. (I have commented out that code and reproduced it
without the freezer to be doubly certain).


On the exit path exit_robust_list is triggering a page fault to fault a
page back in.  Which since we have no memory causes the exit path
to get stuck in mem_cgroup_handle_oom.
Hmm, interesting. I assume the exit is caused by the SIGKILL, right?
If yes, then why it hasn't coughed early in __mem_cgroup_try_charge
Interesting question.  This isn't the primary thread but we do send
SIGKILL to the secondary threads as well.

We definitely need those checks on both paths making my change valid.

Oh. Duh!  This is after we act on SIGKILL so SIGKILL is no longer
pending.
Very well spotted Eric! What do you think about the following patch?
I would have to check since when the exit path could trigger the fault
but I guess this is worth stable backport.
It doesn't have a prayer of working.
So it hasn't passed your test?
You leave open the race of a fatal signal being received before we go to
sleep.
If a fatal signal is received before we're going to sleep then
schedule() should keep it on the runqueue, no?

static void __sched __schedule(void)
{
[...]
        if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
                if (unlikely(signal_pending_state(prev->state, prev))) {
                        prev->state = TASK_RUNNING;
                } else {

so it should get a timeslice eventually, mem_cgroup_handle_oom sees
fatal_signal_pending and sets TIF_MEMDIE, bypass the charge, get to
signal handling, start exiting, fault in, get to charge and bail out in
__mem_cgroup_try_charge because it sees TIF_MEMDIE.

Or what am I missing?
You don't handle a task that has processed the fatal signal and is in
PF_EXITING.  Which is what I experienced.

From earlier comments about my code not being early enough I thought I
was going to see a patch in __mem_cgroup_try_change so that the bypass
case will kick in also for tasks in PF_EXITING.
This shouldn't be necessary because TIF_MEMDIE was set for the killed
task.  I was playing with PF_EXITING there as well but TIF_MEMDIE sounds
like a more appropriate solution.
You change actually addresses things later in the code path than mine
does.

I do like your summary of the problem.

Eric
quoted
---
From 411408558f2858328ea25e69567e9a53a8314032 Mon Sep 17 00:00:00 2001
From: Michal Hocko <redacted>
Date: Wed, 31 Jul 2013 08:48:54 +0200
Subject: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM

Eric has reported that he can see task(s) stuck in memcg OOM handler
regularly. The only way out is to
	echo 0 > $GROUP/memory.oom_controll

His usecase is:
- Setup a hierarchy with memory and the freezer
  (disable kernel oom and have a process watch for oom).
- In that memory cgroup add a process with one thread per cpu.
- In one thread slowly allocate once per second I think it is 16M of ram
  and mlock and dirty it (just to force the pages into ram and stay there).
- When oom is achieved loop:
  * attempt to freeze all of the tasks.
  * if frozen send every task SIGKILL, unfreeze, remove the directory in
    cgroupfs.

Eric has then pinpointed the issue to be memcg specific.

All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled.
Those that have received fatal signal will bypass the charge and should
continue on their way out. The tricky part is that that exit path might
trigger a page fault (e.g. exit_robust_list) thus the memcg charge
while its memcg is still under OOM because nobody has released any
charges. Unlike with the in-kernel OOM handler the exiting task doesn't
get TIF_MEMDIE set so it doesn't shortcut charges and falls to the
memcg OOM again without any way out of it as there are no fatal signals
pending anymore.

This patch sets the TIF_MEMDIE flag pro actively in mem_cgroup_handle_oom
if the memcg is disabled after the task is woken up with fatal signal
pending. This means that any further charges will be bypassed early in
__mem_cgroup_try_charge and the task will have chance to exit finally.

Strictly speaking we might mark also a task which hasn't been killed by
userspace OOM handler but this is not harmful as the task is going away
anyway and under-oom group would like to see it go as soon as possible.

Reported-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Debugged-by: Eric W. Biederman [off-list ref]
Signed-off-by: Michal Hocko <redacted>
---
 mm/memcontrol.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d12ca6f..d4103b0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2235,8 +2235,19 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
 
 	mem_cgroup_unmark_under_oom(memcg);
 
-	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+	if (test_thread_flag(TIF_MEMDIE))
 		return false;
+
+	/*
+	 * Userspace OOM killer might have killed this task but
+	 * there is no way it could have set TIF_MEMDIE as well
+	 * so we have to set it manually.
+	 */
+	if (fatal_signal_pending(current)) {
+		if (memcg->oom_kill_disable)
+			set_thread_flag(TIF_MEMDIE);
+		return false;
+	}
 	/* Give chance to dying process */
 	schedule_timeout_uninterruptible(1);
 	return true;
-- 
1.8.3.2
--
To unsubscribe from this list: send the line "unsubscribe cgroups" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Michal Hocko
SUSE Labs
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help