Re: [PATCH 4/4] memcg: get rid of percpu_charge_mutex lock

From: Michal Hocko <hidden>
Date: 2011-08-09 07:26:19
Also in: lkml

On Tue 09-08-11 01:19:12, Johannes Weiner wrote:

On Mon, Aug 08, 2011 at 11:47:04PM +0200, Michal Hocko wrote:

quoted

On Mon 08-08-11 20:47:38, Johannes Weiner wrote:
[...]

quoted

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c

@@ -2071,7 +2071,6 @@ struct memcg_stock_pcp {
 #define FLUSHING_CACHED_CHARGE	(0)
 };
 static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
-static DEFINE_MUTEX(percpu_charge_mutex);
 
 /*
  * Try to consume stocked charge on this cpu. If success, one page is consumed

@@ -2178,7 +2177,8 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
 
 	for_each_online_cpu(cpu) {
 		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
-		if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
+		if (mem_cgroup_same_or_subtree(root_mem, stock->cached) &&
+				test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
 			flush_work(&stock->work);
 	}
 out:

This hunk triggers a crash for me, as the draining is already done and
stock->cached reset to NULL when dereferenced here.  Oops is attached.

Thanks for catching this. We are racing synchronous drain from
force_empty and async drain from reclaim, I guess.

It's at the end of a benchmark where several tasks delete the cgroups.
There is no reclaim going on anymore, it must be several sync drains
from force_empty of different memcgs.

I am afraid it is worse than that. Sync. drainer can kick itself really
trivial just by doing local draining. Then stock->cached is guaranteed
to be NULL when we reach this...

[...]

quoted

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f4ec4e7..626c916 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c

@@ -2197,8 +2197,10 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
 
 	for_each_online_cpu(cpu) {
 		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
-		if (mem_cgroup_same_or_subtree(root_mem, stock->cached) &&
-				test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
+		struct mem_cgroup *mem = stock->cached;
+		if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags) &&
+				 mem && mem_cgroup_same_or_subtree(root_mem, mem)
+				)
 			flush_work(&stock->work);
 	}
 out:

This ordering makes sure that mem is a sensible pointer, but it still
does not pin the object, *mem, which could go away the femtosecond
after the test_bit succeeds.

Yes there are possible races
	CPU0			CPU1				CPU2
mem = stock->cached
			stock->cached = NULL
			clear_bit
							    test_and_set_bit
test_bit()
<preempted>		...
			mem_cgroup_destroy
use after free

The race is not very probable because we are doing quite a bunch of work
before we can deallocate mem. mem_cgroup_destroy is called after
synchronize_rcu so we can close the race by rcu_read_lock, right?


[...]

quoted

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f4ec4e7..eca46141 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c

@@ -2179,17 +2179,23 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
 		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
 		struct mem_cgroup *mem;
 
+		/* Try to lock the cache */
+		if(test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
+			continue;
+
 		mem = stock->cached;
 		if (!mem || !stock->nr_pages)
-			continue;
+			goto unlock_cache;
 		if (!mem_cgroup_same_or_subtree(root_mem, mem))
-			continue;
-		if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
-			if (cpu == curcpu)
-				drain_local_stock(&stock->work);
-			else
-				schedule_work_on(cpu, &stock->work);
-		}
+			goto unlock_cache;

So one thread locks the cache, recognizes stock->cached is not in the
right hierarchy and unlocks again.  While the cache was locked, a
concurrent drainer with the right hierarchy skipped the stock because
it was locked.  That doesn't sound right.

You are right. We have to make it less parallel.

But yes, we probably need exclusive access to the stock state.

quoted

+
+		if (cpu == curcpu)
+			drain_local_stock(&stock->work);
+		else
+			schedule_work_on(cpu, &stock->work);
+		continue;
+unlock_cache:
+		clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);

                ^^^^^
		need a barrier?
 	}
 
 	if (!sync)

quoted

Without the mutex serializing this code, can't there be a concurrent
execution that leads to stock->cached being drained, becoming empty
and freed by someone else between the stock->nr_pages check and the
ancestor check, resulting in use after free?

What makes stock->cached safe to dereference?

We are using FLUSHING_CACHED_CHARGE as a lock for local draining. I
guess it should be sufficient.

mutex which was used previously caused that async draining was exclusive
so a root_mem that has potentially many relevant caches has to back off
because other mem wants to clear the cache on the same CPU.

It's now replaced by what is essentially a per-stock bit-spinlock that
is always trylocked.

Yes.

Would it make sense to promote it to a real spinlock?  Draining a
stock is pretty fast, there should be minimal lock hold times, but we
would still avoid that tiny race window where we would skip otherwise
correct stocks just because they are locked.

Yes, per-stock spinlock will be easiest way because if tried other games
with atomics we would endup with a more complicated refill_stock which
can race with draining as well.

quoted

I will think about this tomorrow (with fresh eyes). I think we should be
able to be without mutex.

The problem is that we have a multi-op atomic section, so we can not
go lockless.  We can read the stock state just fine, and order
accesses to different members so that we get a coherent image.  But
there is still nothing that pins the charge to the memcg, and thus
nothing that stabilizes *stock->cached.

I agree that we can probably do better than a global lock, though.

Fully agreed. I will send a patch after I give it some testing.

Thanks!
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help