Thread (35 messages) 35 messages, 3 authors, 2011-08-19

Re: [PATCH v3] memcg: add nr_pages argument for hierarchical reclaim

From: Michal Hocko <hidden>
Date: 2011-08-18 14:40:50
Also in: lkml

On Thu 18-08-11 15:58:21, Johannes Weiner wrote:
On Thu, Aug 18, 2011 at 02:57:54PM +0200, Michal Hocko wrote:
quoted
I have just realized that num_online_nodes should be much better than
MAX_NUMNODES. 
Just for reference, the patch is based on top of
https://lkml.org/lkml/2011/8/9/82 (it doesn't depend on it but it also
doesn't make much sense without it)

Changes since v2:
- use num_online_nodes rather than MAX_NUMNODES
Changes since v1:
- reclaim nr_nodes * SWAP_CLUSTER_MAX in mem_cgroup_force_empty
---
From: Michal Hocko <redacted>
Subject: memcg: add nr_pages argument for hierarchical reclaim

Now that we are doing memcg direct reclaim limited to nr_to_reclaim
pages (introduced by "memcg: stop vmscan when enough done.") we have to
be more careful. Currently we are using SWAP_CLUSTER_MAX which is OK for
most callers but it might cause failures for limit resize or force_empty
code paths on big NUMA machines.
The limit resizing path retries as long as reclaim makes progress, so
this is just handwaving.
limit resizing paths do not check the return value of
mem_cgroup_hierarchical_reclaim so the number of retries is not
affected. It is true that fixing that would be much easier.
After Kame's patch, the force-empty path has an increased risk of
failing to move huge pages to the parent, because it tries reclaim
only once.  This could need further evaluation, and possibly a fix.
Agreed
But instead:
quoted
@@ -2331,8 +2331,14 @@ static int mem_cgroup_do_charge(struct m
 	if (!(gfp_mask & __GFP_WAIT))
 		return CHARGE_WOULDBLOCK;
 
+	/*
+	 * We are lying about nr_pages because we do not want to
+	 * reclaim too much for THP pages which should rather fallback
+	 * to small pages.
+	 */
 	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
-					      gfp_mask, flags, NULL);
+					      gfp_mask, flags, NULL,
+					      1);
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
 		return CHARGE_RETRY;
 	/*
You tell it to reclaim _less_ than before, further increasing the risk
of failure...
quoted
@@ -2350,7 +2351,7 @@ unsigned long try_to_free_mem_cgroup_pag
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = !noswap,
-		.nr_to_reclaim = SWAP_CLUSTER_MAX,
+		.nr_to_reclaim = max_t(unsigned long, nr_pages, SWAP_CLUSTER_MAX),
...but wait, this transparently fixes it up and ignores the caller's
request.

Sorry, but this is just horrible!
Yes, I do not like it as well and tried to point it out in the comment.
Anyway I do agree that this doesn't solve the problem you are describing
above and the limit resizing paths can be fixed much easier so the patch
is pointless.
For the past weeks I have been chasing memcg bugs that came in with
sloppy and untested code, that was merged for handwavy reasons.
Yes, I feel big responsibility about that.
Changes to algorithms need to be tested and optimizations need to be
quantified in other parts of the VM and the kernel, too.  I have no
idea why this doesn't seem to apply to the memory cgroup subsystem.
Yes, we should definitely do better during review process.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help