Re: [PATCH v3] memcg: add nr_pages argument for hierarchical reclaim
From: Michal Hocko <hidden>
Date: 2011-08-18 14:40:50
Also in:
lkml
On Thu 18-08-11 15:58:21, Johannes Weiner wrote:
On Thu, Aug 18, 2011 at 02:57:54PM +0200, Michal Hocko wrote:quoted
I have just realized that num_online_nodes should be much better than MAX_NUMNODES. Just for reference, the patch is based on top of https://lkml.org/lkml/2011/8/9/82 (it doesn't depend on it but it also doesn't make much sense without it) Changes since v2: - use num_online_nodes rather than MAX_NUMNODES Changes since v1: - reclaim nr_nodes * SWAP_CLUSTER_MAX in mem_cgroup_force_empty --- From: Michal Hocko <redacted> Subject: memcg: add nr_pages argument for hierarchical reclaim Now that we are doing memcg direct reclaim limited to nr_to_reclaim pages (introduced by "memcg: stop vmscan when enough done.") we have to be more careful. Currently we are using SWAP_CLUSTER_MAX which is OK for most callers but it might cause failures for limit resize or force_empty code paths on big NUMA machines.The limit resizing path retries as long as reclaim makes progress, so this is just handwaving.
limit resizing paths do not check the return value of mem_cgroup_hierarchical_reclaim so the number of retries is not affected. It is true that fixing that would be much easier.
After Kame's patch, the force-empty path has an increased risk of failing to move huge pages to the parent, because it tries reclaim only once. This could need further evaluation, and possibly a fix.
Agreed
But instead:quoted
@@ -2331,8 +2331,14 @@ static int mem_cgroup_do_charge(struct m if (!(gfp_mask & __GFP_WAIT)) return CHARGE_WOULDBLOCK; + /* + * We are lying about nr_pages because we do not want to + * reclaim too much for THP pages which should rather fallback + * to small pages. + */ ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL, - gfp_mask, flags, NULL); + gfp_mask, flags, NULL, + 1); if (mem_cgroup_margin(mem_over_limit) >= nr_pages) return CHARGE_RETRY; /*You tell it to reclaim _less_ than before, further increasing the risk of failure...quoted
@@ -2350,7 +2351,7 @@ unsigned long try_to_free_mem_cgroup_pag .may_writepage = !laptop_mode, .may_unmap = 1, .may_swap = !noswap, - .nr_to_reclaim = SWAP_CLUSTER_MAX, + .nr_to_reclaim = max_t(unsigned long, nr_pages, SWAP_CLUSTER_MAX),...but wait, this transparently fixes it up and ignores the caller's request. Sorry, but this is just horrible!
Yes, I do not like it as well and tried to point it out in the comment. Anyway I do agree that this doesn't solve the problem you are describing above and the limit resizing paths can be fixed much easier so the patch is pointless.
For the past weeks I have been chasing memcg bugs that came in with sloppy and untested code, that was merged for handwavy reasons.
Yes, I feel big responsibility about that.
Changes to algorithms need to be tested and optimizations need to be quantified in other parts of the VM and the kernel, too. I have no idea why this doesn't seem to apply to the memory cgroup subsystem.
Yes, we should definitely do better during review process. -- Michal Hocko SUSE Labs SUSE LINUX s.r.o. Lihovarska 1060/12 190 00 Praha 9 Czech Republic -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>