Re: [patch] mm: memcontrol: support transparent huge pages under pressure

From: Johannes Weiner <hannes@cmpxchg.org>
Date: 2014-09-23 11:17:13
Also in: linux-mm, lkml
Subsystem: control group - memory resource controller (memcg), memory management, the rest · Maintainers: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Andrew Morton, Linus Torvalds

On Mon, Sep 22, 2014 at 10:52:50PM -0700, Greg Thelen wrote:

On Fri, Sep 19 2014, Johannes Weiner wrote:

quoted

In a memcg with even just moderate cache pressure, success rates for
transparent huge page allocations drop to zero, wasting a lot of
effort that the allocator puts into assembling these pages.

The reason for this is that the memcg reclaim code was never designed
for higher-order charges.  It reclaims in small batches until there is
room for at least one page.  Huge pages charges only succeed when
these batches add up over a series of huge faults, which is unlikely
under any significant load involving order-0 allocations in the group.

Remove that loop on the memcg side in favor of passing the actual
reclaim goal to direct reclaim, which is already set up and optimized
to meet higher-order goals efficiently.

This brings memcg's THP policy in line with the system policy: if the
allocator painstakingly assembles a hugepage, memcg will at least make
an honest effort to charge it.  As a result, transparent hugepage
allocation rates amid cache activity are drastically improved:

                                      vanilla                 patched
pgalloc                 4717530.80 (  +0.00%)   4451376.40 (  -5.64%)
pgfault                  491370.60 (  +0.00%)    225477.40 ( -54.11%)
pgmajfault                    2.00 (  +0.00%)         1.80 (  -6.67%)
thp_fault_alloc               0.00 (  +0.00%)       531.60 (+100.00%)
thp_fault_fallback          749.00 (  +0.00%)       217.40 ( -70.88%)

[ Note: this may in turn increase memory consumption from internal
  fragmentation, which is an inherent risk of transparent hugepages.
  Some setups may have to adjust the memcg limits accordingly to
  accomodate this - or, if the machine is already packed to capacity,
  disable the transparent huge page feature. ]

We're using an earlier version of this patch, so I approve of the
general direction.  But I have some feedback.

The memsw aspect of this change seems somewhat separate.  Can it be
split into a different patch?

The memsw aspect of this patch seems to change behavior.  Is this
intended?  If so, a mention of it in the commit log would assuage the
reader.  I'll explain...  Assume a machine with swap enabled and
res.limit==memsw.limit, thus memsw_is_minimum is true.  My understanding
is that memsw.usage represents sum(ram_usage, swap_usage).  So when
memsw_is_minimum=true, then both swap_usage=0 and
memsw.usage==res.usage.  In this condition, if res usage is at limit
then there's no point in swapping because memsw.usage is already
maximal.  Prior to this patch I think the kernel did the right thing,
but not afterwards.

Before this patch:
  if res.usage == res.limit, try_charge() indirectly calls
  try_to_free_mem_cgroup_pages(noswap=true)

After this patch:
  if res.usage == res.limit, try_charge() calls
  try_to_free_mem_cgroup_pages(may_swap=true)

Notice the inverted swap-is-allowed value.

For some reason I had myself convinced that this is dead code due to a
change in callsites a long time ago, but you are right that currently
try_charge() relies on it, thanks for pointing it out.

However, memsw is always equal to or bigger than the memory limit - so
instead of keeping a separate state variable to track when memory
failure implies memsw failure, couldn't we just charge memsw first?

How about the following?  But yeah, I'd split this into a separate
patch now.

---
 mm/memcontrol.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e2def11f1ec1..7c9a8971d0f4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c

@@ -2497,16 +2497,17 @@ retry:
 		goto done;
 
 	size = batch * PAGE_SIZE;
-	if (!res_counter_charge(&memcg->res, size, &fail_res)) {
-		if (!do_swap_account)
+	if (!do_swap_account ||
+	    !res_counter_charge(&memcg->memsw, size, &fail_res)) {
+		if (!res_counter_charge(&memcg->res, size, &fail_res))
 			goto done_restock;
-		if (!res_counter_charge(&memcg->memsw, size, &fail_res))
-			goto done_restock;
-		res_counter_uncharge(&memcg->res, size);
+		if (do_swap_account)
+			res_counter_uncharge(&memcg->memsw, size);
+		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
+	} else {
 		mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
 		may_swap = false;
-	} else
-		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
+	}
 
 	if (batch > nr_pages) {
 		batch = nr_pages;

-- 
2.1.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help