Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality

From: Hillf Danton <hidden>
Date: 2021-09-11 03:11:56

On Fri, 10 Sep 2021 17:11:05 -0700 Mike Kravetz wrote:

On 9/10/21 1:20 AM, Michal Hocko wrote:

quoted

On Thu 09-09-21 15:45:45, Vlastimil Babka wrote:

quoted

I would say it's simply should_continue_reclaim() behaving similarly to
should_compact_retry(). We'll get compaction_suitable() returning
COMPACT_SKIPPED because the reclaimed pages have been immediately stolen,
and compaction indicates there's not enough base pages to begin with to form
a high-order pages. Since the stolen pages will appear on inactive lru, it
seems to be worth continuing reclaim to make enough free base pages for
compaction to no longer be skipped, because "inactive_lru_pages >
pages_for_compaction" is true.

So, both should_continue_reclaim() and should_compact_retry() are unable to
recognize that reclaimed pages are being stolen and limit the retries in
that case. The scenario seems to be uncommon, otherwise we'd be getting more
reports of that.

Thanks this sounds like a viable explanation. For some reason I have
thought that a single reclaim round can take that much time but reading
again it seems like a cumulative thing.

I do agree that we should consider the above scenario when bailing out.
It is simply unreasonable to reclaim so much memory without any forward
progress.

A very simple patch to bail early eliminated the LONG stalls and
decreased the time of a bulk allocation request.

Recall, the use case is converting 1G huge pages to the corresponding
amount of 2MB huge pages.  I have been told that 50GB is not an uncommon
amount of pages to 'convert'.  So, test case is free 50GB hugetlb pages
followed by allocate 25600 2MB hugetlb pages.  I have not done enough
runs to get anything statistically valid, but here are some ballpark
numbers.

Unmodified baseline:
-------------------
Unexpected number of 2MB pages after demote
  Expected 25600, have 19944

real    0m42.092s
user    0m0.008s
sys     0m41.467s

Retries capped by patch below:
------------------------------
Unexpected number of 2MB pages after demote
  Expected 25600, have 19395

real    0m12.951s
user    0m0.010s
sys     0m12.840s


The time of the operation is certainly better, but do note that we
allocated 549 fewer pages.  So, bailing early might cause some failures
if we continue trying.  It is all a tradeoff.

A hard one really to reach with all parties considerations met.

One of the reasons for
considering demote functionality is that the conversion would be done in
the hugetlb code without getting the page allocator, recalim,
compaction, etc involved.

Particularly when this involvement is a must have.

Given the role of stolen pages in the stall, a bisect run is throttle
the concurrent bulk allocators before a hammer on the stall.

static int hugetlb_2M_page_allocators;
static DECLARE_WAIT_QUEUE_HEAD(hugetlb_2M_page_allocators_wq);
static DEFINE_MUTEX(hugetlb_2M_page_allocators_mutex);

static bool inc_hugetlb_2M_page_allocators(void)
{
	bool rc = false;

	mutex_lock(&hugetlb_2M_page_allocators_mutex);
	if (hugetlb_2M_page_allocators < 2) {
		rc = true;
		++hugetlb_2M_page_allocators;
	}
	mutex_unlock(&hugetlb_2M_page_allocators_mutex);

	return rc;
}

static void dec_hugetlb_2M_page_allocators(void)
{
	mutex_lock(&hugetlb_2M_page_allocators_mutex);
	--hugetlb_2M_page_allocators;
	wake_up(&hugetlb_2M_page_allocators_wq);
	mutex_unlock(&hugetlb_2M_page_allocators_mutex);
}

static struct page *throttle_hugetlb_2M_page_allocators(void)
{
	struct page *page;

	wait_event(hugetlb_2M_page_allocators_wq,
		   true == inc_hugetlb_2M_page_allocators());

	page = __alloc_pages_nodemask(gfp_mask, order2M, nid, nmask);

	dec_hugetlb_2M_page_allocators();
	return page;
}

quoted hunk ↗ jump to hunk

quoted

A more robust way to address this problem (which is not really new)
would be to privatize reclaimed pages in the direct reclaim context and
reuse them in the compaction so that it doesn't have to just
pro-actively reclaim to keep watermarks good enough. A normal low order
requests would benefit from that as well because the reclaim couldn't be
stolen from them as well.

That does sound interesting.  Perhaps a longer term project.

quoted

Another approach would be to detect the situation and treat reclaim
success which immediatelly leads to compaction deferral due to
watermarks as a failure.

I'll look into detecting and responding to this.

Simple limit retries patch:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eeb3a9c..16f3055 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c

@@ -4896,6 +4896,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)

	int no_progress_loops;
	unsigned int cpuset_mems_cookie;
	int reserve_flags;
+	unsigned long tries = 0, max_tries = 0;

	/*
	 * We also sanity check to catch abuse of atomic reserves being used by

@@ -4904,6 +4905,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)

	if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
		gfp_mask &= ~__GFP_ATOMIC;
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		max_tries = 1Ul << order;

retry_cpuset:
	compaction_retries = 0;

@@ -4996,6 +4999,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)

	}

retry:
+	tries++;
	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
	if (alloc_flags & ALLOC_KSWAPD)
		wake_all_kswapds(order, gfp_mask, ac);

@@ -5064,8 +5068,18 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)

	if (did_some_progress > 0 &&
			should_compact_retry(ac, order, alloc_flags,
				compact_result, &compact_priority,
-				&compaction_retries))
+				&compaction_retries)) {
+		/*
+		 * In one pathological case, pages can be stolen immediately
+		 * after reclaimed.  It looks like we are making progress, and
+		 * compaction_retries is not incremented.  This could cause
+		 * an indefinite number of retries.  Cap the number of retries
+		 * for costly orders.
+		 */
+		if (max_tries && tries > max_tries)
+			goto nopage;
		goto retry;
+	}


	/* Deal with possible cpuset update races before we start OOM killing */

diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeae2f6..519e16e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c

@@ -2894,10 +2894,14 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)

	struct lruvec *target_lruvec;
	bool reclaimable = false;
	unsigned long file;
+	unsigned long tries = 0, max_tries = 0;

	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
+	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+		max_tries = 1UL << sc->order;

again:
+	tries++;
	memset(&sc->nr, 0, sizeof(sc->nr));

	nr_reclaimed = sc->nr_reclaimed;

@@ -3066,9 +3070,20 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)

		wait_iff_congested(BLK_RW_ASYNC, HZ/10);

	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
-				    sc))
+				    sc)) {
+		/*
+		 * In one pathological case, pages can be stolen immediately
+		 * after being reclaimed.  This can cause an indefinite number
+		 * of retries.  Limit the number of retries for costly orders.
+		 */
+		if (!current_is_kswapd() &&
+		    max_tries && tries > max_tries &&
+		    sc->nr_reclaimed > sc->nr_to_reclaim)
+			goto done;
		goto again;
+	}

+done:
	/*
	 * Kswapd gives up on balancing particular nodes after too
	 * many failures to reclaim anything from them and goes to


-- 
Mike Kravetz

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help