Re: [PATCH 2/6] mm/page_alloc: Disassociate the pcp->high from pcp->batch

From: Dave Hansen <hidden>
Date: 2021-05-21 21:52:43
Also in: lkml

On 5/21/21 3:28 AM, Mel Gorman wrote:

Note that in this patch the pcp->high values are adjusted after memory
hotplug events, min_free_kbytes adjustments and watermark scale factor
adjustments but not CPU hotplug events.

Not that it was a long wait to figure it out, but I'd probably say:

	"CPU hotplug events are handled later in the series".

instead of just saying they're not handled.

Before grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  378
              batch: 63

After grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  649
              batch: 63

You noted the relationship between pcp->high and zone lock contention.
Larger ->high values mean less contention.  It's probably also worth
noting the trend of having more logical CPUs per NUMA node.

I have the feeling when this was put in place it wasn't uncommon to have
somewhere between 1 and 8 CPUs in a node pounding on a zone.

Today, having ~60 is common.  I've occasionally resorted to recommending
that folks enable hardware features like Sub-NUMA-Clustering [1] since
it increases the number of zones and decreases the number of CPUs
pounding on each zone lock.

1.
https://software.intel.com/content/www/us/en/develop/articles/intel-xeon-processor-scalable-family-technical-overview.html

quoted hunk ↗ jump to hunk

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a48f305f0381..bf5cdc466e6c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c

@@ -2163,14 +2163,6 @@ void __init page_alloc_init_late(void)
 	/* Block until all are initialised */
 	wait_for_completion(&pgdat_init_all_done_comp);
 
-	/*
-	 * The number of managed pages has changed due to the initialisation
-	 * so the pcpu batch and high limits needs to be updated or the limits
-	 * will be artificially small.
-	 */
-	for_each_populated_zone(zone)
-		zone_pcp_update(zone);
-
 	/*
 	 * We initialized the rest of the deferred pages.  Permanently disable
 	 * on-demand struct page initialization.

@@ -6594,13 +6586,12 @@ static int zone_batchsize(struct zone *zone)
 	int batch;
 
 	/*
-	 * The per-cpu-pages pools are set to around 1000th of the
-	 * size of the zone.
+	 * The number of pages to batch allocate is either 0.1%

Probably worth making that "~0.1%" just in case someone goes looking for
the /1000 and can't find it.

quoted hunk ↗ jump to hunk

+	 * of the zone or 1MB, whichever is smaller. The batch
+	 * size is striking a balance between allocation latency
+	 * and zone lock contention.
 	 */
-	batch = zone_managed_pages(zone) / 1024;
-	/* But no more than a meg. */
-	if (batch * PAGE_SIZE > 1024 * 1024)
-		batch = (1024 * 1024) / PAGE_SIZE;
+	batch = min(zone_managed_pages(zone) >> 10, (1024 * 1024) / PAGE_SIZE);
 	batch /= 4;		/* We effectively *= 4 below */
 	if (batch < 1)
 		batch = 1;

@@ -6637,6 +6628,27 @@ static int zone_batchsize(struct zone *zone)
 #endif
 }
 
+static int zone_highsize(struct zone *zone)
+{
+#ifdef CONFIG_MMU
+	int high;
+	int nr_local_cpus;
+
+	/*
+	 * The high value of the pcp is based on the zone low watermark
+	 * when reclaim is potentially active spread across the online
+	 * CPUs local to a zone. Note that early in boot that CPUs may
+	 * not be online yet.
+	 */

FWIW, I like the way the changelog talked about this a bit better, with
the goal of avoiding background reclaim even in the face of a bunch of
full pcp's.

+	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
+	high = low_wmark_pages(zone) / nr_local_cpus;

I'm a little concerned that this might get out of hand on really big
nodes with no CPUs.  For persistent memory (which we *do* toss into the
page allocator for volatile use), we can have multi-terabyte zones with
no CPUs in the node.

Also, while the CPUs which are on the node are the ones *most* likely to
be hitting the ->high limit, we do *keep* a pcp for each possible CPU.
So, the amount of memory which can actually be sequestered is
num_online_cpus()*high.  Right?

*That* might really get out of hand if we have nr_local_cpus=1.

We might want some overall cap on 'high', or even to scale it
differently for the zone-local cpus' pcps versus remote.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help