Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim

From: Mel Gorman <hidden>
Date: 2010-02-23 16:23:31

On Tue, Feb 23, 2010 at 12:55:51PM +1100, Anton Blanchard wrote:

 
Hi Mel,

I'm afraid I'm on vacation at the moment. This mail is costing me shots with
penalties every minute it's open.  It'll be early next week before I can look
at this closely.

Sorry.

You're pretty much on the button here. Only one thread at a time enters
zone_reclaim. The others back off and try the next zone in the zonelist
instead. I'm not sure what the original intention was but most likely it
was to prevent too many parallel reclaimers in the same zone potentially
dumping out way more data than necessary.
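The back-off behaviour described above can be modelled in userspace. What follows is a minimal sketch, not the kernel code: `zone_model`, `model_zone_reclaim` and the `MODEL_RECLAIM_*` constants are hypothetical names, and a C11 `atomic_flag` stands in for the `ZONE_RECLAIM_LOCKED` bit.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a zone; not the kernel's struct zone. */
struct zone_model {
	atomic_flag reclaim_locked;	/* models the ZONE_RECLAIM_LOCKED bit */
	unsigned long reclaimed;	/* pages "reclaimed" so far */
};

enum { MODEL_RECLAIM_NOSCAN = 0, MODEL_RECLAIM_DONE = 1 };

/*
 * Only one caller at a time gets past the test-and-set. Everyone else
 * returns immediately and is expected to try the next zone instead.
 */
static int model_zone_reclaim(struct zone_model *z, unsigned long nr_pages)
{
	if (atomic_flag_test_and_set(&z->reclaim_locked))
		return MODEL_RECLAIM_NOSCAN;	/* someone else is reclaiming */

	z->reclaimed += nr_pages;		/* placeholder for the real work */
	atomic_flag_clear(&z->reclaim_locked);
	return MODEL_RECLAIM_DONE;
}
```

A caller that sees `MODEL_RECLAIM_NOSCAN` does no work in this zone at all, which is why concurrent allocators can end up on a remote node even though local memory is reclaimable.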

I'm not sure if there is an easy way to fix this without penalising other
workloads though.

You could experiment with waiting on the bit if the GFP flags allow it? The
expectation would be that the reclaim operation does not take long. Wait
on the bit and, if you are making forward progress, recheck the
watermarks before continuing.
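One possible reading of this suggestion, sketched in userspace C. Everything here is an assumption made for illustration: `zone_sim`, `SIM_GFP_WAIT`, `sim_zone_reclaim` and the result codes are hypothetical names, the watermark test is a plain comparison, and the wait is a simple spin rather than a real sleep on the bit.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SIM_GFP_WAIT 0x1u	/* hypothetical "caller may wait" flag */

/* Hypothetical userspace stand-in for a zone; not the kernel's struct zone. */
struct zone_sim {
	atomic_flag reclaim_locked;	/* models ZONE_RECLAIM_LOCKED */
	unsigned long free_pages;
	unsigned long watermark;
};

enum sim_result { SIM_NOSCAN, SIM_WATERMARK_OK, SIM_RECLAIMED };

static enum sim_result sim_zone_reclaim(struct zone_sim *z, unsigned int gfp,
					unsigned long nr_pages)
{
	/* Wait for the current reclaimer only if the caller is allowed to. */
	while (atomic_flag_test_and_set(&z->reclaim_locked)) {
		if (!(gfp & SIM_GFP_WAIT))
			return SIM_NOSCAN;	/* old behaviour: back off */
	}

	/* The previous holder may already have freed enough; recheck first. */
	if (z->free_pages >= z->watermark) {
		atomic_flag_clear(&z->reclaim_locked);
		return SIM_WATERMARK_OK;
	}

	z->free_pages += nr_pages;		/* placeholder for reclaim work */
	atomic_flag_clear(&z->reclaim_locked);
	return SIM_RECLAIMED;
}
```

The recheck after acquiring the flag is what keeps waiters from reclaiming more than necessary: a thread that waited behind a successful reclaimer returns without scanning.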

Thanks to you and Christoph for some suggestions to try. Attached is a
chart showing the results of the following tests:


baseline.txt
The current ppc64 default of zone_reclaim_mode = 0. As expected we see
no change in remote node memory usage even after 10 iterations.

zone_reclaim_mode.txt
Now we set zone_reclaim_mode = 1. On each iteration we continue to improve,
but even after 10 runs of stream we have > 10% remote node memory usage.

reclaim_4096_pages.txt
Instead of reclaiming 32 pages at a time, we try for a much larger batch
of 4096. The slope is much steeper but it still takes around 6 iterations
to get almost all local node memory.

wait_on_busy_flag.txt
Here we busy wait if the ZONE_RECLAIM_LOCKED flag is set. As you suggest
we would need to check the GFP flags etc, but so far it looks the most
promising. We only get a few percent of remote node memory on the first
iteration and get all local node by the second.


Perhaps a combination of larger batch size and waiting on the busy
flag is the way to go?

Anton

--- mm/vmscan.c~	2010-02-21 23:47:14.000000000 -0600
+++ mm/vmscan.c	2010-02-22 03:22:01.000000000 -0600

@@ -2534,7 +2534,7 @@
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
 		.may_swap = 1,
 		.nr_to_reclaim = max_t(unsigned long, nr_pages,
-				       SWAP_CLUSTER_MAX),
+				       4096),
 		.gfp_mask = gfp_mask,
 		.swappiness = vm_swappiness,
 		.order = order,
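The effect of the hunk above on the reclaim target can be checked with a small standalone calculation. This is a sketch under stated assumptions: `model_nr_to_reclaim` is a hypothetical helper, the request size is taken to be `1UL << order` pages, and the default floor of 32 is the per-pass batch mentioned earlier in this mail.

```c
#define MODEL_DEFAULT_BATCH 32UL	/* default floor, per the mail */
#define MODEL_LARGE_BATCH   4096UL	/* floor used in the experiment */

/*
 * Reclaim target: the larger of the request size and the batch floor,
 * mirroring the max_t(unsigned long, nr_pages, ...) in the hunk.
 */
static unsigned long model_nr_to_reclaim(unsigned int order,
					 unsigned long batch)
{
	unsigned long nr_pages = 1UL << order;	/* assumed request size */

	return nr_pages > batch ? nr_pages : batch;
}
```

For an order-0 allocation the target rises from 32 to 4096 pages per pass. Only requests larger than the floor (order 13 and up with a 4096-page floor) are unaffected by the change.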

--- mm/vmscan.c~	2010-02-21 23:47:14.000000000 -0600
+++ mm/vmscan.c	2010-02-21 23:47:31.000000000 -0600

@@ -2634,8 +2634,8 @@
 	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
 		return ZONE_RECLAIM_NOSCAN;
 
-	if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
-		return ZONE_RECLAIM_NOSCAN;
+	while (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
+		cpu_relax();
 
 	ret = __zone_reclaim(zone, gfp_mask, order);
 	zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
