Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
From: Mel Gorman <hidden>
Date: 2010-02-23 16:23:31
On Tue, Feb 23, 2010 at 12:55:51PM +1100, Anton Blanchard wrote:
Hi Mel,
I'm afraid I'm on vacation at the moment. This mail is costing me shots with penalties every minute it's open. It'll be early next week before I can look at this closely. Sorry.
You're pretty much on the button here. Only one thread at a time enters zone_reclaim. The others back off and try the next zone in the zonelist instead. I'm not sure what the original intention was but most likely it was to prevent too many parallel reclaimers in the same zone potentially dumping out way more data than necessary.
I'm not sure if there is an easy way to fix this without penalising other workloads though.

You could experiment with waiting on the bit if the GFP flags allow it? The expectation would be that the reclaim operation does not take long. Wait on the bit and, if you are making forward progress, recheck the watermarks before continuing.

Thanks to you and Christoph for some suggestions to try. Attached is a chart showing the results of the following tests:

baseline.txt
The current ppc64 default of zone_reclaim_mode = 0. As expected we see no change in remote node memory usage even after 10 iterations.

zone_reclaim_mode.txt
Now we set zone_reclaim_mode = 1. On each iteration we continue to improve, but even after 10 runs of stream we have > 10% remote node memory usage.

reclaim_4096_pages.txt
Instead of reclaiming 32 pages at a time, we try for a much larger batch of 4096. The slope is much steeper but it still takes around 6 iterations to get almost all local node memory.

wait_on_busy_flag.txt
Here we busy wait if the ZONE_RECLAIM_LOCKED flag is set. As you suggest we would need to check the GFP flags etc, but so far it looks the most promising. We only get a few percent of remote node memory on the first iteration and get all local node by the second.

Perhaps a combination of larger batch size and waiting on the busy flag is the way to go?

Anton
--- mm/vmscan.c~	2010-02-21 23:47:14.000000000 -0600
+++ mm/vmscan.c	2010-02-22 03:22:01.000000000 -0600
@@ -2534,7 +2534,7 @@
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
 		.may_swap = 1,
 		.nr_to_reclaim = max_t(unsigned long, nr_pages,
-				       SWAP_CLUSTER_MAX),
+				       4096),
 		.gfp_mask = gfp_mask,
 		.swappiness = vm_swappiness,
 		.order = order,
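To make the effect of the batch-size hunk concrete, here is a minimal userspace sketch of how the reclaim target is chosen. It assumes nr_pages is 1 << order (that definition is not part of the quoted hunk) and that SWAP_CLUSTER_MAX is 32, which matches the "32 pages at a time" in the test description. reclaim_target() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <limits.h>

/* Assumed value; consistent with "32 pages at a time" in the mail. */
#define SWAP_CLUSTER_MAX 32UL

/*
 * Hypothetical helper modelling
 *     .nr_to_reclaim = max_t(unsigned long, nr_pages, min_batch)
 * for an allocation of 2^order pages (nr_pages = 1 << order is an
 * assumption, see above).
 */
static unsigned long reclaim_target(unsigned int order, unsigned long min_batch)
{
	unsigned long nr_pages;

	/* Guard against shifting by the full width of unsigned long. */
	assert(order < sizeof(unsigned long) * CHAR_BIT);

	nr_pages = 1UL << order;
	return nr_pages > min_batch ? nr_pages : min_batch;
}
```

For an order-0 allocation, reclaim_target(0, SWAP_CLUSTER_MAX) yields 32 pages while reclaim_target(0, 4096) yields 4096, so the floor only matters while the request itself is smaller than the batch.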
--- mm/vmscan.c~	2010-02-21 23:47:14.000000000 -0600
+++ mm/vmscan.c	2010-02-21 23:47:31.000000000 -0600
@@ -2634,8 +2634,8 @@
 	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
 		return ZONE_RECLAIM_NOSCAN;
 
-	if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
-		return ZONE_RECLAIM_NOSCAN;
+	while (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
+		cpu_relax();
 
 	ret = __zone_reclaim(zone, gfp_mask, order);
 	zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
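The two behaviours under discussion (back off when the bit is set, or wait for it) can be sketched in userspace with C11 atomics. This is only an illustration under stated assumptions: struct zone_model, zone_reclaim_model(), the may_wait parameter and the RECLAIM_* values are all hypothetical stand-ins, and may_wait models "the GFP flags allow it" rather than any specific flag test.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes; the values are illustrative only. */
enum { RECLAIM_NOSCAN = 0, RECLAIM_DONE = 1 };

/* Hypothetical userspace model of a zone: one lock bit and a work counter. */
struct zone_model {
	atomic_bool reclaim_locked;	/* models ZONE_RECLAIM_LOCKED */
	int reclaim_runs;		/* times the reclaim body actually ran */
};

static void zone_model_init(struct zone_model *z)
{
	atomic_init(&z->reclaim_locked, false);
	z->reclaim_runs = 0;
}

/*
 * atomic_exchange() returns the previous value, like a test-and-set.
 * If the bit was already set and the caller may not wait, back off as
 * the unpatched code does. If the caller may wait, retry until the
 * current holder clears the bit, as in the busy-wait patch.
 */
static int zone_reclaim_model(struct zone_model *z, bool may_wait)
{
	assert(z != NULL);

	while (atomic_exchange(&z->reclaim_locked, true)) {
		if (!may_wait)
			return RECLAIM_NOSCAN;
		/* the patch calls cpu_relax() here; this model just retries */
	}

	z->reclaim_runs++;	/* stands in for __zone_reclaim() */
	atomic_store(&z->reclaim_locked, false);
	return RECLAIM_DONE;
}
```

A caller that gets RECLAIM_NOSCAN would move on to the next zone in the zonelist. The model leaves out the recheck of the watermarks after a wait, which would be needed in a real implementation.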
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab