Re: [PATCH 4/4] mm: vmscan: If kswapd has been running too long, allow it to sleep
From: Minchan Kim <hidden>
Date: 2011-05-17 00:48:59
Also in:
linux-fsdevel, linux-mm, lkml
On Tue, May 17, 2011 at 8:50 AM, Minchan Kim [off-list ref] wrote:
On Mon, May 16, 2011 at 7:27 PM, Mel Gorman [off-list ref] wrote:
On Mon, May 16, 2011 at 05:58:59PM +0900, Minchan Kim wrote:
On Mon, May 16, 2011 at 5:45 PM, Mel Gorman [off-list ref] wrote:
On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote:
On Mon, May 16, 2011 at 1:21 PM, James Bottomley [off-list ref] wrote:
On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
(2011/05/13 23:03), Mel Gorman wrote:
Under constant allocation pressure, kswapd can be in the situation where sleeping_prematurely() will always return true even if kswapd has been running a long time. Check if kswapd needs to be scheduled.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index af24d1e..4d24828 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 	unsigned long balanced = 0;
 	bool all_zones_ok = true;
+	/* If kswapd has been running too long, just sleep */
+	if (need_resched())
+		return false;
+

Hmm... I don't like this patch so much, because this code does

- don't sleep if kswapd got context switch at shrink_inactive_list

This isn't entirely true: need_resched() will be false, so we'll follow the normal path for determining whether to sleep or not, in effect leaving the current behaviour unchanged.

- sleep if kswapd didn't

This also isn't entirely true: whether need_resched() is true at this point depends on a whole lot more than whether we did a context switch in shrink_inactive. It mostly depends on how long we've been running without giving up the CPU. Generally that will mean we've been round the shrinker loop hundreds to thousands of times without sleeping.

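The disagreement above is about what the added check changes. As a rough illustration, the decision can be modelled in userspace C. This is a sketch, not kernel code: `struct kswapd_model`, `resched_pending` (standing in for need_resched()) and `zones_balanced` (standing in for the existing balance checks) are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel state; not real kernel symbols. */
struct kswapd_model {
	bool resched_pending;	/* models need_resched() */
	bool zones_balanced;	/* models the existing balance checks */
};

/*
 * Model of the patched check. Returning true means "sleeping now would
 * be premature"; returning false means kswapd may go to sleep.
 */
bool sleeping_prematurely_model(const struct kswapd_model *k)
{
	assert(k != NULL);

	/* The added check: the scheduler wants the CPU back, so just sleep. */
	if (k->resched_pending)
		return false;

	/* Otherwise the pre-patch logic decides, unchanged. */
	return !k->zones_balanced;
}
```

In this model, a clear `resched_pending` leaves the result identical to the pre-patch logic, which is the point made in the reply above. The flag only matters once it is set.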
It seems to be semi random behavior.

Well, we have to do something. Chris Mason first suspected the hang was a kswapd rescheduling problem a while ago. We tried putting cond_resched() calls in several places in the vmscan code, but to no avail.

Is it a result of a test with Hannes' patch (i.e., !pgdat_balanced)? If it isn't, it would be a no-op regardless of putting cond_resched in vmscan.c. Because, although we complete zone balancing, kswapd doesn't sleep as pgdat_balanced returns the wrong result. And at last the VM calls balance_pgdat. In this case, balance_pgdat returns without any work, as kswapd couldn't find zones which do not have enough free pages and goes to out. kswapd could repeat this work infinitely. So you don't have a chance to call cond_resched.

But if your test was with Hannes' patch, I am very curious how come kswapd consumes CPU a lot.

The need_resched() in sleeping_prematurely() seems to be about the best option. The other option might be just to put a cond_resched() in kswapd_try_to_sleep(), but that will really have about the same effect.

I don't oppose it, but before that, I think we have to know why kswapd consumes CPU a lot although we applied Hannes' patch.

Because it's still possible for processes to allocate pages at the same rate kswapd is freeing them, leading to a situation where kswapd does not consider the zone balanced for prolonged periods of time.

We have cond_resched in shrink_page_list, shrink_slab and balance_pgdat. So I think kswapd can be scheduled out, although it's scheduled in again after a short time as the scheduled tasks also need page reclaim. Although all tasks in the system need reclaim, kswapd consuming 99% CPU is a natural result, I think. Do I miss something?

Let's see; shrink_page_list() only applies if inactive pages were isolated, which in turn may not happen if all_unreclaimable is set in shrink_zones(). If, for whatever reason, all_unreclaimable is set on all zones, we can miss calling cond_resched().

shrink_slab only applies if we are reclaiming slab pages. If the first shrinker returns -1, we do not call cond_resched(). If that first shrinker is dcache and __GFP_FS is not set, direct reclaimers will not shrink at all.
However, if there are enough of them running or if one of the other shrinkers is running for a very long time, kswapd could be starved acquiring the shrinker_rwsem and never reaching the cond_resched().

Don't we have to move cond_resched?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 292582c..633e761 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -231,8 +231,10 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 	if (scanned == 0)
 		scanned = SWAP_CLUSTER_MAX;
-	if (!down_read_trylock(&shrinker_rwsem))
-		return 1;	/* Assume we'll be able to shrink next time */
+	if (!down_read_trylock(&shrinker_rwsem)) {
+		ret = 1;
+		goto out; /* Assume we'll be able to shrink next time */
+	}
 	list_for_each_entry(shrinker, &shrinker_list, list) {
 		unsigned long long delta;
@@ -280,12 +282,14 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 			count_vm_events(SLABS_SCANNED, this_scan);
 			total_scan -= this_scan;
-			cond_resched();
 		}
 		shrinker->nr += total_scan;
+		cond_resched();
 	}
 	up_read(&shrinker_rwsem);
+out:
+	cond_resched();
 	return ret;
 }

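The control flow proposed in the diff above can be sketched as a small userspace model, to show that every exit path reaches a yield point, including the failed-trylock path. This is an illustration under stated assumptions, not the kernel function: `shrink_slab_model`, `struct shrink_stats`, `yield_point` (standing in for cond_resched()), `lock_available` (standing in for down_read_trylock() succeeding) and `nr_shrinkers` are all hypothetical, and the per-shrinker scanning is elided.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Counts how often the model reached a yield point (models cond_resched()). */
struct shrink_stats {
	unsigned long yields;
};

static void yield_point(struct shrink_stats *st)
{
	st->yields++;
}

/*
 * lock_available models down_read_trylock(&shrinker_rwsem) succeeding.
 * nr_shrinkers models the length of shrinker_list.
 */
unsigned long shrink_slab_model(bool lock_available, size_t nr_shrinkers,
				struct shrink_stats *st)
{
	unsigned long ret = 0;
	size_t i;

	assert(st != NULL);

	if (!lock_available) {
		ret = 1;	/* assume we'll be able to shrink next time */
		goto out;
	}

	for (i = 0; i < nr_shrinkers; i++) {
		/* Per-shrinker scanning elided in this model. */
		yield_point(st);	/* once per shrinker, as in the diff */
	}
out:
	yield_point(st);	/* reached on every exit path */
	return ret;
}
```

With the lock unavailable, the model still yields once before returning 1. With the lock held and three shrinkers, it yields four times: once per shrinker plus once at `out`.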
balance_pgdat() only calls cond_resched if the zones are not balanced. For a high-order allocation that is balanced, it checks order-0 again. During that window, order-0 might have become unbalanced, so it loops again for order-0 and returns to kswapd() that it was reclaiming for order-0. It can then find that a caller has rewoken kswapd for a high order and re-enter balance_pgdat() without ever having called cond_resched().

If kswapd reclaims order-0 followed by high order, it would have a chance to call cond_resched in shrink_page_list. But if all_unreclaimable is set on all zones, balance_pgdat could return any work.
Typo : without any work.
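The case described above, where a pass over the zones does no work and so never reaches a reschedule point, can be illustrated with a toy model. It is only a sketch of the argument, not of balance_pgdat() itself: `struct zone_model` and `balance_pass_model` are hypothetical names, and the model assumes a yield point is reached only for zones that are both unbalanced and reclaimable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone state for the model. */
struct zone_model {
	bool balanced;
	bool all_unreclaimable;
};

/*
 * Returns how many yield points one pass over the zones would reach.
 * A zone that is balanced or marked all_unreclaimable is skipped, so
 * no reclaim (and no yield) happens for it.
 */
unsigned long balance_pass_model(const struct zone_model *zones, size_t nr_zones)
{
	unsigned long yields = 0;
	size_t i;

	assert(zones != NULL || nr_zones == 0);

	for (i = 0; i < nr_zones; i++) {
		if (zones[i].balanced || zones[i].all_unreclaimable)
			continue;
		yields++;	/* reclaim ran for this zone and could yield */
	}
	return yields;
}
```

If every zone is unbalanced but marked all_unreclaimable, the model returns 0: the pass completes without any work and without yielding, however many times it is repeated.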
--
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html