Thread (65 messages) 65 messages, 13 authors, 2012-12-19

Re: kswapd craziness in 3.7

From: Johannes Weiner <hannes@cmpxchg.org>
Date: 2012-12-10 16:39:26
Also in: lkml

On Mon, Dec 10, 2012 at 11:03:37AM +0000, Mel Gorman wrote:
On Sat, Dec 08, 2012 at 05:01:42PM -0800, Linus Torvalds wrote:
quoted
On Sat, 8 Dec 2012, Zlatko Calusic wrote:
quoted
Or sooner... in short: nothing's changed!

On a 4GB RAM system, where applications use close to 2GB, kswapd likes to keep
around 1GB free (unused), leaving only 1GB for page/buffer cache. If I force
bigger page cache by reading a big file and thus use the unused 1GB of RAM,
kswapd will soon (in a matter of minutes) evict those (or other) pages out and
once again keep unused memory close to 1GB.
Ok, guys, what was the reclaim or kswapd patch during the merge window 
that actually caused all of these insane problems?
I believe commit c6543459 (mm: remove __GFP_NO_KSWAPD) is the primary
candidate. __GFP_NO_KSWAPD was originally introduced by THP because kswapd
was excessively reclaiming. kswapd would stay awake aggressively reclaiming
even if compaction was deferred. The flag was removed in this cycle when it
was expected that it was no longer necessary. I'm not foisting the blame
on Rik here, I was on the review list for that patch and did not identify
that it would cause this many problems either.
quoted
It seems it was more 
fundamentally buggered than the fifteen-million fixes for kswapd we have 
already picked up.
It was already fundamentally buggered up. The difference was it stayed
asleep for THP requests in earlier kernels.

There is a big difference between a direct reclaim/compaction for THP
and kswapd doing the same work. Direct reclaim/compaction will try once,
give up quickly and defer requests in the near future to avoid impacting
the system heavily for THP. The same applies for khugepaged.

kswapd is different. It can keep going until it meets its watermarks for
a THP allocation are met. Two reasons why it might keep going for a long
time are that compaction is being inefficient which we know it may be due
to crap like this

end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);

and the second reason is if the highest zone is relatively because
compaction_suitable will keep saying that allocations are failing due to
insufficient amounts of memory in the highest zone. It'll reclaim a little
from this highest zone and then shrink_slab() potentially dumping a large
amount of memory. This may be the case for Zlatko as with a 4G machine
his ZONE_NORMAL could be small depending on how the 32-bit address space
is used by his hardware.
Unlike direct reclaim, kswapd also never does sync migration.  Since
the fragmentation index is a ratio of free pages over free page
blocks, doing lightweight compaction that reduces the page blocks but
never really follows through to compact a THP block increases the free
memory requirement.

I thought about the small Normal zone too.  Direct reclaim/compaction
is fine with one zone being able to provide a THP, but kswapd requires
25% of the node.  A small ZONE_NORMAL would not be able to meet this
and so the bigger DMA32 zone would also be required to be balanced for
the THP allocation.
quoted
Mel? Ideas?
Consider reverting the revert of __GFP_NO_KSWAPD again until this can be
ironed out at a more reasonable pace. Rik? Johannes?
Yes, I also think we need more time for this.
quoted hunk ↗ jump to hunk
Verify if the shrinking slab is the issue with this brutually ugly
hack. Zlatko?
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b7ed376..2189d20 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2550,6 +2550,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 	unsigned long balanced;
 	int i;
 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
+	bool should_shrink_slab = true;
 	unsigned long total_scanned;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_soft_reclaimed;
@@ -2695,7 +2696,8 @@ loop_again:
 				shrink_zone(zone, &sc);
 
 				reclaim_state->reclaimed_slab = 0;
-				nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
+				if (should_shrink_slab)
+					nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
 				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
 				total_scanned += sc.nr_scanned;
 
@@ -2817,6 +2819,16 @@ out:
 	if (order) {
 		int zones_need_compaction = 1;
 
+		/*
+		 * Shrinking slab for high-order allocs can cause an excessive
+		 * amount of memory to be dumped. Only shrink slab once per
+		 * round for high-order allocs.
+		 *
+		 * This is a very stupid hack. balance_pgdat() is in serious
+		 * need of a rework
+		 */
+		should_shrink_slab = false;
+
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
I don't see a shrink_slab() invocation after this point since the
loop_again jumps in this loop where removed, so this shouldn't change
anything?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help