Re: [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made

From: Mel Gorman <hidden>
Date: 2021-11-24 10:32:25
Also in: linux-fsdevel, lkml
Subsystem: memory management, memory management - mglru (multi-gen lru), memory management - reclaim, the rest · Maintainers: Andrew Morton, Johannes Weiner, Linus Torvalds

On Tue, Nov 23, 2021 at 05:19:12PM -0800, Darrick J. Wong wrote:

On Fri, Oct 22, 2021 at 03:46:46PM +0100, Mel Gorman wrote:


Memcg reclaim throttles on congestion if no reclaim progress is made.
This makes little sense, it might be due to writeback or a host of
other factors.

For !memcg reclaim, it's messy. Direct reclaim primarily is throttled
in the page allocator if it is failing to make progress. Kswapd
throttles if too many pages are under writeback and marked for
immediate reclaim.

This patch explicitly throttles if reclaim is failing to make progress.

Hi Mel,

Ever since Christoph broke swapfiles, I've been carrying around a little
fstest in my dev tree[1] that tries to exercise paging things in and out
of a swapfile.  Sadly I've been trapped in about three dozen customer
escalations for over a month, which means I haven't been able to do much
upstream in weeks.  Like submit this test upstream. :(

Now that I've finally gotten around to trying out a 5.16-rc2 build, I
notice that the runtime of this test has gone from ~5s to 2 hours.
Among other things that it does, the test sets up a cgroup with a memory
controller limiting the memory usage to 25MB, then runs a program that
tries to dirty 50MB of memory.  There's 2GB of memory in the VM, so
we're not running reclaim globally, but the cgroup gets throttled very
severely.

Ok, so this test cannot make progress until some of the cgroup pages get
cleaned. What is the expectation for the test? Should it OOM or do you
expect it to have spin-like behaviour until some writeback completes?
I'm guessing you'd prefer it to spin and right now it's sleeping far
too much.

AFAICT the system is mostly idle, but it's difficult to tell because ps
and top also get stuck waiting for this cgroup for whatever reason.

But this is surprising because I expect that ps and top are not running
within the cgroup. Was /proc/PID/stack readable?

My uninformed speculation is that usemem_and_swapoff takes a page fault
while dirtying the 50MB memory buffer, prepares to pull a page in from
swap, tries to evict another page to stay under the memcg limit, but
that decides that it's making no progress and calls
reclaim_throttle(..., VMSCAN_THROTTLE_NOPROGRESS).

The sleep is uninterruptible, so I can't even kill -9 fstests to shut it
down.  Eventually we either finish the test or (for the mlock part) the
OOM killer actually kills the process, but this takes a very long time.

The sleep can be made interruptible.

Any thoughts?  For now I can just hack around this by skipping
reclaim_throttle if cgroup_reclaim() == true, but that's probably not
the correct fix. :)

No, it wouldn't be, but one possibility is throttling for only 1 jiffy
if reclaiming within a memcg and the zone is balanced overall.

The interruptible part should just be the patch below. I need to poke at
the cgroup limit part a bit.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb9584641ac7..07db03883062 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1068,7 +1068,7 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
 		break;
 	}
 
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
 	ret = schedule_timeout(timeout);
 	finish_wait(wqh, &wait);
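
The patch above only covers the interruptible sleep. The other idea mentioned earlier, throttling for just 1 jiffy when reclaiming within a memcg while the zone is balanced overall, could look roughly like the standalone sketch below. This is an illustration only, not kernel code: `sketch_noprogress_timeout()`, its two boolean parameters, and `SKETCH_DEFAULT_TIMEOUT` are hypothetical names, and the default value is a placeholder because the thread does not state the real timeout.

```c
#include <stdbool.h>

/*
 * Placeholder default sleep, in jiffies. The real timeout used by
 * reclaim_throttle() is not given in this thread; this value is an assumption.
 */
#define SKETCH_DEFAULT_TIMEOUT 25L

/*
 * Hypothetical helper (not a function in mm/vmscan.c): choose how long a
 * reclaimer sleeps after failing to make progress.
 *
 * memcg_reclaim: reclaim was triggered by a cgroup limit, not global pressure.
 * zone_balanced: the zone is balanced overall, so memory is not globally short.
 */
static long sketch_noprogress_timeout(bool memcg_reclaim, bool zone_balanced)
{
	/* Only the cgroup is over its limit: yield briefly, then retry. */
	if (memcg_reclaim && zone_balanced)
		return 1;

	/* Otherwise keep the longer default throttle. */
	return SKETCH_DEFAULT_TIMEOUT;
}
```

With these assumptions, a cgroup limited to 25MB on an otherwise idle 2GB machine would sleep for 1 jiffy per failed reclaim attempt rather than the full default, which is closer to the spin-like behaviour discussed above.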
