Re: [PATCH v4 1/1] mm: vmscan: Reduce throttling due to a failure to make progress

From: Mel Gorman <hidden>
Date: 2021-12-03 19:08:13
Also in: linux-fsdevel, lkml, regressions

On Fri, Dec 03, 2021 at 09:50:51AM -0800, Shakeel Butt wrote:

On Fri, Dec 3, 2021 at 1:01 AM Mel Gorman [off-list ref] wrote:

quoted

[...]

quoted

Not recently that I'm aware of but historically reclaim has been plagued by
at least two classes of problems -- premature OOM and excessive CPU usage
churning through the LRU. Going back, the solution was basically to sleep
something like "disable kswapd if it fails to make progress for too long".
Commit 69392a403f49 addressed a case where calling congestion_wait might as
well have been schedule_timeout_uninterruptible(HZ/10) because congestion
is no longer tracked by the block layer.

Hence 69392a403f49 allows reclaim to throttle on NOPROGRESS but if
another task makes progress, the throttled tasks can be woken before the
timeout. The flaw was throttling too easily or for too long delaying OOM
being properly detected.

To remove congestion_wait of mem_cgroup_force_empty_write(), the
commit 69392a403f49 has changed the behavior of all memcg reclaim
codepaths as well as direct global reclaimers.

Well, yes, it moved it to stalling on writeback if it's in progress
and waking up if writeback makes forward progress instead of a
schedule_timeout_interruptible().

Were there other
congestion_wait() instances which commit 69392a403f49 was targeting
but those congestion_wait() were replaced/removed by different
commits?

Yes, the series removed congestion_wait from other places because the
interface is broken and has been for a long time.

[...]

quoted

Isn't it better that the reclaim returns why it is failing instead of
littering the reclaim code with 'is this global reclaim', 'is this
memcg reclaim', 'am I kswapd' which is also a layering violation. IMO
this is the direction we should be going towards though not asking to
do this now.

It's not clear why you think the page allocator can make better decisions
about reclaim than reclaim can. It might make sense if callers were
returned enough information to make a decision but even if they could,
it would not be popular as the API would be difficult to use properly.

The above is a separate discussion for later.

quoted

Is your primary objection the cgroup_reclaim(sc) check?

No, I am of the opinion that we should revert 69392a403f49 and we
should have just replaced congestion_wait in
mem_cgroup_force_empty_write with a simple
schedule_timeout_interruptible.

That is a bit weak. Depending on the type of storage, writeback may
completes in microseconds or seconds. The event used to be "sleep until
congestion clears" which is no longer an event that can be waited upon
in the vast majority of cases (NFS being an obvious exception). Now,
it may throttle writeback on a waitqueue and if enough writeback
completes, the task will be woken before the timeout to minimise the
stall. schedule_timeout_interruptible() always waits for a fixed duration
regardless of what else happens in the meantime.

The memory.force_empty is a cgroup v1
interface (to be deprecated) and it is very normal to expect that the
user will trigger that interface multiple times. We should not change
the behavior of all the memcg reclaimers and direct global reclaimers
so that we can remove congestion_wait from
mem_cgroup_force_empty_write.

The mem_cgroup_force_empty_write() path will throttle on writeback in
the same way that global reclaim does at this point

                /*
                 * If kswapd scans pages marked for immediate
                 * reclaim and under writeback (nr_immediate), it
                 * implies that pages are cycling through the LRU
                 * faster than they are written so forcibly stall
                 * until some pages complete writeback.
                 */
                if (sc->nr.immediate)
                        reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);

With this patch, memcg does not stall on NOPROGRESS.

quoted

If so, I can
remove it. While there is a mild risk that OOM would be delayed, it's very
unlikely because a memcg failing to make progress in the local case will
probably call cond_resched() if there are not lots of of pages pending
writes globally.

quoted

Regarding this patch and 69392a403f49, I am still confused on the main
motivation behind 69392a403f49 to change the behavior of 'direct
reclaimers from page allocator'.

The main motivation of the series overall was to remove the reliance on
congestion_wait and wait_iff_congested because both are fundamentally
broken when congestion is not tracked by the block layer. Replacing with
schedule_timeout_uninterruptible() would be silly because where possible
decisions on whether to pause or throttle should be based on events,
not time. For example, if there are too many pages waiting on writeback
then throttle but if writeback completes, wake the throttled tasks
instead of "sleep some time and hope for the best".

I am in agreement with the motivation of the whole series. I am just
making sure that the motivation of VMSCAN_THROTTLE_NOPROGRESS based
throttle is more than just the congestion_wait of
mem_cgroup_force_empty_write.

The commit that primarily targets congestion_wait is 8cd7c588decf
("mm/vmscan: throttle reclaim until some writeback completes if
congested"). The series recognises that there are other reasons why
reclaim can fail to make progress that is not directly writeback related.

-- 
Mel Gorman
SUSE Labs

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help