Re: [PATCH v4 1/1] mm: vmscan: Reduce throttling due to a failure to make progress
From: Mel Gorman <hidden>
Date: 2021-12-07 09:28:11
Also in:
linux-fsdevel, lkml, regressions
On Mon, Dec 06, 2021 at 11:14:58PM -0800, Shakeel Butt wrote:
On Mon, Dec 6, 2021 at 3:25 AM Mel Gorman [off-list ref] wrote:
On Sun, Dec 05, 2021 at 10:06:27PM -0800, Shakeel Butt wrote:
On Fri, Dec 3, 2021 at 11:08 AM Mel Gorman [off-list ref] wrote:
[...]
I am in agreement with the motivation of the whole series. I am just making sure that the motivation of VMSCAN_THROTTLE_NOPROGRESS based throttle is more than just the congestion_wait of mem_cgroup_force_empty_write.

The commit that primarily targets congestion_wait is 8cd7c588decf ("mm/vmscan: throttle reclaim until some writeback completes if congested"). The series recognises that there are other reasons why reclaim can fail to make progress that is not directly writeback related.

I agree with throttling for VMSCAN_THROTTLE_[WRITEBACK|ISOLATED] reasons. Please explain why we should throttle for VMSCAN_THROTTLE_NOPROGRESS? Also 69392a403f49 claims "Direct reclaim primarily is throttled in the page allocator if it is failing to make progress.", can you please explain how?

It could happen if the pages on the LRU are being reactivated continually or holding an elevated reference count for some reason (e.g. gup, page migration etc). The event is probably transient, hence the short throttling.

What's the worst that can happen if the kernel doesn't throttle at all for these transient scenarios? Premature oom-kills?
Excessive CPU usage in reclaim, potential premature OOM kills.
The kernel already has some protection against such situations through retries, i.e. 16 consecutive reclaim attempts have to fail before reclaim gives up.
The retries mitigate the premature OOM kills but not the excessive CPU usage.
Anyways, I have shared my view which is 'no need to throttle at all for no-progress reclaims for now and course correct if there are complaints in future' but will not block the patch.
We've gone through periods of bugs that had either direct reclaim or kswapd pegged at 100% CPU usage. While kswapd now just stops, the patch still minimises the risk of excessive CPU usage bugs due to direct reclaim.

-- 
Mel Gorman
SUSE Labs
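
To illustrate the trade-off discussed above (retries bound premature OOM kills but not CPU usage, while a short throttle addresses both), here is a toy userspace model. It is only a sketch under stated assumptions and is not the kernel's implementation: `simulate_reclaim`, `struct reclaim_result`, `SKETCH_MAX_RETRIES` and the notion of "ticks" are all hypothetical names invented for this example. The only value taken from the thread is the retry limit of 16.

```c
#include <stdbool.h>

/* Hypothetical constant: mirrors the "16 consecutive failures" in the thread. */
#define SKETCH_MAX_RETRIES 16

/* Hypothetical result type for this toy model. */
struct reclaim_result {
	int attempts;	/* reclaim passes executed, a proxy for CPU burned */
	bool gave_up;	/* true if the retry limit was hit (OOM path in this model) */
};

/*
 * Toy model: reclaim cannot make progress until stall_ticks have elapsed
 * (a transient condition such as elevated page references). Each pass
 * costs one tick of CPU time. After a failed pass the task sleeps for
 * throttle_ticks, which advances time without consuming CPU.
 */
static struct reclaim_result simulate_reclaim(int stall_ticks, int throttle_ticks)
{
	struct reclaim_result res = { 0, false };
	int now = 0;
	int no_progress = 0;

	for (;;) {
		res.attempts++;
		now++;				/* one reclaim pass */

		if (now > stall_ticks)		/* transient condition cleared */
			return res;

		if (++no_progress >= SKETCH_MAX_RETRIES) {
			res.gave_up = true;	/* retries exhausted */
			return res;
		}

		now += throttle_ticks;		/* short throttle, no CPU cost */
	}
}
```

In this model, a 100-tick stall with no throttling (`simulate_reclaim(100, 0)`) burns 16 passes and then gives up, whereas a 9-tick throttle (`simulate_reclaim(100, 9)`) succeeds on the 11th pass. For a shorter 10-tick stall, both variants succeed, but the unthrottled one needs 11 passes against 3 with a 4-tick throttle.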