Re: [PATCH] backing_dev_info: introduce min_bw/max_bw limits

From: Michael Stapelberg <hidden>
Date: 2021-06-22 12:29:46
Also in: linux-fsdevel, lkml

Thanks for taking a look! Comments inline:

On Tue, 22 Jun 2021 at 14:12, Jan Kara [off-list ref] wrote:

On Mon 21-06-21 11:20:10, Michael Stapelberg wrote:

Hey Miklos

On Fri, 18 Jun 2021 at 16:42, Miklos Szeredi [off-list ref] wrote:

On Fri, 18 Jun 2021 at 10:31, Michael Stapelberg
[off-list ref] wrote:

Maybe, but I don’t have the expertise, motivation or time to
investigate this any further, let alone commit to getting it done.
During our previous discussion I got the impression that nobody else
had any cycles for this either:
https://lore.kernel.org/linux-fsdevel/CANnVG6n=ySfe1gOr=0ituQidp56idGARDKHzP0hv=ERedeMrMA@mail.gmail.com/

Have you had a look at the China LSF report at
http://bardofschool.blogspot.com/2011/?
The author of the heuristic has spent significant effort and time
coming up with what we currently have in the kernel:

"""
Fengguang said he draw more than 10K performance graphs and read even
more in the past year.
"""

This implies that making changes to the heuristic will not be a quick fix.

Having a piece of kernel code sitting there that nobody is willing to
fix is certainly not a great situation to be in.

Agreed.

And introducing band-aids is not going to improve the above situation;
more likely it will prolong it even further.

Sounds like “Perfect is the enemy of good” to me: you’re looking for a
perfect hypothetical solution, whereas we have a known-working, low-risk
fix for a real problem.

Could we find a solution where, in the medium to long term, the code in
question is improved, perhaps via a Summer of Code project or similar
community efforts, but until then we apply the patch at hand?

As I mentioned, I think adding min/max limits can be useful regardless
of how the heuristic itself changes.

If that turns out to be incorrect or undesired, we can still turn the
knobs into a no-op, if removal isn’t an option.

Well, removal of added knobs is more or less out of the question as it can
break some userspace. Similarly making them no-op is problematic unless we
are pretty certain it cannot break some existing setup. That's why we have
to think twice (or better three times ;) before adding any knobs. Also
honestly the knobs you suggest will be pretty hard to tune when there are
multiple cgroups with writeback control involved (which can be affected by
the same problems you observe as well). So I agree with Miklos that this is
not the right way to go. Speaking of tunables, did you try tuning
/sys/devices/virtual/bdi/<fuse-bdi>/min_ratio? I suspect that may
work around your problems...

Back then, I did try the various tunables (vm.dirty_ratio and
vm.dirty_background_ratio on the global level,
/sys/class/bdi/<bdi>/{min,max}_ratio on the file system level), and
they had no observable effect on the problem at all in my tests.
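
For readers who want to check which values these tunables currently hold, here is a minimal sketch that reads them from sysfs and procfs. The helper names (`read_bdi_ratios`, `read_vm_dirty_knobs`) are made up for illustration; only the paths come from the discussion above, and the sketch assumes the files contain plain integers.

```python
from pathlib import Path


def read_bdi_ratios(bdi_root="/sys/class/bdi"):
    """Return {bdi_name: {"min_ratio": int, "max_ratio": int}} for each BDI found.

    Hypothetical helper: knobs that are missing or unreadable are skipped.
    """
    result = {}
    root = Path(bdi_root)
    if not root.is_dir():
        return result
    for bdi in sorted(root.iterdir()):
        entry = {}
        for knob in ("min_ratio", "max_ratio"):
            try:
                entry[knob] = int((bdi / knob).read_text().strip())
            except (OSError, ValueError):
                continue  # knob absent, unreadable, or not an integer
        if entry:
            result[bdi.name] = entry
    return result


def read_vm_dirty_knobs(vm_root="/proc/sys/vm"):
    """Return the global dirty_ratio and dirty_background_ratio, where readable."""
    knobs = {}
    for name in ("dirty_ratio", "dirty_background_ratio"):
        try:
            knobs[name] = int((Path(vm_root) / name).read_text().strip())
        except (OSError, ValueError):
            continue
    return knobs


if __name__ == "__main__":
    print(read_vm_dirty_knobs())
    for name, ratios in read_bdi_ratios().items():
        print(name, ratios)
```

Both functions take the root directory as a parameter so they can be pointed at a scratch directory instead of the live system.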

Looking into your original report and the tracing you did (thanks for that,
really useful), it seems that the problem is that writeback bandwidth is
updated at most every 200ms (more frequent calls are just ignored), and the
updates are triggered only from balance_dirty_pages() (which runs when pages
are dirtied) and from inode writeback code. So if the workload tends to have
short spikes of activity and extended periods of quiet time, then writeback
bandwidth may indeed be seriously miscomputed, because we just won't update
writeback throughput after most of the writeback has happened, as you observed.
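
The effect described above can be illustrated with a toy model. This is a simplified sketch, not the kernel's algorithm: the class `BandwidthEstimator`, the linear `written_at` workload, and the unsmoothed delta-over-time formula are all assumptions made for illustration. The only detail taken from the message is that updates arriving less than 200ms after the previous one are ignored.

```python
UPDATE_INTERVAL_MS = 200  # minimum spacing between updates, as described above


class BandwidthEstimator:
    """Toy rate-limited estimator (hypothetical, no smoothing)."""

    def __init__(self):
        self.bw = 0.0            # estimated pages written per millisecond
        self.last_time = None    # time of the last accepted update
        self.last_written = 0    # cumulative pages written at that time

    def maybe_update(self, now_ms, written_pages):
        """Update the estimate unless the previous update was too recent."""
        if self.last_time is None:
            # First call only records a baseline.
            self.last_time, self.last_written = now_ms, written_pages
            return False
        elapsed = now_ms - self.last_time
        if elapsed < UPDATE_INTERVAL_MS:
            return False  # more frequent calls are just ignored
        self.bw = (written_pages - self.last_written) / elapsed
        self.last_time, self.last_written = now_ms, written_pages
        return True


def written_at(t_ms):
    """Assumed workload: the device writes 10 pages/ms from t=0 to t=300ms."""
    return 10 * min(max(t_ms, 0), 300)


def run(call_times_ms):
    """Feed the estimator at the given times and return the final estimate."""
    est = BandwidthEstimator()
    for t in call_times_ms:
        est.maybe_update(t, written_at(t))
    return est.bw


# Pages are dirtied only during a 50ms spike, so updates stop at t=50.
bursty = run([0, 10, 20, 30, 40, 50])
# Same spike, plus one extra update once the IO has finished.
with_completion_update = run([0, 10, 20, 30, 40, 50, 300])

if __name__ == "__main__":
    print("spike only:           ", bursty)
    print("with completion update:", with_completion_update)
```

In this model the spike-only run never gets past the 200ms gate, so the estimate stays at its initial value even though the device wrote 3000 pages. A single update after the IO finishes is enough to recover the actual rate.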

I think the fix for this can be relatively simple. We just need to make
sure we update writeback bandwidth reasonably quickly after the IO
finishes. I'll write a patch and see if it helps.

Thank you! Please keep us posted.
