Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)

From: Wu Fengguang <hidden>
Date: 2011-03-25 13:44:11
Also in: linux-fsdevel

Hi Jan,

On Wed, Mar 23, 2011 at 05:43:14AM +0800, Jan Kara wrote:

  Hello Fengguang,

On Fri 18-03-11 22:30:01, Wu Fengguang wrote:

quoted

On Wed, Mar 09, 2011 at 06:31:10AM +0800, Jan Kara wrote:

quoted

  Hello,

  I'm posting second version of my IO-less balance_dirty_pages() patches. This
is alternative approach to Fengguang's patches - much simpler I believe (only
300 lines added) - but obviously I does not provide so sophisticated control.

Well, it may be too early to claim "simplicity" as an advantage, until
you achieve the following performance/feature comparability (most of
them are not optional ones). AFAICS this work is kind of heavy lifting
that will consume a lot of time and attention. You'd better find some
more fundamental needs before go on the reworking.

(1)  latency
(2)  fairness
(3)  smoothness
(4)  scalability
(5)  per-task IO controller
(6)  per-cgroup IO controller (TBD)
(7)  free combinations of per-task/per-cgroup and bandwidth/priority controllers
(8)  think time compensation
(9)  backed by both theory and tests
(10) adapt pause time up on 100+ dirtiers
(11) adapt pause time down on low dirty pages 
(12) adapt to new dirty threshold/goal
(13) safeguard against dirty exceeding
(14) safeguard against device queue underflow

  I think this is a misunderstanding of my goals ;). My main goal is to
explore, how far we can get with a relatively simple approach to IO-less
balance_dirty_pages(). I guess what I have is better than the current
balance_dirty_pages() but it sure does not even try to provide all the
features you try to provide.

OK.

I'm thinking about tweaking ratelimiting logic to reduce latencies in some
tests, possibly add compensation when we waited for too long in
balance_dirty_pages() (e.g. because of bumpy IO completion) but that's
about it...

Basically I do this so that we can compare and decide whether what my
simple approach offers is OK or whether we want some more complex solution
like your patches...

Yeah, now both results are on the website. Let's see whether they are
acceptable for others.

quoted

The basic idea (implemented in the third patch) is that processes throttled
in balance_dirty_pages() wait for enough IO to complete. The waiting is
implemented as follows: Whenever we decide to throttle a task in
balance_dirty_pages(), task adds itself to a list of tasks that are throttled
against that bdi and goes to sleep waiting to receive specified amount of page
IO completions. Once in a while (currently HZ/10, in patch 5 the interval is
autotuned based on observed IO rate), accumulated page IO completions are
distributed equally among waiting tasks.

This waiting scheme has been chosen so that waiting time in
balance_dirty_pages() is proportional to
  number_waited_pages * number_of_waiters.
In particular it does not depend on the total number of pages being waited for,
thus providing possibly a fairer results.

When there comes no IO completion in 1 second (normal in NFS), the
tasks will all get stuck. It is fixable based on your v2 code base
(detailed below), however will likely bring the same level of
complexity as the base bandwidth solution.

  I have some plans how to account for bumpy IO completion when we wait for
a long time and then get completion of much more IO than we actually need.
But in case where processes use all the bandwidth and the latency of the
device is high, sure they take the penalty and have to wait for a long time
in balance_dirty_pages().

No, I don't think it's good to block for long time in
balance_dirty_pages(). This seems to be our biggest branch point.

quoted

As for v2, there are still big gap to fill. NFS dirtiers are
constantly doing 20-25 seconds long delays

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/jan-bdp-v2b/3G/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc8-jan-bdp+-2011-03-10-16-31/balance_dirty_pages-pause.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/jan-bdp-v2b/3G/nfs-10dd-1M-8p-2945M-20%25-2.6.38-rc8-jan-bdp+-2011-03-10-16-38/balance_dirty_pages-pause.png

  Yeah, this is because they want lots of pages each
(3/2*MAX_WRITEBACK_PAGES). I'll try to change ratelimiting to make several
shorter sleeps. But ultimately you have to wait this much. Just you can
split those big sleeps in more of smaller ones.

Ideally I prefer less than 100ms sleep time for achieving smooth
responsiveness and IO pipeline.

quoted

and the tasks are bumping forwards

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/jan-bdp-v2b/3G/nfs-10dd-1M-8p-2945M-20%25-2.6.38-rc8-jan-bdp+-2011-03-10-16-38/balance_dirty_pages-task-bw.png

  Yeah, that's a result of bumpy NFS writeout and basically the consequence
of the above. Maybe it can be helped but I don't find this to be a problem on
its own...

ditto.

quoted

Since last version I've implemented cleanups as suggested by Peter Zilstra.
The patches undergone more throughout testing. So far I've tested different
filesystems (ext2, ext3, ext4, xfs, nfs), also a combination of a local
filesystem and nfs. The load was either various number of dd threads or
fio with several threads each dirtying pages at different speed.

Results and test scripts can be found at
  http://beta.suse.com/private/jack/balance_dirty_pages-v2/
See README file for some explanation of test framework, tests, and graphs.
Except for ext3 in data=ordered mode, where kjournald creates high
fluctuations in waiting time of throttled processes (and also high latencies),
the results look OK. Parallel dd threads are being throttled in the same way
(in a 2s window threads spend the same time waiting) and also latencies of
individual waits seem OK - except for ext3 they fit in 100 ms for local
filesystems. They are in 200-500 ms range for NFS, which isn't that nice but
to fix that we'd have to modify current ratelimiting scheme to take into
account on which bdi a page is dirtied. Then we could ratelimit slower BDIs
more often thus reducing latencies in individual waits...

Yes the per-cpu rate limit is a problem, so I'm switching to per-task
rate limit.

  BTW: Have you considered per-bdi ratelimiting? Both per-task and per-bdi
make sense just they are going to have slightly different properties...
Current per-cpu ratelimit counters tend to behave like per-task
ratelimiting at least for fast dirtiers because once a task is blocked in
balance_dirty_pages() another task runs on that cpu and uses the counter
for itself. So I wouldn't expect big differences from per-task
ratelimiting...

Good point. It should be enough to have per-bdi rate limiting threshold.
However I still need to keep per-task nr_dirtied for doing per-task
throttling bandwidth. The per-cpu bdp_ratelimits mix dirtied pages
from different tasks and bdis, which is unacceptable for me.

quoted

The direct input from IO completion is another issue. It leaves the
dirty tasks at the mercy of low layer (VFS/FS/bdev) fluctuations and
latencies. So I'm introducing the base bandwidth as a buffer layer.
You may employ the similar technique: to simulate a more smooth flow
of IO completion events based on the average write bandwidth. Then it
naturally introduce the problem of rate mismatch between
simulated/real IO completions, and the need to do more elaborated
position control.

  Exacttly, that's why I don't want to base throttling on some computed
value (well, I also somehow estimate necessary sleep time but that's more a
performance optimization) but rather leave tasks "at the mercy of lower
layers" as you write ;) I don't think it's necessarily a bad thing.

Again, the same branch point :)

quoted

The results for different bandwidths fio load is interesting. There are 8
threads dirtying pages at 1,2,4,..,128 MB/s rate. Due to different task
bdi dirty limits, what happens is that three most aggresive tasks get
throttled so they end up at bandwidths 24, 26, and 30 MB/s and the lighter
dirtiers run unthrottled.

The base bandwidth based throttling can do better and provide almost
perfect fairness, because all tasks writing to one bdi derive their
own throttle bandwidth based on the same per-bdi base bandwidth. So
the heavier dirtiers will converge to equal dirty rate and weight.

  So what do you consider a perfect fairness in this case and are you sure
it is desirable? I was thinking about this and I'm not sure...

Perfect fairness could be 1, 2, 4, 8, N, N, N MB/s, where

        N = (write_bandwidth - 1 - 2 - 4 - 8) / 3.

I guess its usefulness is largely depending on the user space
applications.  Most of them should not be sensible to it.

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help