Re: [RFC] writeback and cgroup

From: Fengguang Wu <hidden>
Date: 2012-04-06 09:59:34
Also in: linux-fsdevel, linux-mm, lkml

Possibly related (same subject, not in this thread)

2012-04-25 · Re: [RFC] writeback and cgroup · Tejun Heo <tj@kernel.org>
2012-04-25 · Re: [RFC] writeback and cgroup · Fengguang Wu <hidden>
2012-04-25 · Re: [RFC] writeback and cgroup · Jan Kara <hidden>
2012-04-25 · Re: [RFC] writeback and cgroup · Fengguang Wu <hidden>
2012-04-25 · Re: [RFC] writeback and cgroup · Fengguang Wu <hidden>

Hi Tejun,

On Wed, Apr 04, 2012 at 12:33:55PM -0700, Tejun Heo wrote:

Hey, Fengguang.

On Wed, Apr 04, 2012 at 10:51:24AM -0700, Fengguang Wu wrote:

quoted

Yeah it should be trivial to apply the balance_dirty_pages()
throttling algorithm to the read/direct IOs. However up to now I don't
see much added value to *duplicate* the current block IO controller
functionalities, assuming the current users and developers are happy
with it.

Heh, trust me.  It's half broken and people ain't happy.  I get that

Yeah, although the balance_dirty_pages() IO controller for buffered
writes looks perfect in itself, it's not enough to meet user demands.

The user expectation should be: hey, please throttle *all* IOs from
this cgroup to this amount, either in absolute bps/iops limits or in
some proportional weight value (or both, whatever the lower takes
effect).  And if necessary, he may request further limits/weights for
each type of IO inside the cgroup.

Now the blkio cgroup supports direct IO and the balance_dirty_pages()
IO controller supports buffered writes. They are providing
limits/weights for either direct IO or buffered writes, which is fine
if it's pure direct IO or pure buffered write. For the common mixed
IO workloads, it's obviously not enough.

Fortunately, the above gap can be easily filled judging from the
block/cfq IO controller code. By adding some direct IO accounting
and changing several lines of my patches to make use of the collected
stats, the semantics of the blkio.throttle.write_bps interfaces can be
changed from "limit for direct IO" to "limit for direct+buffered IOs".
Ditto for blkio.weight and blkio.write_iops, as long as some
iops/device time stats are made available to balance_dirty_pages().

It would be a fairly *easy* change. :-) It's merely adding some
accounting code and there is no need to change the block IO
controlling algorithm at all. I'll do the work of accounting (which
is basically independent of the IO controlling) and use the new stats
in balance_dirty_pages().

The only problem I can see now, is that balance_dirty_pages() works
per-bdi and blkcg works per-device. So the two ends may not match
nicely if the user configures lv0 on sda+sdb and lv1 on sdb+sdc where
sdb is shared by lv0 and lv1. However it should be rare situations and
be much more acceptable than the problems arise from the "push back"
approach which impacts everyone.

your algorithm can be updatd to consider all IOs and I believe that
but what I don't get is how would such information get to writeback
and in turn how writeback would enforce the result on reads and direct
IOs.  Through what path?  Will all reads and direct IOs travel through
balance_dirty_pages() even direct IOs on raw block devices?  Or would
the writeback algorithm take the configuration from cfq, apply the
algorithm and give back the limits to enforce to cfq?  If the latter,
isn't that at least somewhat messed up?

cfq is working well and don't need any modifications. Let's just make
balance_dirty_pages() cgroup aware and fill the gap of the current
block IO controller.

If the balance_dirty_pages() throttling algorithms will ever be
applied to read and direct IOs, it would be for NFS, CIFS etc. Even
for them, there may be better throttling choices. For example, Trond
mentioned the RPC layer to me during the summit.

quoted

I did the buffered write IO controller mainly to fill the gap.  If I
happen to stand in your way, sorry that's not my initial intention.

No, no, it's not about standing in my way.  As Vivek said in the other
reply, it's that the "gap" that you filled was created *because*
writeback wasn't cgroup aware and now you're in turn filling that gap
by making writeback work around that "gap".  I mean, my mind boggles.
Doesn't yours?  I strongly believe everyone's should.

Heh. It's a hard problem indeed. I felt great pains in the IO-less
dirty throttling work. I did a lot reasoning about it, and have in
fact kept cgroup IO controller in mind since its early days. Now I'd
say it's hands down for it to adapt to the gap between the total IO
limit and what's carried out by the block IO controller.

quoted

It's a pity and surprise that Google as a big user does not buy in
this simple solution. You may prefer more comprehensive controls which
may not be easily achievable with the simple scheme. However the
complexities and overheads involved in throttling the flusher IOs
really upsets me.

Heh, believe it or not, I'm not really wearing google hat on this
subject and google's writeback people may have completely different
opinions on the subject than mine.  In fact, I'm not even sure how
much "work" time I'll be able to assign to this.  :(

OK, understand.

quoted

The sweet split point would be for balance_dirty_pages() to do cgroup
aware buffered write throttling and leave other IOs to the current
blkcg. For this to work well as a total solution for end users, I hope
we can cooperate and figure out ways for the two throttling entities
to work well with each other.

There's where I'm confused.  How is the said split supposed to work?
They aren't independent.  I mean, who gets to decide what and where
are those decisions enforced?

Yeah it's not independent. It's about

- keep block IO cgroup untouched (in its current algorithm, for
  throttling direct IO)

- let balance_dirty_pages() adapt to the throttling target
  
        buffered_write_limit = total_limit - direct_IOs

quoted

What I'm interested is, what's Google and other users' use schemes in
practice. What's their desired interfaces. Whether and how the
combined bdp+blkcg throttling can fulfill the goals.

I'm not too privy of mm and writeback in google and even if so I
probably shouldn't talk too much about it.  Confidentiality and all.
That said, I have the general feeling that goog already figured out
how to at least work around the existing implementation and would be
able to continue no matter how upstream development fans out.

That said, wearing the cgroup maintainer and general kernel
contributor hat, I'd really like to avoid another design mess up.

To me it looks a pretty clean split and find it to be an easy
solution (after sorting it out the hard way). I'll show the code and
test results after some time.

quoted

Let's please keep the layering clear.  IO limitations will be applied
at the block layer and pressure will be formed there and then
propagated upwards eventually to the originator.  Sure, exposing the
whole information might result in better behavior for certain
workloads, but down the road, say, in three or five years, devices
which can be shared without worrying too much about seeks might be
commonplace and we could be swearing at a disgusting structural mess,
and sadly various cgroup support seems to be a prominent source of
such design failures.

Super fast storages are coming which will make us regret to make the
IO path over complex.  Spinning disks are not going away anytime soon.
I doubt Google is willing to afford the disk seek costs on its
millions of disks and has the patience to wait until switching all of
the spin disks to SSD years later (if it will ever happen).

This is new.  Let's keep the damn employer out of the discussion.
While the area I work on is affected by my employment (writeback isn't
even my area BTW), I'm not gonna do something adverse to upstream even
if it's beneficial to google and I'm much more likely to do something
which may hurt google a bit if it's gonna benefit upstream.

As for the faster / newer storage argument, that is *exactly* why we
want to keep the layering proper.  Writeback works from the pressure
from the IO stack.  If IO technology changes, we update the IO stack
and writeback still works from the pressure.  It may need to be
adjusted but the principles don't change.

To me, balance_dirty_pages() is *the* proper layer for buffered writes.
It's always there doing 1:1 proportional throttling. Then you try to
kick in to add *double* throttling in block/cfq layer. Now the low
layer may enforce 10:1 throttling and push balance_dirty_pages() away
from its balanced state, leading to large fluctuations and program
stalls.  This can be avoided by telling balance_dirty_pages(): "your
balance goal is no longer 1:1, but 10:1". With this information
balance_dirty_pages() will behave right. Then there is the question:
if balance_dirty_pages() will work just well provided the information,
why bother doing the throttling at low layer and "push back" the
pressure all the way up?

quoted

It's obvious that your below proposal involves a lot of complexities,
overheads, and will hurt performance. It basically involves

Hmmm... that's not the impression I got from the discussion.
According to Jan, applying the current writeback logic to cgroup'fied
bdi shouldn't be too complex, no?

In the sense of "avoidable" complexity :-)

quoted

- running concurrent flusher threads for cgroups, which adds back the
  disk seeks and lock contentions. And still has problems with sync
  and shared inodes.

I agree this is an actual concern but if the user wants to split one
spindle to multiple resource domains, there's gonna be considerable
amount of overhead no matter what.  If you want to improve how block
layer handles the split, you're welcome to dive into the block layer,
where the split is made, and improve it.

quoted

- splitting device queue for cgroups, possibly scaling up the pool of
  writeback pages (and locked pages in the case of stable pages) which
  could stall random processes in the system

Sure, it'll take up more buffering and memory but that's the overhead
of the cgroup business.  I want it to be less intrusive at the cost of
somewhat more resource consumption.  ie. I don't want writeback logic
itself deeply involved in block IO cgroup enforcement even if that
means somewhat less efficient resource usage.

The balance_dirty_pages() is already deeply involved in dirty throttling.
As you can see from this patchset, the same algorithms can be extended
trivially to work with cgroup IO limits.

buffered write IO controller in balance_dirty_pages()
https://lkml.org/lkml/2012/3/28/275

It does not require forking off the flusher threads and splitting up
the IO queue at all.

quoted

- the mess of metadata handling

Does throttling from writeback actually solve this problem?  What
about fsync()?  Does that already go through balance_dirty_pages()?

balance_dirty_pages() does throttling at safe points outside of fs
transactions/locks.

fsync() only submits IO for already dirtied pages and won't be
throttled by balance_dirty_pages(). Throttling happens at earlier
times when the task is dirtying the pages.

quoted

- unnecessarily coupled with memcg, in order to take advantage of the
  per-memcg dirty limits for balance_dirty_pages() to actually convert
  the "pushed back" dirty pages pressure into lowered dirty rate. Why
  the hell the users *have to* setup memcg (suffering from all the
  inconvenience and overheads) in order to do IO throttling?  Please,
  this is really ugly! And the "back pressure" may constantly push the
  memcg dirty pages to the limits. I'm not going to support *miss use*
  of per-memcg dirty limits like this!

Writeback sits between blkcg and memcg and it indeed can be hairy to
consider both sides especially given the current sorry complex state
of cgroup and I can see why it would seem tempting to add a separate
controller or at least knobs to support that.  That said, I *think*
given that memcg controls all other memory parameters it probably
would make most sense giving that parameter to memcg too.  I don't
think this is really relevant to this discussion tho.  Who owns
dirty_limits is a separate issue.

In the "back pressure" scheme, memcg is a must because only it has all
the infrastructure to track dirty pages upon which you can apply some
dirty_limits. Don't tell me you want to account dirty pages in blkcg...

quoted

I cannot believe you would keep overlooking all the problems without
good reasons. Please do tell us the reasons that matter.

Well, I tried and I hope some of it got through.  I also wrote a lot
of questions, mainly regarding how what you have in mind is supposed
to work through what path.  Maybe I'm just not seeing what you're
seeing but I just can't see where all the IOs would go through and
come together.  Can you please elaborate more on that?

What I can see is, it looks pretty simple and nature to let
balance_dirty_pages() fill the gap towards a total solution :-)

- add direct IO accounting in some convenient point of the IO path
  IO submission or completion point, either is fine.

- change several lines of the buffered write IO controller to
  integrate the direct IO rate into the formula to fit the "total
  IO" limit

- in future, add more accounting as well as feedback control to make
  balance_dirty_pages() work with IOPS and disk time

Thanks,
Fengguang

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help