Re: [RFC] writeback and cgroup
From: Fengguang Wu <hidden>
Date: 2012-04-18 07:58:14
Also in:
linux-fsdevel, linux-mm, lkml
Possibly related (same subject, not in this thread)
- 2012-04-25 · Re: [RFC] writeback and cgroup · Tejun Heo <tj@kernel.org>
- 2012-04-25 · Re: [RFC] writeback and cgroup · Fengguang Wu <hidden>
- 2012-04-25 · Re: [RFC] writeback and cgroup · Jan Kara <hidden>
- 2012-04-25 · Re: [RFC] writeback and cgroup · Fengguang Wu <hidden>
- 2012-04-25 · Re: [RFC] writeback and cgroup · Fengguang Wu <hidden>
On Wed, Apr 18, 2012 at 08:57:20AM +0200, Jan Kara wrote:
On Fri 06-04-12 02:59:34, Wu Fengguang wrote: ...quoted
quoted
quoted
quoted
Let's please keep the layering clear. IO limitations will be applied at the block layer and pressure will be formed there and then propagated upwards eventually to the originator. Sure, exposing the whole information might result in better behavior for certain workloads, but down the road, say, in three or five years, devices which can be shared without worrying too much about seeks might be commonplace and we could be swearing at a disgusting structural mess, and sadly various cgroup support seems to be a prominent source of such design failures.Super fast storages are coming which will make us regret to make the IO path over complex. Spinning disks are not going away anytime soon. I doubt Google is willing to afford the disk seek costs on its millions of disks and has the patience to wait until switching all of the spin disks to SSD years later (if it will ever happen).This is new. Let's keep the damn employer out of the discussion. While the area I work on is affected by my employment (writeback isn't even my area BTW), I'm not gonna do something adverse to upstream even if it's beneficial to google and I'm much more likely to do something which may hurt google a bit if it's gonna benefit upstream. As for the faster / newer storage argument, that is *exactly* why we want to keep the layering proper. Writeback works from the pressure from the IO stack. If IO technology changes, we update the IO stack and writeback still works from the pressure. It may need to be adjusted but the principles don't change.To me, balance_dirty_pages() is *the* proper layer for buffered writes. It's always there doing 1:1 proportional throttling. Then you try to kick in to add *double* throttling in block/cfq layer. Now the low layer may enforce 10:1 throttling and push balance_dirty_pages() away from its balanced state, leading to large fluctuations and program stalls. This can be avoided by telling balance_dirty_pages(): "your balance goal is no longer 1:1, but 10:1". With this information balance_dirty_pages() will behave right. Then there is the question: if balance_dirty_pages() will work just well provided the information, why bother doing the throttling at low layer and "push back" the pressure all the way up?Fengguang, maybe we should first agree on some basics: The two main goals of balance_dirty_pages() are (and always have been AFAIK) to limit amount of dirty pages in memory and keep enough dirty pages in memory to allow for efficient writeback. Secondary goals are to also keep amount of dirty pages somewhat fair among bdis and processes. Agreed?
Agreed. In fact, before the IO-less change, balance_dirty_pages() had no much explicit control over the dirty rate and fairness.
Thus shift to trying to control *IO throughput* (or even just buffered write throughput) from balance_dirty_pages() is a fundamental shift in the goals of balance_dirty_pages(), not just some tweak (although technically, it might be relatively easy to do for buffered writes given the current implementation).
Yes, it has been a bit shift to the rate based dirty control.
...quoted
quoted
Well, I tried and I hope some of it got through. I also wrote a lot of questions, mainly regarding how what you have in mind is supposed to work through what path. Maybe I'm just not seeing what you're seeing but I just can't see where all the IOs would go through and come together. Can you please elaborate more on that?What I can see is, it looks pretty simple and nature to let balance_dirty_pages() fill the gap towards a total solution :-) - add direct IO accounting in some convenient point of the IO path IO submission or completion point, either is fine. - change several lines of the buffered write IO controller to integrate the direct IO rate into the formula to fit the "total IO" limit - in future, add more accounting as well as feedback control to make balance_dirty_pages() work with IOPS and disk timeSorry Fengguang but I also think this is a wrong way to go. balance_dirty_pages() must primarily control the amount of dirty pages. Trying to bend it to control IO throughput by including direct IO and reads in the accounting will just make the logic even more complex than it already is.
Right, I have been adding too much complexity to balance_dirty_pages(). The control algorithms are pretty hard to understand and get right for all cases. OK, I'll post results of my experiments up to now, answer some questions and take a comfortable break. Phooo.. Thanks, Fengguang