Re: [PATCH RFC] mm: Implement balance_dirty_pages() through waiting for flusher thread
From: Jan Kara <jack@suse.cz>
Date: 2010-06-22 11:20:15
Also in:
linux-fsdevel
On Tue 22-06-10 15:44:09, Dave Chinner wrote:
On Tue, Jun 22, 2010 at 01:36:29AM +0200, Jan Kara wrote:quoted
On Fri 18-06-10 16:09:01, Dave Chinner wrote:quoted
On Thu, Jun 17, 2010 at 08:04:38PM +0200, Jan Kara wrote:quoted
This patch changes balance_dirty_pages() throttling so that the function does not submit writes on its own but rather waits for flusher thread to do enough writes. This has an advantage that we have a single source of IO allowing for better writeback locality. Also we do not have to reenter filesystems from a non-trivial context. The waiting is implemented as follows: Each BDI has a FIFO of wait requests. When balance_dirty_pages() finds it needs to throttle the writer, it adds a request for writing write_chunk of pages to the FIFO and waits until the request is fulfilled or we drop below dirty limits. A flusher thread tracks amount of pages it has written and when enough pages are written since the first request in the FIFO became first, it wakes up the waiter and removes request from the FIFO (thus another request becomes first and flusher thread starts writing out on behalf of this request). CC: hch@infradead.org CC: akpm@linux-foundation.org CC: peterz@infradead.org CC: wfg@mail.ustc.edu.cn Signed-off-by: Jan Kara <jack@suse.cz>
...
quoted
quoted
quoted
+ bdi->wb_written_head = bdi_stat(bdi, BDI_WRITTEN) + wc->written;The resolution of the percpu counters is an issue here, I think. percpu counters update in batches of 32 counts per CPU. wc->written is going to have a value of roughly 8 or 32 depending on whether bdi->dirty_exceeded is set or not. I note that you take this into account when checking dirty threshold limits, but it doesn't appear to be taken in to here.Hmm, are you sure about the number of pages? I think that the ratelimits you speak about influence only how often we *check* the limits. Looking at sync_writeback_pages() I see: static inline long sync_writeback_pages(unsigned long dirtied) { if (dirtied < ratelimit_pages) dirtied = ratelimit_pages; return dirtied + dirtied / 2; }It's declared as: 39 /* 40 * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited 41 * will look to see if it needs to force writeback or throttling. 42 */ 43 static long ratelimit_pages = 32;
Yeah, I know, it's a bit confusing...
quoted
So it returns at least ratelimit_pages * 3/2. Now ratelimit_pages is computed in writeback_set_ratelimit() and for most machines I expect the math there to end up with ratelimit_pages == 1024. So we enter balance_dirty_pages with number 1536...But I missed this. That means we typically will check every 1024 pages until dirty_exceeded is set, then we'll check every 8 pages.
Yes, with dirty_exceeded we check every 8 pages but still require 1536 pages written.
quoted
That being said, the error of percpu counters can still be significant - with 16 CPUs doing completion the average error is 384 (on your max_cpus == 512 machine - our distro kernels have that too).The error gets larger as the number of CPUs goes up - Peter pointed that out because: #define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids))) So for a system reporting nr_cpu_ids = 32 (as my 8 core nehalem server currently reports), the error could be 16 * 48 = 768.quoted
quoted
That means for ratelimits of 8 pages on writes and only one CPU doing IO completion wakeups are only going to occur in completion batches of 32 pages. we'll end up with something like: waiter, written count at head count when woken A, 8 n n + 32 B, 8 n + 32 n + 64 C, 8 n + 64 n + 96 ..... And so on. This isn't necessarily bad - we'll throttle for longer than we strictly need to - but the cumulative counter resolution error gets worse as the number of CPUs doing IO completion grows. Worst case ends up at for (num cpus * 31) + 1 pages of writeback for just the first waiter. For an arbitrary FIFO queue of depth d, the worst case is more like d * (num cpus * 31 + 1).Hmm, I don't see how the error would depend on the FIFO depth.It's the cumulative error that depends on the FIFO depth, not the error seen by a single waiter.quoted
I'd rather think that the number of written pages after which a thread is woken is a random variable equally distributed in a range <wanted_nr - max_counter_err, wanted_nr + max_counter_err>.Right, that the error range seenby a single waiter. i.e. between bdi->wb_written_head = bdi_stat(written) + wc->pages_to_wait_for; If we only want to wait for, say, 10 pages from a counter value of N, the per-cpu counter increments in batches of BDI_STAT_BATCH. Hence the first time we see bdi_stat(written) pass bdi->wb_written_head will be when it reaches (N + BDI_STAT_BATCH). We will have waited (BDI_STAT_BATCH - 10) pages more than we needed to. i.e the wait time is a minimum of roundup(pages_to_wait_for, BDI_STAT_BATCH).quoted
And over time (or with more threads in the queue) these errors cancel out... Now as I wrote above the error for single waiter might be just too large to accept but I just wanted to argue here that I don't see how it would be getting worse with more waiters.That's where I think you go wrong - the error occurs when we remove the waiter at the head of the queue, so each subsequent waiter has a resolution error introduced. These will continue to aggregate as long as there are more waiters on the queue. e.g if we've got 3 waiters of 10 pages each, they should all be woken after 30 pages have been cleared. Instead, the first is only woken after the counter ticks (BDI_STAT_BATCH), then the second is queued, and it doesn't get woken for another counter tick. Similarly for the third waiter. When the third waiter is woken, it has waited for 3 * BDI_STAT_BATCH pages. In reality, what we realy wanted was all three waiters to be woken by the time 30 pages had been written back. i.e. the last waiter waited for (3 * (BDI_STAT_BATCH - 10)), which in general terms gives a wait overshoot for any given waiter of roughly (queue depth when queued * BDI_STAT_BATCH * nr_cpu_ids) pages.
Ah, you are right... Thanks for the correction.
However, if we calculate the wakeup time when we queue, each waiter ends up with a wakeup time of N + 10, N + 20, and N + 30, and the counter tick resolution is taken out of the picture. i.e. if the tick is 8, then the first wakeup occurs after 2 counter ticks, the second is woken after 3 counter ticks and the last is woken after 4 counter ticks. If the counter tick is >= 32, then all three will be woken on the first counter tick. In this case, the worst case wait overshoot for any waiter regardless of queue depth is (BDI_STAT_BATCH * nr_cpu_ids) pages.
Yes.
quoted
if a system gets below dirty limits for a short while so that only a few threads notice is an arguable reason), you would need to recompute them or live with waiting unnecessarily long.I'd suggest that we shouldn't be waking waiters prematurely if we've dropped below the dirty threshold - we need some hysteresis in the system to prevent oscillating rapidly around the threshold. If we are not issuing IO, then if we don't damp the wakeup frequency we could easily thrash around the threshold.
The current logic is that we do not wake up threads when the system gets below limit. But threads wake up after a certain time passes, recheck dirty limits, and quit waiting if limits are not exceeded. This is quite similar to the old behavior and should provide a reasonably smooth start.
I'd prefer to strip down the mechanism to the simplest, most accurate form before trying to add "interesting" tweaks like faster wakeup semantics at dirty thresholds....
I am a bit reluctant to not stop waiting when the system is below dirty limits for two reasons: a) it's a bit more unfair because the thread has to wait longer (until enough is written) while other threads happily perform writes because the system isn't above dirty limits anymore. b) we require 1536 pages written regardless of how many pages were dirtied. Now in theory, if there are say 8 threads writing, the threads would require 12288 pages written in total. But when a system has just 32768 pages, the dirty limit is set at 6553 pages so there even cannot be that many dirty pages in the system. Of course, we could trim down the number of pages we wait for when we aren't doing actual IO with new balance_dirty_pages() (which would reduce scope of the above problem). But still we shouldn't set it too low because all that checking, sleeping, and waking costs something so longer sleeps are beneficial from this POV. So all in all I would prefer to keep some 'backup' waking mechanism in place... Honza -- Jan Kara [off-list ref] SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>