Thread (15 messages) 15 messages, 5 authors, 2009-09-01

Re: [PATCH, RFC] vm: Add an tuning knob for vm.max_writeback_pages

From: Christoph Hellwig <hch@infradead.org>
Date: 2009-08-30 22:27:10
Also in: linux-fsdevel, linux-mm

On Sun, Aug 30, 2009 at 02:17:31PM -0400, Theodore Tso wrote:
quoted
The current writeback sizes are defintively too small, we shoved in
a hack into XFS to bump up nr_to_write to four times the value the
VM sends us to be able to saturate medium sized RAID arrays in XFS.
Hmm, should we make it be a per-superblock tunable so that it can
either be tuned on a per-block device basis or the filesystem code can
adjust it to their liking?  I thought about it, but decided maybe it
was better to keeping it simple.
I'm don't think tuning it on a per-filesystem basis is a good idea,
we had to resort to this for 2.6.30 as a quick hack, and we will ged
rid of it again in 2.6.31 one way or another.  I personally think we
should fight this cancer of per-filesystem hacks in the writeback code
as much as we can.  Right now people keep adding tuning hacks for
specific workloads there, and at least all the modern filesystems (ext4,
btrfs and XFS) have very similar requirements to the writeback code,
that is give the filesystem as much as possible to write at the same
time to do intelligent decisions based on that.  The VM writeback code
fails horribly at that right now.
quoted
Turns out this was not enough and at least for Chris Masons array
we only started seaturating at * 16.  I suspect you patch will give
a similar effect.
So you think 16384 would be a better default?  The reason why I picked
32768 was because that was the size of the ext4 block group, but it
was otherwise it was totally arbitrary.  I haven't done any
benchmarking yet, which is one of the reasons why I thought about
making it a tunable.
It was just another arbitrary number.  My suspicion is that the exact
number does not matter, it just needs to be much much larger than the
current one.  And the latency concerns are also over-rated as the block
layer will tell us that we are congested if we push too much into a
queue.
quoted
And the other big question is how this interacts with Jens' new per-bdi
flushing code that we still hope to merge in 2.6.32.
Jens?  What do you think?  Fixing MAX_WRITEBACK_PAGES was something I
really wanted to merge in 2.6.32 since it makes a huge difference for
the block allocation layout for a "rsync -avH /old-fs /new-fs" when we
are copying bunch of large files (say, 800 meg iso images) and so the
fact that the writeback routine is writing out 4 megs at a time, means
that our files get horribly interleaved and thus get fragmented.

I initially thought about adding some massive workarounds in the
filesystem layer (which is I guess what XFS did),
XFS is relatively good at doing the disk block layout even with smaller
writeouts, so we do not have that fragmentation hack.  And the
for-2.6.30 big hack is one liner:
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -1268,6 +1268,14 @@ xfs_vm_writepage(
 	if (!page_has_buffers(page))
 		create_empty_buffers(page, 1 << inode->i_blkbits, 0);
 
+
+	/*
+	 *  VM calculation for nr_to_write seems off.  Bump it way
+	 *  up, this gets simple streaming writes zippy again.
+	 *  To be reviewed again after Jens' writeback changes.
+	 */
+	wbc->nr_to_write *= 4;
+
 	/*
 	 * Convert delayed allocate, unwritten or unmapped space
 	 * to real space and flush out to disk.
This also went past -fsdevel, but I didn' manage to get the flames for
adding these hacks that I hoped to get ;-)

My stance is to wait for this until about -rc2 at which points Jens'
code is hopefully in and we can start doing all the fine-tuning,
including lots of benchmarking.

Btw, one thing I would really see from one of the big companies or
the LF is doing benchmarks like yours above or just a simple one or
two stream dd on some big machines weekly so we can immediately see
regressions once someones starts to tweak the VM again.  Preferably
including seekwatcher data.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help