Re: ext4 out of order when use cfq scheduler

From: Jan Kara <jack@suse.cz>
Date: 2016-03-15 20:09:30

On Tue 15-03-16 15:46:33, Jan Kara wrote:

On Tue 15-03-16 11:46:34, Jan Kara wrote:

On Mon 14-03-16 10:36:35, Ted Tso wrote:

On Mon, Mar 14, 2016 at 08:39:28AM +0100, Jan Kara wrote:

No, that won't be enough. blkdev_issue_flush() is not guaranteed to do
anything to IOs which have not reported completion before
blkdev_issue_flush() was called. Specifically, CFQ will queue a submitted
bio in its internal RB tree; the following flush request completely bypasses
this tree and goes directly to the disk, where it flushes caches. Only later
does CFQ decide to schedule the async writeback from the flusher thread,
which is still queued in the RB tree...
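
The ordering problem described above can be sketched with a toy model. This
is not kernel code: ToyDisk, ToyScheduler and both run_* helpers are
hypothetical names, and the model assumes only what the paragraph states,
namely that writes sit in a scheduler queue while a flush goes straight to
the disk.

```python
from collections import deque


class ToyDisk:
    """Disk with a volatile write cache in front of stable media."""

    def __init__(self):
        self.cache = []   # writes received but not yet durable
        self.stable = []  # writes that would survive power loss

    def write(self, block):
        self.cache.append(block)

    def flush(self):
        # A cache flush can only persist what already reached the disk.
        self.stable.extend(self.cache)
        self.cache.clear()


class ToyScheduler:
    """Queues submitted writes internally; flushes bypass the queue."""

    def __init__(self, disk):
        self.disk = disk
        self.queued = deque()

    def submit_write(self, block):
        self.queued.append(block)  # held until the scheduler dispatches it

    def submit_flush(self):
        self.disk.flush()          # goes directly to the disk

    def dispatch_all(self):
        while self.queued:
            self.disk.write(self.queued.popleft())


def run_flush_before_dispatch():
    """Flush is issued while the write is still queued in the scheduler."""
    disk = ToyDisk()
    sched = ToyScheduler(disk)
    sched.submit_write("data-block")
    sched.submit_flush()
    durable_after_flush = list(disk.stable)
    sched.dispatch_all()
    return durable_after_flush, list(disk.cache)


def run_wait_then_flush():
    """The write reaches the disk first (completion waited for), then flush."""
    disk = ToyDisk()
    sched = ToyScheduler(disk)
    sched.submit_write("data-block")
    sched.dispatch_all()
    sched.submit_flush()
    return list(disk.stable)
```

In the first helper the flush finds nothing to persist and the data block
lands in the volatile cache afterwards; in the second, waiting for the write
before flushing makes it durable.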

Oh, right.  I am forgetting about the flushing machinery rewrite.
Thanks for pointing that out.

But what we *could* do is to swap those two calls and then, in the case
where delalloc is enabled, we could maintain a list of inodes for which we
only need to call filemap_fdatawait(), without initiating writeback for
any dirty pages which had been caused by non-allocating writes.
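
A rough sketch of the wait-only idea, as a toy page-state model rather than
JBD2 code. PageCacheInode and commit() are hypothetical names, and the two
lists are an assumption about how "write and wait" inodes might be kept
apart from "wait only" inodes.

```python
CLEAN, DIRTY, WRITEBACK = "clean", "dirty", "writeback"


class PageCacheInode:
    """Toy inode: maps page index to page state and counts submitted I/O."""

    def __init__(self, pages):
        self.pages = dict(pages)
        self.submitted = 0

    def start_writeback(self):
        # Initiate I/O for every dirty page.
        for idx, state in self.pages.items():
            if state == DIRTY:
                self.pages[idx] = WRITEBACK
                self.submitted += 1

    def wait_writeback(self):
        # Wait only for I/O that is already in flight.
        for idx, state in self.pages.items():
            if state == WRITEBACK:
                self.pages[idx] = CLEAN


def commit(write_and_wait, wait_only):
    """Toy commit: submit data for the first list, wait on both lists."""
    for inode in write_and_wait:
        inode.start_writeback()
    for inode in list(write_and_wait) + list(wait_only):
        inode.wait_writeback()
```

For an inode on the wait-only list, a page already under writeback is waited
for, while a page dirtied by a non-allocating write stays dirty and causes
no extra I/O during the commit.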

We actually don't need to swap those two calls - page is already marked as
under writeback in

  mpage_map_and_submit_buffers() -> mpage_submit_page() -> ext4_bio_write_page()

which gets called while we still hold the transaction handle. I agree
calling filemap_fdatawait() from JBD2 during commit should be enough to fix
issues with delalloc writeback. I'm just somewhat afraid that it will be
more fragile: If we add the inode to the transaction's list in
ext4_map_blocks(), we are pretty sure there's no way to allocate a block to
an inode and introduce data exposure issues (which are then very hard to
spot). If we depend on callers of ext4_map_blocks() to properly add the
inode to the appropriate transaction list, we have many more places to
check. I'll think about whether we could make this more robust.
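
The robustness argument can be illustrated with a toy model in which the
mapping routine itself registers the inode whenever it allocates. This is
not ext4 code: ToyTransaction, ToyInode, map_blocks(), buffered_write() and
preallocate() are all hypothetical names chosen for the sketch.

```python
import itertools


class ToyTransaction:
    def __init__(self):
        self.data_inodes = set()  # inodes whose data the commit must handle


class ToyInode:
    def __init__(self, ino):
        self.ino = ino
        self.blocks = {}          # logical block -> physical block


_next_phys = itertools.count(100)  # stand-in for the block allocator


def map_blocks(txn, inode, lblk, allocate=False):
    """Central mapping routine: registers the inode whenever it allocates."""
    if lblk in inode.blocks:
        return inode.blocks[lblk]       # already mapped, nothing to register
    if not allocate:
        return None
    inode.blocks[lblk] = next(_next_phys)
    txn.data_inodes.add(inode.ino)      # done here, so no caller can forget
    return inode.blocks[lblk]


def buffered_write(txn, inode, lblk):
    # One caller; it does not need to know about the transaction list.
    return map_blocks(txn, inode, lblk, allocate=True)


def preallocate(txn, inode, first, count):
    # Another caller; it is covered by the same central registration.
    return [map_blocks(txn, inode, lblk, allocate=True)
            for lblk in range(first, first + count)]
```

Every allocating path goes through map_blocks(), so there is a single place
to audit; with caller-side registration each of the callers would have to be
checked separately.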

OK, I have something - Huang, can you check whether the attached patches
also fix your data exposure issues please? The first patch is the original
fix, patch two is a cleanup, patches 3 and 4 implement the speedup
suggested by Ted. Patches are only lightly tested so far.  I'll run more
comprehensive tests later and in particular I want to check whether the
additional complexity actually brings us some advantage at least for
workloads which redirty pages in addition to writing some new ones using
delayed allocation.

OK, there was a bug in patch 3. Attached is a new version of patches 3 and
4.
							Honza

Attachments

0003-jbd2-Add-support-for-avoiding-data-writes-during-tra.patch [text/x-patch] 7472 bytes
0004-ext4-Do-not-ask-jbd2-to-write-data-for-delalloc-buff.patch [text/x-patch] 4272 bytes
