Re: [PATCH] jbd jbd2: fix dio write returning EIOwhentry_to_release_page fails

From: Hisashi Hifumi <hidden>
Date: 2008-08-07 03:15:14
Also in: linux-fsdevel

At 07:57 08/08/07, Mingming Cao wrote:

蝨ィ 2008-08-06荳臥噪 15:53 +0200�繰an Kara蜀咎％��

quoted

On Wed 06-08-08

09:25:13, Chris Mason wrote:

quoted

On Tue, 2008-08-05 at 14:17 -0700, Mingming Cao wrote:

quoted

蝨ィ 2008-08-05莠檎噪 12:17 -0400�靴hris Mason蜀咎％��

quoted

On

Tue, 2008-08-05 at 13:51 +0900, Hisashi Hifumi wrote:

quoted

diff -Nrup linux-2.6.27-rc1.org/fs/jbd/transaction.c

linux-2.6.27-rc1/fs/jbd/transaction.c

quoted

--- linux-2.6.27-rc1.org/fs/jbd/transaction.c	2008-07-29

19:28:47.000000000 +0900

quoted

+++ linux-2.6.27-rc1/fs/jbd/transaction.c	2008-07-29

20:40:12.000000000 +0900

quoted

@@ -1764,6 +1764,12 @@ int journal_try_to_free_buffers(journal_
 	*/
 	if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask & 

__GFP_FS)) {

quoted

 		journal_wait_for_transaction_sync_data(journal);
+
+		bh = head;
+		do {
+			while (atomic_read(&bh->b_count))
+				schedule();
+		} while ((bh = bh->b_this_page) != head);
 		ret = try_to_free_buffers(page);
 	}

The loop is problematic.  If the scheduler decides to keep

running this

quoted

quoted
quoted
quoted
quoted
quoted
quoted
task then we have a busy loop.  If this task has realtime 

policy then

quoted

it might even lock up the kernel.

ocfs2 calls journal_try_to_free_buffers too, looping on b_count might
not be the best idea there either.

This code gets called from releasepage, which is used other

places than

quoted

the O_DIRECT invalidation paths, I'd be worried about performance
problems here.

try_to_release_page has gfp_mask parameter. So when try_to_releasepage
is called from performance sensitive part, gfp_mask should not be set.
b_count check loop is inside of (gfp_mask & __GFP_WAIT) &&

(gfp_mask & __GFP_FS) check.

quoted

Looks like try_to_free_pages will go into releasepage with wait & fs
both set.  This kind of change would make me very nervous.

Hi Chris,

The gfp_mask try_to_free_pages() takes from it's caller will past it
down to try_to_release_page().  Based on the meaning of __GFP_WAIT and
GFP_FS, if the upper level caller set these two flags,  I assume the
upper level caller expect delay and wait for fs to finish?


But I agree that using a loop in journal_try_to_free_buffers() to wait
for the busy bh release the counter is expensive...

I rediscovered your old thread about trying to do this in a launder_page
call ;)

  Yes, we thought about using launder_page() before :).

quoted

Does it make more sense to fix do_launder_page to call into the FS on
every page, and let the FS check for PageDirty on its own?  That way
invalidate_inode_pages2_range basically gets its own private call into
the FS that says wait around until this page is really free.

  That would certainly work as well. But IMHO waiting for ->writepage()
call to finish isn't really a big deal even in try_to_release_page() if
__GFP_FS (and __GFP_WAIT) is set. The only problem is that there is no
effective way to do so and so Hisashi used that "wait for b_count to drop"
which looks really scary and I don't like it as well.

I was  looking at the comment in invalidate_complete_page2(), which is
now only called from DIO path, it saids

/*
* This is like invalidate_complete_page(), except it ignores the page's
* refcount.  We do this because invalidate_inode_pages2() needs
stronger
* invalidation guarantees, and cannot afford to leave pages behind
because
* shrink_page_list() has a temp ref on them, or because they're
transiently
* sitting in the lru_cache_add() pagevecs.
*/


I am wondering why we need stronger invalidate hurantees for DIO->
invalidate_inode_pages_range(),which force the page being removed from
page cache? In case of bh is busy due to ext3 writeout,
journal_try_to_free_buffers() could return different error number(EBUSY)
to try_to_releasepage() (instead of EIO).  In that case,  could we just
leave the page in the cache, clean pageuptodate() (to force later buffer
read to read from disk) and then invalidate_complete_page2() return
successfully? Any issue with this way?

My idea is that journal_try_to_free_buffers returns EBUSY if it fails due to
bh busy, and dio write falls back to buffered write. This is easy to fix.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help