Thread (33 messages) 33 messages, 6 authors, 2008-08-21

Re: [PATCH] jbd jbd2: fix dio write returning EIOwhentry_to_release_page fails

From: Hisashi Hifumi <hidden>
Date: 2008-08-07 03:15:14
Also in: linux-fsdevel

At 07:57 08/08/07, Mingming Cao wrote:
蝨ィ 2008-08-06荳臥噪 15:53 +0200�繰an Kara蜀咎%��
quoted
On Wed 06-08-08 
09:25:13, Chris Mason wrote:
quoted
quoted
On Tue, 2008-08-05 at 14:17 -0700, Mingming Cao wrote:
quoted
蝨ィ 2008-08-05莠檎噪 12:17 -0400�靴hris Mason蜀咎%��
quoted
On 
Tue, 2008-08-05 at 13:51 +0900, Hisashi Hifumi wrote:
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
diff -Nrup linux-2.6.27-rc1.org/fs/jbd/transaction.c 
linux-2.6.27-rc1/fs/jbd/transaction.c
quoted
quoted
--- linux-2.6.27-rc1.org/fs/jbd/transaction.c	2008-07-29 
19:28:47.000000000 +0900
quoted
quoted
+++ linux-2.6.27-rc1/fs/jbd/transaction.c	2008-07-29 
20:40:12.000000000 +0900
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
@@ -1764,6 +1764,12 @@ int journal_try_to_free_buffers(journal_
 	*/
 	if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask & 
__GFP_FS)) {
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
 		journal_wait_for_transaction_sync_data(journal);
+
+		bh = head;
+		do {
+			while (atomic_read(&bh->b_count))
+				schedule();
+		} while ((bh = bh->b_this_page) != head);
 		ret = try_to_free_buffers(page);
 	}
The loop is problematic.  If the scheduler decides to keep 
running this
quoted
quoted
quoted
quoted
quoted
quoted
quoted
task then we have a busy loop.  If this task has realtime 
policy then
quoted
quoted
quoted
quoted
quoted
quoted
quoted
it might even lock up the kernel.
ocfs2 calls journal_try_to_free_buffers too, looping on b_count might
not be the best idea there either.

This code gets called from releasepage, which is used other 
places than
quoted
quoted
quoted
quoted
quoted
quoted
the O_DIRECT invalidation paths, I'd be worried about performance
problems here.
try_to_release_page has gfp_mask parameter. So when try_to_releasepage
is called from performance sensitive part, gfp_mask should not be set.
b_count check loop is inside of (gfp_mask & __GFP_WAIT) && 
(gfp_mask & __GFP_FS) check.
quoted
quoted
quoted
quoted
Looks like try_to_free_pages will go into releasepage with wait & fs
both set.  This kind of change would make me very nervous.
Hi Chris,

The gfp_mask try_to_free_pages() takes from it's caller will past it
down to try_to_release_page().  Based on the meaning of __GFP_WAIT and
GFP_FS, if the upper level caller set these two flags,  I assume the
upper level caller expect delay and wait for fs to finish?


But I agree that using a loop in journal_try_to_free_buffers() to wait
for the busy bh release the counter is expensive...
I rediscovered your old thread about trying to do this in a launder_page
call ;)
  Yes, we thought about using launder_page() before :).
quoted
Does it make more sense to fix do_launder_page to call into the FS on
every page, and let the FS check for PageDirty on its own?  That way
invalidate_inode_pages2_range basically gets its own private call into
the FS that says wait around until this page is really free.
  That would certainly work as well. But IMHO waiting for ->writepage()
call to finish isn't really a big deal even in try_to_release_page() if
__GFP_FS (and __GFP_WAIT) is set. The only problem is that there is no
effective way to do so and so Hisashi used that "wait for b_count to drop"
which looks really scary and I don't like it as well.
I was  looking at the comment in invalidate_complete_page2(), which is
now only called from DIO path, it saids

/*
* This is like invalidate_complete_page(), except it ignores the page's
* refcount.  We do this because invalidate_inode_pages2() needs
stronger
* invalidation guarantees, and cannot afford to leave pages behind
because
* shrink_page_list() has a temp ref on them, or because they're
transiently
* sitting in the lru_cache_add() pagevecs.
*/


I am wondering why we need stronger invalidate hurantees for DIO->
invalidate_inode_pages_range(),which force the page being removed from
page cache? In case of bh is busy due to ext3 writeout,
journal_try_to_free_buffers() could return different error number(EBUSY)
to try_to_releasepage() (instead of EIO).  In that case,  could we just
leave the page in the cache, clean pageuptodate() (to force later buffer
read to read from disk) and then invalidate_complete_page2() return
successfully? Any issue with this way?
My idea is that journal_try_to_free_buffers returns EBUSY if it fails due to
bh busy, and dio write falls back to buffered write. This is easy to fix.

Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help