Re: [PATCH] jbd jbd2: fix dio write returning EIO whentry_to_release_page fails
From: Mingming Cao <hidden>
Date: 2008-08-06 22:57:58
Also in:
linux-fsdevel
在 2008-08-06三的 15:53 +0200,Jan Kara写道:
On Wed 06-08-08 09:25:13, Chris Mason wrote:quoted
On Tue, 2008-08-05 at 14:17 -0700, Mingming Cao wrote:quoted
在 2008-08-05二的 12:17 -0400,Chris Mason写道:quoted
On Tue, 2008-08-05 at 13:51 +0900, Hisashi Hifumi wrote:quoted
quoted
quoted
quoted
diff -Nrup linux-2.6.27-rc1.org/fs/jbd/transaction.clinux-2.6.27-rc1/fs/jbd/transaction.cquoted
quoted
--- linux-2.6.27-rc1.org/fs/jbd/transaction.c 2008-07-2919:28:47.000000000 +0900quoted
quoted
+++ linux-2.6.27-rc1/fs/jbd/transaction.c 2008-07-29 20:40:12.000000000 +0900@@ -1764,6 +1764,12 @@ int journal_try_to_free_buffers(journal_ */ if (ret == 0 && (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS)) { journal_wait_for_transaction_sync_data(journal); + + bh = head; + do { + while (atomic_read(&bh->b_count)) + schedule(); + } while ((bh = bh->b_this_page) != head); ret = try_to_free_buffers(page); }The loop is problematic. If the scheduler decides to keep running this task then we have a busy loop. If this task has realtime policy then it might even lock up the kernel.ocfs2 calls journal_try_to_free_buffers too, looping on b_count might not be the best idea there either. This code gets called from releasepage, which is used other places than the O_DIRECT invalidation paths, I'd be worried about performance problems here.try_to_release_page has gfp_mask parameter. So when try_to_releasepage is called from performance sensitive part, gfp_mask should not be set. b_count check loop is inside of (gfp_mask & __GFP_WAIT) && (gfp_mask & __GFP_FS) check.Looks like try_to_free_pages will go into releasepage with wait & fs both set. This kind of change would make me very nervous.Hi Chris, The gfp_mask try_to_free_pages() takes from it's caller will past it down to try_to_release_page(). Based on the meaning of __GFP_WAIT and GFP_FS, if the upper level caller set these two flags, I assume the upper level caller expect delay and wait for fs to finish? But I agree that using a loop in journal_try_to_free_buffers() to wait for the busy bh release the counter is expensive...I rediscovered your old thread about trying to do this in a launder_page call ;)Yes, we thought about using launder_page() before :).quoted
Does it make more sense to fix do_launder_page to call into the FS on every page, and let the FS check for PageDirty on its own? That way invalidate_inode_pages2_range basically gets its own private call into the FS that says wait around until this page is really free.That would certainly work as well. But IMHO waiting for ->writepage() call to finish isn't really a big deal even in try_to_release_page() if __GFP_FS (and __GFP_WAIT) is set. The only problem is that there is no effective way to do so and so Hisashi used that "wait for b_count to drop" which looks really scary and I don't like it as well.
I was looking at the comment in invalidate_complete_page2(), which is now only called from DIO path, it saids /* * This is like invalidate_complete_page(), except it ignores the page's * refcount. We do this because invalidate_inode_pages2() needs stronger * invalidation guarantees, and cannot afford to leave pages behind because * shrink_page_list() has a temp ref on them, or because they're transiently * sitting in the lru_cache_add() pagevecs. */ I am wondering why we need stronger invalidate hurantees for DIO-> invalidate_inode_pages_range(),which force the page being removed from page cache? In case of bh is busy due to ext3 writeout, journal_try_to_free_buffers() could return different error number(EBUSY) to try_to_releasepage() (instead of EIO). In that case, could we just leave the page in the cache, clean pageuptodate() (to force later buffer read to read from disk) and then invalidate_complete_page2() return successfully? Any issue with this way? Mingming
Honza
-- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html