Re: xfstests 073 regression

From: Dave Chinner <david@fromorbit.com>
Date: 2011-08-01 01:28:35
Also in: lkml

On Sun, Jul 31, 2011 at 02:25:27PM -1000, Linus Torvalds wrote:

On Sun, Jul 31, 2011 at 1:47 PM, Dave Chinner [off-list ref] wrote:

quoted

Yes, I already have, a couple of hours before you sent this:

http://www.spinics.net/lists/linux-fsdevel/msg47357.html

We haven't found the root cause of the problem, and writeback cannot
hold off grab_super_passive() because writeback only holds read
locks on s_umount, just like grab_super_passive.

With read-write semaphores, even read-vs-read recursion is a deadlock
possibility.

Why? Because if a writer comes in on another thread, while the read
lock is initially held, then the writer will now block. And due to
fairness, now a subsequent reader will *also* block.

Yes, I know.

So no, nesting readers is *not* allowed for rw_semaphores even if
naively you'd think it should work.

Right, hence my question of where the deadlock was. I can see how we
get a temporary holdoff due to background flushing holding the
s_umount lock (and hence why this change prevents that holdoff) but
that should not deadlock even if there is a pending write lock.

So if xfstests 073 does mount/umount testing, then it is entirely
possible that a reader blocks another reader due to a pending writer.

And I know that remount,ro will take the lock in write mode, hence
cause such a holdoff to occur. Test 073 does this, and it is likely
to be the cause of the problem.

However, if we can't grab the superblock, it implies it that there
is something major happening on the superblock. It's likely to be
going away, or in the case of a remount, something significant is
changing.  In both cases, writeback of dirty data is going to be
handled by the umount/remount process, and so whatever writeback
that is currently in progress is just getting in the way of
executing the higher level, more important operation.

AFAICT, there's a bunch of non-trivial ways that we can get
read-write-read deadlocks that the bdi-flusher "avoids" by ignoring
inodes when it can't get the superblock. However, if it can't get
the superblock reference, why shouldn't we just abort whatever
writeback that is going through writeback_inodes_wb? That writeback
is guaranteed to be unable to complete until the reference to
s_umount is dropped, so why do we continue to try one inode at a
time?

IOWs, what I'm asking is whether this "just move the inodes one at a
time to a different queue" is just a bandaid for a particular
symptom of a deeper problem we haven't realised existed....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help