Re: [PATCH 2/2] md: fix deadlock while suspending RAID array

From: NeilBrown <hidden>
Date: 2014-03-19 23:02:49
Also in: lkml

On Tue, 11 Mar 2014 20:12:10 +0100 Simon Guinot [off-list ref]
wrote:

quoted hunk ↗ jump to hunk

Sometimes a deadlock happens while migrating a RAID array level, using
the mdadm --grow command. In the following example, an ext4 filesystem
is installed over a RAID1 array and mdadm is used to transform this
array into a RAID5 one. Here are the observed backtraces for the locked
tasks:

jbd2/dm-0-8     D c0478384     0  9100      2 0x00000000
[<c0478384>] (__schedule+0x154/0x320) from [<c0157c68>] (jbd2_journal_commit_transaction+0x1b0/0x132c)
[<c0157c68>] (jbd2_journal_commit_transaction+0x1b0/0x132c) from [<c015b5f4>] (kjournald2+0x9c/0x200)
[<c015b5f4>] (kjournald2+0x9c/0x200) from [<c003558c>] (kthread+0xa4/0xb0)
[<c003558c>] (kthread+0xa4/0xb0) from [<c000df18>] (ret_from_fork+0x14/0x3c)
ext4lazyinit    D c0478384     0  9113      2 0x00000000
[<c0478384>] (__schedule+0x154/0x320) from [<c0364ed0>] (md_write_start+0xd8/0x194)
[<c0364ed0>] (md_write_start+0xd8/0x194) from [<bf004c80>] (make_request+0x3c/0xc5c [raid1])
[<bf004c80>] (make_request+0x3c/0xc5c [raid1]) from [<c0363784>] (md_make_request+0xe4/0x1f8)
[<c0363784>] (md_make_request+0xe4/0x1f8) from [<c025f544>] (generic_make_request+0xa8/0xc8)
[<c025f544>] (generic_make_request+0xa8/0xc8) from [<c025f5e4>] (submit_bio+0x80/0x12c)
[<c025f5e4>] (submit_bio+0x80/0x12c) from [<c0265648>] (__blkdev_issue_zeroout+0x134/0x1a0)
[<c0265648>] (__blkdev_issue_zeroout+0x134/0x1a0) from [<c0265748>] (blkdev_issue_zeroout+0x94/0xa0)
[<c0265748>] (blkdev_issue_zeroout+0x94/0xa0) from [<c011a3e8>] (ext4_init_inode_table+0x178/0x2cc)
[<c011a3e8>] (ext4_init_inode_table+0x178/0x2cc) from [<c0129fac>] (ext4_lazyinit_thread+0xe8/0x288)
[<c0129fac>] (ext4_lazyinit_thread+0xe8/0x288) from [<c003558c>] (kthread+0xa4/0xb0)
[<c003558c>] (kthread+0xa4/0xb0) from [<c000df18>] (ret_from_fork+0x14/0x3c)
mdadm           D c0478384     0 10163   9465 0x00000000
[<c0478384>] (__schedule+0x154/0x320) from [<c0362edc>] (mddev_suspend+0x68/0xc0)
[<c0362edc>] (mddev_suspend+0x68/0xc0) from [<c0363080>] (level_store+0x14c/0x59c)
[<c0363080>] (level_store+0x14c/0x59c) from [<c03665ac>] (md_attr_store+0xac/0xdc)
[<c03665ac>] (md_attr_store+0xac/0xdc) from [<c00eee38>] (sysfs_write_file+0x100/0x168)
[<c00eee38>] (sysfs_write_file+0x100/0x168) from [<c0098598>] (vfs_write+0xb8/0x184)
[<c0098598>] (vfs_write+0xb8/0x184) from [<c009893c>] (SyS_write+0x40/0x6c)
[<c009893c>] (SyS_write+0x40/0x6c) from [<c000de80>] (ret_fast_syscall+0x0/0x30)

This deadlock can be reproduced on different architecture (ARM and x86)
and also with different Linux kernel versions: 3.14-rc and 3.10 stable.

The problem comes from the mddev_suspend() function which don't allow
mddev->thread to complete the pending I/Os (mddev->active_io) if any:

1. mdadm holds mddev->reconfig_mutex before running mddev_suspend().
   If a write I/O is submitted while mdadm holds the mutex and when the
   RAID array is still not suspended, then mddev->thread is not able to
   complete the I/O: The superblock can't be updated because
   mddev->reconfig_mutex is not available. Note that having a write I/O
   over a "not suspended yet" RAID array is not a marginal scenario:
   To load a new RAID personality, level_store() calls request_module()
   which is allowed to schedule. Moreover on a SMP or a preemptible
   kernel, the odds are probably even greater.

2. In a same way, mddev_suspend() sets the mddev->suspended flag. Again
   this may prevent mddev->thread to complete some pending I/Os when
   a superblock update is needed: md_check_recovery, used by the RAID
   threads, does nothing but exits when the mddev->suspended flag is
   set. As a consequence the superblock is never updated.

This patch solves this issues by ensuring there is no pending active
I/Os before suspending effectively a RAID array.

Signed-off-by: Simon Guinot <redacted>
Tested-by: Rémi Rérolle <redacted>
---
 drivers/md/md.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index fb4296adae80..ea3e95d1972b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c

@@ -375,9 +375,22 @@ static void md_make_request(struct request_queue *q, struct bio *bio)
 void mddev_suspend(struct mddev *mddev)
 {
 	BUG_ON(mddev->suspended);
-	mddev->suspended = 1;
-	synchronize_rcu();
-	wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
+
+	for (;;) {
+		mddev->suspended = 1;
+		synchronize_rcu();
+		if (atomic_read(&mddev->active_io) == 0)
+			break;
+		mddev->suspended = 0;
+		synchronize_rcu();
+		/*
+		 * Note that mddev_unlock is also used to wake up mddev->thread.
+		 * This allows to complete the pending mddev->active_io.
+		 */
+		mddev_unlock(mddev);
+		wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
+		mddev_lock_nointr(mddev);
+	}
 	mddev->pers->quiesce(mddev, 1);
 
 	del_timer_sync(&mddev->safemode_timer);


Hi Simon,
 thanks for the report and the patch.

Dropping a lock inside a function which was called with the lock held always
makes me feel nervous.  It is quite possible that it is safe, but I find it
very hard to convince myself that it is safe.  So I would rather avoid it if
I could.

Would it be possible to perform that superblock update etc inside
mddev_suspend?
e.g.

   mddev->suspended = 1;
   synchronize_rcu();
   while  (atomic_read(&mddev->active_io) > 0) {
      prepare_wait(... mddev->sb_wait..);
      if (mddev->flags & MD_UPDATE_SB_FLAGS)
          md_update_sb(mddev, 0);
      schedule()
   }
   finish_wait()

and then in md_check_recovery()

   if (mddev->suspended) {
       if (mddev->flags & MD_UPDATE_SB_FLAGS)
           wake_up(mddev->sb_wait);
       return;
   }

Probably a lot of details are missing but hopefully the idea is clear.

Does that seem reasonable to you?  Would you like to try coding that and see
how it goes?

Thanks,
NeilBrown

Attachments

signature.asc [application/pgp-signature] 828 bytes

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help