Re: [PATCH] block: fix deadlock between bd_link_disk_holder and partition scan

From: Yu Kuai <hidden>
Date: 2024-02-18 07:47:20
Also in: linux-block, lkml

Hi,

在 2024/02/17 3:03, Song Liu 写道:

quoted hunk ↗ jump to hunk

On Thu, Feb 8, 2024 at 4:49 PM Song Liu [off-list ref] wrote:

quoted

On Thu, Feb 8, 2024 at 12:44 AM Li Nan [off-list ref] wrote:

quoted



在 2024/2/8 14:50, Song Liu 写道:

quoted

On Wed, Feb 7, 2024 at 1:32 AM [off-list ref] wrote:

quoted

From: Li Nan <redacted>

'open_mutex' of gendisk is used to protect open/close block devices. But
in bd_link_disk_holder(), it is used to protect the creation of symlink
between holding disk and slave bdev, which introduces some issues.

When bd_link_disk_holder() is called, the driver is usually in the process
of initialization/modification and may suspend submitting io. At this
time, any io hold 'open_mutex', such as scanning partitions, can cause
deadlocks. For example, in raid:

T1                              T2
bdev_open_by_dev
   lock open_mutex [1]
   ...
    efi_partition
    ...
     md_submit_bio
                                  md_ioctl mddev_syspend
                                    -> suspend all io
                                   md_add_new_disk
                                    bind_rdev_to_array
                                     bd_link_disk_holder
                                      try lock open_mutex [2]
      md_handle_request
       -> wait mddev_resume

T1 scan partition, T2 add a new device to raid. T1 waits for T2 to resume
mddev, but T2 waits for open_mutex held by T1. Deadlock occurs.

Fix it by introducing a local mutex 'holder_mutex' to replace 'open_mutex'.

Is this to fix [1]? Do we need some Fixes and/or Closes tags?

No. Just use another way to fix [2], and both [2] and this patch can fix
the issue. I am not sure about the root cause of [1] yet.

[2] https://patchwork.kernel.org/project/linux-raid/list/?series=812045

quoted

Could you please add steps to reproduce this issue?

We need to modify the kernel, add sleep in md_submit_bio() and md_ioctl()
as below, and then:
    1. mdadm -CR /dev/md0 -l1 -n2 /dev/sd[bc]  #create a raid
    2. echo 1 > /sys/module/md_mod/parameters/error_inject  #enable sleep
    3. 'mdadm --add /dev/md0 /dev/sda'  #add a disk to raid
    4. submit ioctl BLKRRPART to raid within 10s.

The analysis makes sense. I also hit the issue a couple times without adding
extra delays. But I am not sure whether this is the best fix (I didn't find real
issues with it either).

To be extra safe and future proof, we can do something like the
following to only
suspend the array for ADD_NEW_DISK on not-running arrays.

This appear to solve the problem reported in

https://bugzilla.kernel.org/show_bug.cgi?id=218459

Thanks,
Song

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 9e41a9aaba8b..395911d5f4d6 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c

@@ -7570,10 +7570,11 @@ static inline bool md_ioctl_valid(unsigned int cmd)
         }
  }

-static bool md_ioctl_need_suspend(unsigned int cmd)
+static bool md_ioctl_need_suspend(struct mddev *mddev, unsigned int cmd)
  {
         switch (cmd) {
         case ADD_NEW_DISK:
+               return mddev->pers != NULL;

Did you check already that this problem is not related that 'active_io'
is leaked for flush IO?

I don't understand the problem reported yet. If 'mddev->pers' is not set
yet, md_submit_bio() will return directly, and 'active_io' should not be
grabbed in the first place.

md_run() is the only place to convert 'mddev->pers' from NULL to a real
personality, and it's protected by 'reconfig_mutex', however,
md_ioctl_need_suspend() is called without 'reconfig_mutex', hence there
is a race condition:

md_ioctl_need_suspend		array_state_store
  // mddev->pers is NULL, return false
				 mddev_lock
				 do_md_run
				  mddev->pers = xxx
				 mddev_unlock

  // mddev_suspend is not called
  mddev_lock
  md_add_new_disk
   if (mddev->pers)
    md_import_device
    bind_rdev_to_array
    add_bound_rdev
     mddev->pers->hot_add_disk
     -> hot add disk without suspending

Thanks,
Kuai

quoted hunk ↗ jump to hunk

         case HOT_ADD_DISK:
         case HOT_REMOVE_DISK:
         case SET_BITMAP_FILE:

@@ -7625,6 +7626,7 @@ static int md_ioctl(struct block_device *bdev,

blk_mode_t mode,
         void __user *argp = (void __user *)arg;
         struct mddev *mddev = NULL;
         bool did_set_md_closing = false;
+       bool need_suspend;

         if (!md_ioctl_valid(cmd))
                 return -ENOTTY;

@@ -7716,8 +7718,11 @@ static int md_ioctl(struct block_device *bdev,

blk_mode_t mode,
         if (!md_is_rdwr(mddev))
                 flush_work(&mddev->sync_work);

-       err = md_ioctl_need_suspend(cmd) ? mddev_suspend_and_lock(mddev) :
-                                          mddev_lock(mddev);
+       need_suspend = md_ioctl_need_suspend(mddev, cmd);
+       if (need_suspend)
+               err = mddev_suspend_and_lock(mddev);
+       else
+               err = mddev_lock(mddev);
         if (err) {
                 pr_debug("md: ioctl lock interrupted, reason %d, cmd %d\n",
                          err, cmd);

@@ -7846,8 +7851,10 @@ static int md_ioctl(struct block_device *bdev,

blk_mode_t mode,
             err != -EINVAL)
                 mddev->hold_active = 0;

-       md_ioctl_need_suspend(cmd) ? mddev_unlock_and_resume(mddev) :
-                                    mddev_unlock(mddev);
+       if (need_suspend)
+               mddev_unlock_and_resume(mddev);
+       else
+               mddev_unlock(mddev);

  out:
         if(did_set_md_closing)
.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help