Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs

From: Mariusz Tkaczyk <hidden>
Date: 2023-08-31 06:51:12
Also in: lkml, regressions

Possibly related (same subject, not in this thread)

2023-09-08 · Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs · Song Liu <song@kernel.org>
2023-09-07 · Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs · Mariusz Tkaczyk <hidden>
2023-09-07 · Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs · Mariusz Tkaczyk <hidden>
2023-09-07 · Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs · Yu Kuai <hidden>
2023-08-23 · Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs · Song Liu <song@kernel.org>

On Thu, 31 Aug 2023 10:28:44 +0800
Yu Kuai [off-list ref] wrote:

Hi, Acelan,

在 2023/08/28 21:50, Yu Kuai 写道:

quoted

Hi,

在 2023/08/28 13:20, Song Liu 写道:

quoted

Hi AceLan,

Thanks for running the experiments.

On Fri, Aug 25, 2023 at 9:32 PM AceLan Kao [off-list ref] wrote:
[...]

quoted

Could you please run the follow two experiments?

1. Confirm 12a6caf273240a triggers this. Specifically:
    git checkout 12a6caf273240a => repros
    git checkout 12a6caf273240a~1 => cannot repro

Yes, I'm pretty sure about this, that's my bisect result and I just
confirmed it again.
I also tried reverting 12a6caf273240a and the issue is gone.

The log doesn't match my guess. Specifically:

[  420.068142] systemd-shutdown[1]: Stopping MD /dev/md123 (9:123).
[  420.074718] md_open:md123 openers++ = 1 by systemd-shutdow
[  420.080787] systemd-shutdown[1]: Failed to sync MD block device
/dev/md123, ignoring: Input/output error
[  420.090831] md: md123 stopped.
[  420.094465] systemd-shutdown[1]: Stopping MD /dev/md122 (9:122).
[  420.101045] systemd-shutdown[1]: Could not stop MD /dev/md122:
Device or resource busy

I see that:

systemd-shutdown[1]: Couldn't finalize remaining  MD devices, trying again.

Can we make sure is this why power off hang?

Because in my VM, power off is not hang and I got:

systemd-shutdown[1]: Could not stop MD /dev/md1: Device or resource busy
systemd-shutdown[1]: Failed to finalize MD devices, ignoring.

quoted

For a successful stop on md123, we reach the pr_info() in md_open().
For a failed stop on md122, the kernel returns -EBUSY before that
pr_info() in md_open(). There are some changes in md_open() in
the past few release, so I am not quite sure we are looking at the
same code.

By the way, based on code review, looks like md_open never return
-EBUSY, and I think following is the only place can return -EBUSY before
md_open() is called:

blkdev_open
  blkdev_get_by_dev
   bd_prepare_to_claim
    bd_may_claim     -> -EBUSY

Acelan, can you add following debug patch on the top of Song's patch
and reporduce it again? Hope it'll confirm why stop array failed with
-EBUSY.

I saw your reply:

https://lore.kernel.org/all/CAMz9Wg9VXJzZkKWCosm0_=Rz6gG9r3pCMrhvBZQVCQb8i0kd-w@mail.gmail.com/ (local)

But I didn't recieve you emails, please reply to me through this email
address: yukuai3@huawei.com. yukuai1@huaweicloud.com is just used for
sending, and I always missing emails from huaweicloud.com, my apologize
for this inconvenience, I have to do it this way due to my company
policy.

There is something interesting on test result:

At first, md122 is the root device, I guess this is why I can't repoduce
this problem in my local VM.
[   14.478619] systemd-gpt-auto-generator[585]: md122p3: Root device 
/dev/md122.
[   14.481823] block md122: the capability attribute has been deprecated.

Then, before trying to stop md122, following log shows that md122 is
still mounted:
[ 1368.369926] [3462]: Remounting '/' read-only with options 
'errors=remount-ro,stripe=128'.
[ 1368.396287] EXT4-fs (md122p3): re-mounted 
6d53dc8e-3f45-4efa-bc0e-4af477fac217 ro. Quota mode: none.

And since then, there is no log aboug md_open(), which means no one
ever open md122 since this remount.

At last, there are lots of log shows that md122 can't be opened
exclusively:
[ 1368.812952] md122:systemd-shutdow bd_prepare_to_claim return -16
[ 1368.819189] md122:systemd-shutdow blkdev_get_by_dev return -16

Which indicate that md122 is still mounted, and systemd-shutdow can't
stop this array. This behaviour looks correct to me, because rootfs
can't be umounted while systemd-shutdown is still running.

However, there are some questions I don't have answers for now, I'm not
familiar with how systemd works:

1) In the normal case that raid is root device, how can systemd-shutdown
make progress and how is rootfs unmounted?

Thanks for digging into!

In that case systemd-shutdown should skip this device. It should be umounted by
dracut. systemd-shutdown implements naive way to try stop MD devices, simple
ioctl(fd, STOP_ARRAY, NULL).
We are expecting failure if resource is still mounted.

Here dracut code:
https://github.com/dracutdevs/dracut/blob/master/modules.d/90mdraid/md-shutdown.sh

Ohh, it is ubuntu so it is handled by initramfs-tools not by dracut, but I hope
you will find it useful for the future. 

and system-shutdown:
https://github.com/systemd/systemd/blob/main/src/shutdown/shutdown.c#L483C7-L483C7

When the functionality in systemd-shutdown was fixed by me, I asked if we can
make it configurable- Lennart disagreed so there is now way to disable it now.

2) How does this scenario related to commit 12a6caf273240a.

I'm unable to reproduce it on Redhat, Ubuntu is on the way now.
I'm starting to be convinced that it could be related with initramfs-tools but
I cannot prove this theory yet. I need to reproduce the issue first.

Thanks,
Mariusz

I must read systemd source code to get answers, and it'll be appreciated
if someone has answers.

Thanks,
Kuai

quoted

diff --git a/block/bdev.c b/block/bdev.c
index 979e28a46b98..699739223dcb 100644
--- a/block/bdev.c
+++ b/block/bdev.c

@@ -789,8 +789,11 @@ struct block_device *blkdev_get_by_dev(dev_t dev,

blk_mode_t mode, void *holder,
         if (holder) {
                 mode |= BLK_OPEN_EXCL;
                 ret = bd_prepare_to_claim(bdev, holder, hops);
-               if (ret)
+               if (ret) {
+                       pr_warn("%s:%s bd_prepare_to_claim return %d\n",
+                               disk->disk_name, current->comm, ret);
                         goto put_blkdev;
+               }
         } else {
                 if (WARN_ON_ONCE(mode & BLK_OPEN_EXCL)) {
                         ret = -EIO;

diff --git a/block/fops.c b/block/fops.c
index eaa98a987213..2d69119c71f6 100644
--- a/block/fops.c
+++ b/block/fops.c

@@ -587,8 +587,11 @@ static int blkdev_open(struct inode *inode, struct

file *filp)

         bdev = blkdev_get_by_dev(inode->i_rdev, file_to_blk_mode(filp),
                                  filp->private_data, NULL);
-       if (IS_ERR(bdev))
+       if (IS_ERR(bdev)) {
+               pr_warn("%pD:%s blkdev_get_by_dev return %ld\n",
+                       filp, current->comm, PTR_ERR(bdev));
                 return PTR_ERR(bdev);
+       }

         if (bdev_nowait(bdev))
                 filp->f_mode |= FMODE_NOWAIT;

Thanks,
Kuai

quoted

Therefore, could you please help clarify:

1. Which base kernel are you using?

quoted

From the log, you are using 6.5-rc7-706a74159504. However,

I think we cannot cleanly revert 12a6caf273240a on top of
6.5-rc7-706a74159504. Did you manually fix some issue in the
revert? If so, could you please share the revert commit?

2. If you are not using 6.5-rc7-706a74159504 as base kernel, which
one are you using?

Thanks,
Song

quoted

2. Try with the following change (add debug messages), which hopefully
    shows which command is holding a reference on mddev->openers.

Thanks,
Song

diff --git i/drivers/md/md.c w/drivers/md/md.c
index 78be7811a89f..3e9b718b32c1 100644
--- i/drivers/md/md.c
+++ w/drivers/md/md.c

@@ -7574,11 +7574,15 @@ static int md_ioctl(struct block_device *bdev,

blk_mode_t mode,
                 if (mddev->pers && atomic_read(&mddev->openers) > 1) {
                         mutex_unlock(&mddev->open_mutex);
                         err = -EBUSY;
+                       pr_warn("%s return -EBUSY for %s with
mddev->openers = %d\n",
+                               __func__, mdname(mddev),
atomic_read(&mddev->openers));
                         goto out;
                 }
                 if (test_and_set_bit(MD_CLOSING, &mddev->flags)) {
                         mutex_unlock(&mddev->open_mutex);
                         err = -EBUSY;
+                       pr_warn("%s return -EBUSY for %s with
MD_CLOSING bit set\n",
+                               __func__, mdname(mddev));
                         goto out;
                 }
                 did_set_md_closing = true;

@@ -7789,6 +7793,8 @@ static int md_open(struct gendisk *disk,

blk_mode_t mode)
                 goto out_unlock;

         atomic_inc(&mddev->openers);
+       pr_info("%s:%s openers++ = %d by %s\n", __func__, 
mdname(mddev),
+               atomic_read(&mddev->openers), current->comm);
         mutex_unlock(&mddev->open_mutex);

         disk_check_media_change(disk);

@@ -7807,6 +7813,8 @@ static void md_release(struct gendisk *disk)

         BUG_ON(!mddev);
         atomic_dec(&mddev->openers);
+       pr_info("%s:%s openers-- = %d by %s\n", __func__, 
mdname(mddev),
+               atomic_read(&mddev->openers), current->comm);
         mddev_put(mddev);
  }

It's pretty strange that I can't reproduce the issue after applied 
the patch.

I tried to figure out which part affect the issue and found when I
comment out the pr_info() In md_release(), the issue could be
reproduced.

-- 
Chia-Lin Kao(AceLan)
http://blog.acelan.idv.tw/
E-Mail: acelan.kaoATcanonical.com (s/AT/@/)

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help