Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs
From: Mariusz Tkaczyk <hidden>
Date: 2023-08-31 06:51:12
Also in:
lkml, regressions
Possibly related (same subject, not in this thread)
- 2023-09-08 · Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs · Song Liu <song@kernel.org>
- 2023-09-07 · Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs · Mariusz Tkaczyk <hidden>
- 2023-09-07 · Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs · Mariusz Tkaczyk <hidden>
- 2023-09-07 · Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs · Yu Kuai <hidden>
- 2023-08-23 · Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs · Song Liu <song@kernel.org>
On Thu, 31 Aug 2023 10:28:44 +0800 Yu Kuai [off-list ref] wrote:
Hi, Acelan, 在 2023/08/28 21:50, Yu Kuai 写道:quoted
Hi, 在 2023/08/28 13:20, Song Liu 写道:quoted
Hi AceLan, Thanks for running the experiments. On Fri, Aug 25, 2023 at 9:32 PM AceLan Kao [off-list ref] wrote: [...]quoted
quoted
Could you please run the follow two experiments? 1. Confirm 12a6caf273240a triggers this. Specifically: git checkout 12a6caf273240a => repros git checkout 12a6caf273240a~1 => cannot reproYes, I'm pretty sure about this, that's my bisect result and I just confirmed it again. I also tried reverting 12a6caf273240a and the issue is gone.The log doesn't match my guess. Specifically: [ 420.068142] systemd-shutdown[1]: Stopping MD /dev/md123 (9:123). [ 420.074718] md_open:md123 openers++ = 1 by systemd-shutdow [ 420.080787] systemd-shutdown[1]: Failed to sync MD block device /dev/md123, ignoring: Input/output error [ 420.090831] md: md123 stopped. [ 420.094465] systemd-shutdown[1]: Stopping MD /dev/md122 (9:122). [ 420.101045] systemd-shutdown[1]: Could not stop MD /dev/md122: Device or resource busyI see that: systemd-shutdown[1]: Couldn't finalize remaining MD devices, trying again. Can we make sure is this why power off hang? Because in my VM, power off is not hang and I got: systemd-shutdown[1]: Could not stop MD /dev/md1: Device or resource busy systemd-shutdown[1]: Failed to finalize MD devices, ignoring.quoted
For a successful stop on md123, we reach the pr_info() in md_open(). For a failed stop on md122, the kernel returns -EBUSY before that pr_info() in md_open(). There are some changes in md_open() in the past few release, so I am not quite sure we are looking at the same code.By the way, based on code review, looks like md_open never return -EBUSY, and I think following is the only place can return -EBUSY before md_open() is called: blkdev_open blkdev_get_by_dev bd_prepare_to_claim bd_may_claim -> -EBUSY Acelan, can you add following debug patch on the top of Song's patch and reporduce it again? Hope it'll confirm why stop array failed with -EBUSY.I saw your reply: https://lore.kernel.org/all/CAMz9Wg9VXJzZkKWCosm0_=Rz6gG9r3pCMrhvBZQVCQb8i0kd-w@mail.gmail.com/ (local) But I didn't recieve you emails, please reply to me through this email address: yukuai3@huawei.com. yukuai1@huaweicloud.com is just used for sending, and I always missing emails from huaweicloud.com, my apologize for this inconvenience, I have to do it this way due to my company policy. There is something interesting on test result: At first, md122 is the root device, I guess this is why I can't repoduce this problem in my local VM. [ 14.478619] systemd-gpt-auto-generator[585]: md122p3: Root device /dev/md122. [ 14.481823] block md122: the capability attribute has been deprecated. Then, before trying to stop md122, following log shows that md122 is still mounted: [ 1368.369926] [3462]: Remounting '/' read-only with options 'errors=remount-ro,stripe=128'. [ 1368.396287] EXT4-fs (md122p3): re-mounted 6d53dc8e-3f45-4efa-bc0e-4af477fac217 ro. Quota mode: none. And since then, there is no log aboug md_open(), which means no one ever open md122 since this remount. At last, there are lots of log shows that md122 can't be opened exclusively: [ 1368.812952] md122:systemd-shutdow bd_prepare_to_claim return -16 [ 1368.819189] md122:systemd-shutdow blkdev_get_by_dev return -16 Which indicate that md122 is still mounted, and systemd-shutdow can't stop this array. This behaviour looks correct to me, because rootfs can't be umounted while systemd-shutdown is still running. However, there are some questions I don't have answers for now, I'm not familiar with how systemd works: 1) In the normal case that raid is root device, how can systemd-shutdown make progress and how is rootfs unmounted?
Thanks for digging into! In that case systemd-shutdown should skip this device. It should be umounted by dracut. systemd-shutdown implements naive way to try stop MD devices, simple ioctl(fd, STOP_ARRAY, NULL). We are expecting failure if resource is still mounted. Here dracut code: https://github.com/dracutdevs/dracut/blob/master/modules.d/90mdraid/md-shutdown.sh Ohh, it is ubuntu so it is handled by initramfs-tools not by dracut, but I hope you will find it useful for the future. and system-shutdown: https://github.com/systemd/systemd/blob/main/src/shutdown/shutdown.c#L483C7-L483C7 When the functionality in systemd-shutdown was fixed by me, I asked if we can make it configurable- Lennart disagreed so there is now way to disable it now.
2) How does this scenario related to commit 12a6caf273240a.
I'm unable to reproduce it on Redhat, Ubuntu is on the way now. I'm starting to be convinced that it could be related with initramfs-tools but I cannot prove this theory yet. I need to reproduce the issue first. Thanks, Mariusz
I must read systemd source code to get answers, and it'll be appreciated if someone has answers. Thanks, Kuaiquoted
diff --git a/block/bdev.c b/block/bdev.c index 979e28a46b98..699739223dcb 100644 --- a/block/bdev.c +++ b/block/bdev.c@@ -789,8 +789,11 @@ struct block_device *blkdev_get_by_dev(dev_t dev,blk_mode_t mode, void *holder, if (holder) { mode |= BLK_OPEN_EXCL; ret = bd_prepare_to_claim(bdev, holder, hops); - if (ret) + if (ret) { + pr_warn("%s:%s bd_prepare_to_claim return %d\n", + disk->disk_name, current->comm, ret); goto put_blkdev; + } } else { if (WARN_ON_ONCE(mode & BLK_OPEN_EXCL)) { ret = -EIO;diff --git a/block/fops.c b/block/fops.c index eaa98a987213..2d69119c71f6 100644 --- a/block/fops.c +++ b/block/fops.c@@ -587,8 +587,11 @@ static int blkdev_open(struct inode *inode, structfile *filp) bdev = blkdev_get_by_dev(inode->i_rdev, file_to_blk_mode(filp), filp->private_data, NULL); - if (IS_ERR(bdev)) + if (IS_ERR(bdev)) { + pr_warn("%pD:%s blkdev_get_by_dev return %ld\n", + filp, current->comm, PTR_ERR(bdev)); return PTR_ERR(bdev); + } if (bdev_nowait(bdev)) filp->f_mode |= FMODE_NOWAIT; Thanks, Kuaiquoted
Therefore, could you please help clarify: 1. Which base kernel are you using?quoted
From the log, you are using 6.5-rc7-706a74159504. However,I think we cannot cleanly revert 12a6caf273240a on top of 6.5-rc7-706a74159504. Did you manually fix some issue in the revert? If so, could you please share the revert commit? 2. If you are not using 6.5-rc7-706a74159504 as base kernel, which one are you using? Thanks, Songquoted
quoted
2. Try with the following change (add debug messages), which hopefully shows which command is holding a reference on mddev->openers. Thanks, Songdiff --git i/drivers/md/md.c w/drivers/md/md.c index 78be7811a89f..3e9b718b32c1 100644 --- i/drivers/md/md.c +++ w/drivers/md/md.c@@ -7574,11 +7574,15 @@ static int md_ioctl(struct block_device *bdev,blk_mode_t mode, if (mddev->pers && atomic_read(&mddev->openers) > 1) { mutex_unlock(&mddev->open_mutex); err = -EBUSY; + pr_warn("%s return -EBUSY for %s with mddev->openers = %d\n", + __func__, mdname(mddev), atomic_read(&mddev->openers)); goto out; } if (test_and_set_bit(MD_CLOSING, &mddev->flags)) { mutex_unlock(&mddev->open_mutex); err = -EBUSY; + pr_warn("%s return -EBUSY for %s with MD_CLOSING bit set\n", + __func__, mdname(mddev)); goto out; } did_set_md_closing = true;@@ -7789,6 +7793,8 @@ static int md_open(struct gendisk *disk,blk_mode_t mode) goto out_unlock; atomic_inc(&mddev->openers); + pr_info("%s:%s openers++ = %d by %s\n", __func__, mdname(mddev), + atomic_read(&mddev->openers), current->comm); mutex_unlock(&mddev->open_mutex); disk_check_media_change(disk);@@ -7807,6 +7813,8 @@ static void md_release(struct gendisk *disk)BUG_ON(!mddev); atomic_dec(&mddev->openers); + pr_info("%s:%s openers-- = %d by %s\n", __func__, mdname(mddev), + atomic_read(&mddev->openers), current->comm); mddev_put(mddev); }It's pretty strange that I can't reproduce the issue after applied the patch. I tried to figure out which part affect the issue and found when I comment out the pr_info() In md_release(), the issue could be reproduced. -- Chia-Lin Kao(AceLan) http://blog.acelan.idv.tw/ E-Mail: acelan.kaoATcanonical.com (s/AT/@/)..