Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs

From: Song Liu <song@kernel.org>
Date: 2023-09-08 20:25:30
Also in: lkml, regressions

Possibly related (same subject, not in this thread)

2023-09-07 · Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs · Mariusz Tkaczyk <hidden>
2023-09-07 · Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs · Yu Kuai <hidden>
2023-09-06 · Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs · Mariusz Tkaczyk <hidden>
2023-09-06 · Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs · AceLan Kao <hidden>
2023-08-31 · Re: Infiniate systemd loop when power off the machine with multiple MD RAIDs · Mariusz Tkaczyk <hidden>

On Thu, Sep 7, 2023 at 8:24 AM Mariusz Tkaczyk
[off-list ref] wrote:

quoted hunk ↗ jump to hunk

On Thu, 7 Sep 2023 20:53:30 +0800
Yu Kuai [off-list ref] wrote:

quoted

Hi,

在 2023/09/07 20:41, Mariusz Tkaczyk 写道:

quoted

On Thu, 7 Sep 2023 20:14:03 +0800
Yu Kuai [off-list ref] wrote:

quoted

Hi,

在 2023/09/07 19:26, Yu Kuai 写道:

quoted

Hi,

在 2023/09/07 18:18, Mariusz Tkaczyk 写道:

quoted

On Thu, 7 Sep 2023 10:04:11 +0800
Yu Kuai [off-list ref] wrote:

quoted

Hi,

在 2023/09/06 18:27, Mariusz Tkaczyk 写道:

quoted

On Wed, 6 Sep 2023 14:26:30 +0800
AceLan Kao [off-list ref] wrote:

quoted

   From previous testing, I don't think it's an issue in systemd, so I
did a simple test and found the issue is gone.
You only need to add a small delay in md_release(), then the issue
can't be reproduced.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 78be7811a89f..ef47e34c1af5 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c

@@ -7805,6 +7805,7 @@ static void md_release(struct gendisk *disk)

{
          struct mddev *mddev = disk->private_data;

+       msleep(10);
          BUG_ON(!mddev);
          atomic_dec(&mddev->openers);
          mddev_put(mddev);

I have repro and I tested it on my setup. It is not working for me.
My setup could be more "advanced" to maximalize chance of reproduction:

# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] [raid10] [raid0]
md121 : active raid0 nvme2n1[1] nvme5n1[0]
         7126394880 blocks super external:/md127/0 128k chunks

md122 : active raid10 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
         104857600 blocks super external:/md126/0 64K chunks 2
near-copies
[4/4] [UUUU]

md123 : active raid5 nvme6n1[3] nvme4n1[2] nvme1n1[1] nvme7n1[0]
         2655765504 blocks super external:/md126/1 level 5, 32k chunk,
algorithm 0 [4/4] [UUUU]

md124 : active raid1 nvme0n1[1] nvme3n1[0]
         99614720 blocks super external:/md125/0 [2/2] [UU]

md125 : inactive nvme3n1[1](S) nvme0n1[0](S)
         10402 blocks super external:imsm

md126 : inactive nvme7n1[3](S) nvme1n1[2](S) nvme6n1[1](S)
nvme4n1[0](S)
         20043 blocks super external:imsm

md127 : inactive nvme2n1[1](S) nvme5n1[0](S)
         10402 blocks super external:imsm

I have almost 99% repro ratio, slowly moving forward..

It is endless loop because systemd-shutdown sends ioctl "stop_array"
which
is successful but array is not stopped. For that reason it sets
"changed =
true".

How does systemd-shutdown judge if array is stopped? cat /proc/mdstat or
ls /dev/md* or other way?

Hi Yu,

It trusts return result, I confirmed that 0 is returned.
The most weird is we are returning 0 but array is still there, and it is
stopped again in next systemd loop. I don't understand why yet..

quoted

Systemd-shutdown see the change and retries to check if there is
something
else which can be stopped now, and again, again...

I will check what is returned first, it could be 0 or it could be
positive
errno (nit?) because systemd cares "if(r < 0)".

I do noticed that there are lots of log about md123 stopped:

[ 1371.834034] md122:systemd-shutdow bd_prepare_to_claim return -16
[ 1371.840294] md122:systemd-shutdow blkdev_get_by_dev return -16
[ 1371.846845] md: md123 stopped.
[ 1371.850155] md122:systemd-shutdow bd_prepare_to_claim return -16
[ 1371.856411] md122:systemd-shutdow blkdev_get_by_dev return -16
[ 1371.862941] md: md123 stopped.

And md_ioctl->do_md_stop doesn't have error path after printing this
log, hence 0 will be returned to user.

The normal case is that:

open md123
ioctl STOP_ARRAY -> all rdev should be removed from array
close md123 -> mddev will finally be freed by:
     md_release
      mddev_put
       set_bit(MD_DELETED, &mddev->flags) -> user shound not see this
mddev
       queue_work(md_misc_wq, &mddev->del_work)

     mddev_delayed_delete
      kobject_put(&mddev->kobj)

     md_kobj_release
      del_gendisk
       md_free_disk
        mddev_free

Ok thanks, I understand that md_release is called on descriptor
closing, right?

Yes, normally close md123 should drop that last reference.

quoted

Now that you can reporduce this problem 99%, can you dig deeper and find
out what is wrong?

Yes, working on it!

My first idea was that mddev_get and mddev_put are missing on
md_ioctl() path
but it doesn't help for the issue. My motivation here was that
md_attr_store and
md_attr_show are using them.

Systemd regenerates list of MD arrays on every loop and it is always
there, systemd is able to open file descriptor (maybe inactive?).

md123 should not be opended again, ioctl(STOP_ARRAY) already set the
flag 'MD_CLOSING' to prevent that. Are you sure that systemd-shutdown do
open and close the array in each loop?

I realized that I'm wrong here. 'MD_CLOSING' is cleared before ioctl
return by commit 065e519e71b2 ("md: MD_CLOSING needs to be cleared after
called md_set_readonly or do_md_stop").

I'm confused here, commit message said 'MD_CLOSING' shold not be set for
the case STOP_ARRAY_RO, but I don't understand why it's cleared for
STOP_ARRAY as well.

Can you try the following patch?

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3afd57622a0b..31b9cec7e4c0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c

@@ -7668,7 +7668,8 @@ static int md_ioctl(struct block_device *bdev,

fmode_t mode,
                         err = -EBUSY;
                         goto out;
                 }
-               did_set_md_closing = true;
+               if (cmd == STOP_ARRAY_RO)
+                       did_set_md_closing = true;
                 mutex_unlock(&mddev->open_mutex);
                 sync_blockdev(bdev);
         }

I think prevent array to be opened again after STOP_ARRAY might fix
this.

I didn't help. I tried much more:

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0fe7ab6e8ab9..807387f37755 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c

@@ -608,7 +608,7 @@ static inline struct mddev *mddev_get(struct mddev *mddev)
 {
        lockdep_assert_held(&all_mddevs_lock);

-       if (test_bit(MD_DELETED, &mddev->flags))
+       if (test_bit(MD_DELETED, &mddev->flags) || test_bit(MD_CLOSING,
  &mddev->flags)) return NULL;
        atomic_inc(&mddev->active);
        return mddev;

Issue is still reproducible. I was shocked so I removed clearing MD_CLOSING:
 out:
-       if(did_set_md_closing)
-               clear_bit(MD_CLOSING, &mddev->flags);
+       //if(did_set_md_closing)
+       //      clear_bit(MD_CLOSING, &mddev->flags);

Still reproducible. Then I found this:

@@ -6177,7 +6179,7 @@ static void md_clean(struct mddev *mddev)
        mddev->persistent = 0;
        mddev->level = LEVEL_NONE;
        mddev->clevel[0] = 0;
-       mddev->flags = 0;
+       //mddev->flags = 0;
        mddev->sb_flags = 0;
        mddev->ro = MD_RDWR;
        mddev->metadata_type[0] = 0;

Issue not reproducible. I can see in log that attempt to stop device is not
repeated.

Well, with the change of using time-sensitive lock in mddev_put we need to
ensure that we can still can remove mddevice so setting and preserving
MD_CLOSING is reasonable for me. I will combine this into patches tomorrow.

Yu, really appreciate your help! Glad to heave you here.

Thanks to everyone for looking into this tricky issue and running various tests!

Mariusz, please let me know if you need any help with the patch. Hopefully we
can fix this soon.

Thanks again,
Song

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help