Re: [PATCH v5 01/14] md: don't ignore suspended array in md_check_recovery()
From: Xiao Ni <hidden>
Date: 2024-02-18 03:15:18
Also in:
dm-devel, lkml
Subsystem:
software raid (multiple disks) support, the rest · Maintainers:
Song Liu, Yu Kuai, Linus Torvalds
On Sun, Feb 18, 2024 at 10:34 AM Yu Kuai [off-list ref] wrote:
Hi, 在 2024/02/18 10:27, Xiao Ni 写道:quoted
On Sun, Feb 18, 2024 at 9:46 AM Yu Kuai [off-list ref] wrote:quoted
Hi, 在 2024/02/18 9:33, Xiao Ni 写道:quoted
The deadlock problem mentioned in this patch should not be right?No, I think it's right. Looks like you are expecting other problems, like mentioned in patch 6, to be fixed by this patch.Hi Kuai Could you explain why step1 and step2 from this comment can happen simultaneously? From the log, the process should be The process is : dev_remove->dm_destroy->__dm_destroy->dm_table_postsuspend_targets(raid_postsuspend) -> dm_table_destroy(raid_dtr). After suspending the array, it calls raid_dtr. So these two functions can't happen simultaneously.You're removing the target directly, however, dm can suspend the disk directly, you can simplily: 1) dmsetup suspend xxx 2) dmsetup remove xxx
For dm-raid, the design of suspend stops sync thread first and then it calls mddev_suspend to suspend array. So I'm curious why the sync thread can still exit when array is suspended. I know the reason now. Because before f52f5c71f (md: fix stopping sync thread), the process is raid_postsuspend->md_stop_writes->__md_stop_writes (__md_stop_writes sets MD_RECOVERY_FROZEN). In patch f52f5c71f, it doesn't set MD_RECOVERY_FROZEN in __md_stop_writes anymore. The process changes to 1. raid_postsuspend->md_stop_writes->__md_stop_writes->stop_sync_thread (wait until MD_RECOVERY_RUNNING clears) 2. md thread -> md_check_recovery -> unregister_sync_thread -> md_reap_sync_thread (clears MD_RECOVERY_RUNNING, stop_sync_thread returns, md_reap_sync_thread sets MD_RECOVERY_NEEDED) 3. raid_postsuspend->mddev_suspend 4. md sync thread starts again because __md_stop_writes doesn't set MD_RECOVERY_FROZEN. It's the reason why we can see sync thread still happens when raid is suspended. So the patch fix this problem should:
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 9e41a9aaba8b..666761466f02 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c@@ -6315,6 +6315,7 @@ static void md_clean(struct mddev *mddev) static void __md_stop_writes(struct mddev *mddev) { + set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); stop_sync_thread(mddev, true, false); del_timer_sync(&mddev->safemode_timer);
Like other places which call stop_sync_thread, it needs to set the MD_RECOVERY_FROZEN bit. Regards Xiao
Please also take a look at other patches, why step 1) can't stop sync thread. Thanks, Kuaiquoted
quoted
Noted that this patch just fix one case that MD_RECOVERY_RUNNING can't be cleared, I you are testing this patch alone, please make sure that you still triggered the exactly same case: - MD_RCOVERY_RUNNING can't be cleared while array is suspended.I'm not testing this patch. I want to understand the patch well. So I need to understand the issue first. I can't understand how this deadlock (step1,step2) happens. Regards Xiaoquoted
Thanks, Kuai.