Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
From: Guoqing Jiang <hidden>
Date: 2021-01-26 14:10:20
Also in: lkml
On 1/26/21 13:58, Donald Buczek wrote:
>> Hmm, how about wake the waiter up in the while loop of raid5d?
>>
>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
>>  			md_check_recovery(mddev);
>>  			spin_lock_irq(&conf->device_lock);
>>  		}
>> +
>> +		if ((atomic_read(&conf->active_stripes)
>> +		     < (conf->max_nr_stripes * 3 / 4) ||
>> +		     (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
>> +			wake_up(&conf->wait_for_stripe);
>>  	}
>>  	pr_debug("%d stripes handled\n", handled);
>
> Hmm... With this patch on top of your other one, we still have the basic
> symptoms (md3_raid6 busy looping), but the sync thread is now hanging at
>
>     root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>     [<0>] md_do_sync.cold+0x8ec/0x97c
>     [<0>] md_thread+0xab/0x160
>     [<0>] kthread+0x11b/0x140
>     [<0>] ret_from_fork+0x22/0x30
>
> instead, which is
> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963

Not sure why recovery_active is not zero, because it is set to 0 before
blk_start_plug, raid5_sync_request returns 0, and skipped is also set to 1.
Perhaps handle_stripe calls md_done_sync.

Could you double check the value of recovery_active? Or just don't wait if
the resync thread is interrupted:

	wait_event(mddev->recovery_wait,
		   test_bit(MD_RECOVERY_INTR, &mddev->recovery) ||
		   !atomic_read(&mddev->recovery_active));

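
For readers following the thread, the suggested change to the wait condition can be modelled in plain userspace C. This is only a sketch under stated assumptions: `struct sync_state`, `drained()` and `may_stop_waiting()` are hypothetical names invented for illustration, C11 atomics stand in for the kernel's `atomic_t` and `test_bit()`, and the "current" predicate is inferred from the suggestion above rather than copied from md.c.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the two mddev fields involved. */
struct sync_state {
	atomic_int  recovery_active; /* models mddev->recovery_active */
	atomic_bool interrupted;     /* models the MD_RECOVERY_INTR bit */
};

/* Assumed current condition: wait until no sync I/O is outstanding. */
bool drained(struct sync_state *s)
{
	return atomic_load(&s->recovery_active) == 0;
}

/* Suggested condition: also stop waiting once the sync was interrupted. */
bool may_stop_waiting(struct sync_state *s)
{
	return atomic_load(&s->interrupted) || drained(s);
}
```

With a nonzero `recovery_active` and the interrupt flag set, `drained()` stays false while `may_stop_waiting()` becomes true, which is the case where the sync thread would no longer block. Whether skipping the wait is safe in the real driver is a separate question this model does not answer.
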
> And, unlike before, "md: md3: data-check interrupted." from the pr_info
> two lines above appears in dmesg.

Yes, that is intentional, since MD_RECOVERY_INTR is set by writing "idle"
to sync_action.

Anyway, I will try the script and investigate the issue further.

Thanks,
Guoqing