Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
From: Guoqing Jiang <hidden>
Date: 2023-03-14 13:56:45
Also in: lkml
On 3/14/23 21:25, Marc Smith wrote:
> On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang [off-list ref] wrote:
>> Hi Donald,
>>
>> On 2/8/21 19:41, Donald Buczek wrote:
>>> Dear Guoqing,
>>>
>>> On 08.02.21 15:53, Guoqing Jiang wrote:
>>>> On 2/8/21 12:38, Donald Buczek wrote:
>>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>>>> sync_thread, like this.
>>>>>>
>>>>>>     /* resync has finished, collect result */
>>>>>>     mddev_unlock(mddev);
>>>>>>     md_unregister_thread(&mddev->sync_thread);
>>>>>>     mddev_lock(mddev);
>>>>>
>>>>> As above: While we wait for the sync thread to terminate,
>>>>> wouldn't it be a problem, if another user space operation
>>>>> takes the mutex?
>>>>
>>>> I don't think other places can be blocked while hold mutex,
>>>> otherwise these places can cause potential deadlock. Please try
>>>> above two lines change. And perhaps others have better idea.
>>>
>>> Yes, this works. No deadlock after >11000 seconds,
>>>
>>> (Time till deadlock from previous runs/seconds: 1723, 37, 434,
>>> 1265, 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>>
>> Great. I will send a formal patch with your reported-by and
>> tested-by.
>>
>> Thanks,
>> Guoqing
>
> I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
> of the patches that supposedly resolve this were applied to the
> stable kernels, however, one was omitted due to a regression:
>
>   md: don't unregister sync_thread with reconfig_mutex held
>   (upstream commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
>
> I don't see any follow-up on the thread from June 8th 2022 asking
> for this patch to be dropped from all stable kernels since it caused
> a regression.
>
> The patch doesn't appear to be present in the current mainline
> kernel (6.3-rc2) either. So I assume this issue is still present
> there, or it was resolved differently and I just can't find the
> commit/patch.

It should be fixed by commit 9dfbdafda3b3 ("md: unlock mddev before
reap sync_thread in action_store").

Thanks,
Guoqing
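
For readers following the thread, here is a userspace sketch of the unlock / wait-for-thread / relock pattern from the quoted suggestion. It uses pthreads and is only an analogy, not kernel code: `config_lock`, `worker` and `stop_worker_without_deadlock` are made-up names that do not exist in md, and the real deadlock path in md may involve more threads than this two-party example.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

/* Hypothetical stand-ins: config_lock plays the role of reconfig_mutex,
 * worker() plays the role of the thread being unregistered. */
static pthread_mutex_t config_lock = PTHREAD_MUTEX_INITIALIZER;
static bool stop_requested;   /* protected by config_lock */
static int final_updates;     /* protected by config_lock */

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        /* The worker needs the lock to notice the stop request
         * and to do its last piece of bookkeeping. */
        pthread_mutex_lock(&config_lock);
        bool stop = stop_requested;
        if (stop)
            final_updates++;
        pthread_mutex_unlock(&config_lock);
        if (stop)
            return NULL;
        sched_yield();
    }
}

/* Starts a worker, asks it to stop, and waits for it.
 * Returns the number of final updates the worker made, or -1 if the
 * thread could not be created. */
int stop_worker_without_deadlock(void)
{
    pthread_t tid;
    int rc, seen;

    /* No other thread exists yet, so plain stores are fine here. */
    stop_requested = false;
    final_updates = 0;
    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;

    pthread_mutex_lock(&config_lock);
    stop_requested = true;

    /* Joining at this point, with config_lock still held, would hang:
     * the worker blocks on the lock and we block on the worker.
     * So: drop the lock, wait for the thread, take the lock again. */
    pthread_mutex_unlock(&config_lock);
    rc = pthread_join(tid, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_mutex_lock(&config_lock);

    seen = final_updates;
    pthread_mutex_unlock(&config_lock);
    return seen;
}
```

The cost of this pattern is the one raised in the quoted question: while the lock is dropped, another caller can take it, so any state read before the unlock has to be treated as possibly stale after the relock.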