Thread (49 messages) 49 messages, 8 authors, 2023-09-25

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

From: Dragan Stancevic <hidden>
Date: 2023-08-22 21:23:15
Also in: lkml

On Tue, 3/28/23 17:01 Song Liu wrote:
On Thu, Mar 16, 2023 at 8:25=E2=80=AFAM Marc Smith [off-list ref] 
wr=
ote:
 >
 > On Tue, Mar 14, 2023 at 10:45=E2=80=AFAM Marc Smith 
[off-list ref]=
  wrote:
 > >
 > > On Tue, Mar 14, 2023 at 9:55=E2=80=AFAM Guoqing Jiang 
<guoqing.jiang@li=
nux.dev> wrote:
 > > >
 > > >
 > > >
 > > > On 3/14/23 21:25, Marc Smith wrote:
 > > > > On Mon, Feb 8, 2021 at 7:49=E2=80=AFPM Guoqing Jiang
 > > > > [off-list ref] wrote:
 > > > >> Hi Donald,
 > > > >>
 > > > >> On 2/8/21 19:41, Donald Buczek wrote:
 > > > >>> Dear Guoqing,
 > > > >>>
 > > > >>> On 08.02.21 15:53, Guoqing Jiang wrote:
 > > > >>>>
 > > > >>>> On 2/8/21 12:38, Donald Buczek wrote:
 > > > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
 > > > >>>>>> sync_thread, like this.
 > > > >>>>>>
 > > > >>>>>>           /* resync has finished, collect result */
 > > > >>>>>>           mddev_unlock(mddev);
 > > > >>>>>>           md_unregister_thread(&mddev->sync_thread);
 > > > >>>>>>           mddev_lock(mddev);
 > > > >>>>> As above: While we wait for the sync thread to terminate, 
would=
n't it
 > > > >>>>> be a problem, if another user space operation takes the mutex?
 > > > >>>> I don't think other places can be blocked while hold mutex, 
othe=
rwise
 > > > >>>> these places can cause potential deadlock. Please try above 
two =
lines
 > > > >>>> change. And perhaps others have better idea.
 > > > >>> Yes, this works. No deadlock after >11000 seconds,
 > > > >>>
 > > > >>> (Time till deadlock from previous runs/seconds: 1723, 37, 
434, 12=
65,
 > > > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
 > > > >> Great. I will send a formal patch with your reported-by and 
tested=
-by.
 > > > >>
 > > > >> Thanks,
 > > > >> Guoqing
 > > > > I'm still hitting this issue with Linux 5.4.229 -- it looks 
like 1/=
2
 > > > > of the patches that supposedly resolve this were applied to the 
sta=
ble
 > > > > kernels, however, one was omitted due to a regression:
 > > > > md: don't unregister sync_thread with reconfig_mutex held 
(upstream
 > > > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
 > > > >
 > > > > I don't see any follow-up on the thread from June 8th 2022 
asking f=
or
 > > > > this patch to be dropped from all stable kernels since it caused a
 > > > > regression.
 > > > >
 > > > > The patch doesn't appear to be present in the current mainline 
kern=
el
 > > > > (6.3-rc2) either. So I assume this issue is still present 
there, or=
  it
 > > > > was resolved differently and I just can't find the commit/patch.
 > > >
 > > > It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before 
rea=
p
 > > > sync_thread in action_store".
 > >
 > > Okay, let me try applying that patch... it does not appear to be
 > > present in my 5.4.229 kernel source. Thanks.
 >
 > Yes, applying this '9dfbdafda3b3 "md: unlock mddev before reap
 > sync_thread in action_store"' patch on top of vanilla 5.4.229 source
 > appears to fix the problem for me -- I can't reproduce the issue with
 > the script, and it's been running for >24 hours now. (Previously I was
 > able to induce the issue within a matter of minutes.)

Hi Marc,

Could you please run your reproducer on the md-tmp branch?

https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=3Dmd-tmp

This contains a different version of the fix by Yu Kuai.

Thanks,
Song
Hi Song, I can easily reproduce this issue on 5.10.133 and 5.10.53. The change
"9dfbdafda3b3 "md: unlock mddev before reap" does not fix the issue for me.

But I did pull the changes from the md-tmp branch you are refering:
https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=3Dmd-tmp

I was not totally clear on which change exactly to pull, but I pulled
the following changes:
2023-03-28 md: enhance checking in md_check_recovery()md-tmp	Yu Kuai	1 -7/+15
2023-03-28 md: wake up 'resync_wait' at last in md_reap_sync_thread()	Yu Kuai	1 -1/+1
2023-03-28 md: refactor idle/frozen_sync_thread()	Yu Kuai	2 -4/+22
2023-03-28 md: add a mutex to synchronize idle and frozen in action_store()	Yu Kuai	2 -0/+8
2023-03-28 md: refactor action_store() for 'idle' and 'frozen'	Yu Kuai	1 -16/+45

I used to be able to reproduce the lockup within minutes, but with those
changes the test system has been running for more than 120 hours.

When you said a "different fix", can you confirm that I grabbed the right
changes and that I need all 5 of them.

And second question was, has this fix been submitted upstream yet?
If so which kernel version?

Thank you

Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help