Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

From: Dragan Stancevic <hidden>
Date: 2023-08-22 21:23:15
Also in: lkml

On Tue, 3/28/23 17:01 Song Liu wrote:

On Thu, Mar 16, 2023 at 8:25 AM Marc Smith [off-list ref] wrote:
 >
 > On Tue, Mar 14, 2023 at 10:45 AM Marc Smith [off-list ref] wrote:
 > >
 > > On Tue, Mar 14, 2023 at 9:55 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
 > > >
 > > >
 > > >
 > > > On 3/14/23 21:25, Marc Smith wrote:
 > > > > On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
 > > > > [off-list ref] wrote:
 > > > >> Hi Donald,
 > > > >>
 > > > >> On 2/8/21 19:41, Donald Buczek wrote:
 > > > >>> Dear Guoqing,
 > > > >>>
 > > > >>> On 08.02.21 15:53, Guoqing Jiang wrote:
 > > > >>>>
 > > > >>>> On 2/8/21 12:38, Donald Buczek wrote:
 > > > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
 > > > >>>>>> sync_thread, like this.
 > > > >>>>>>
 > > > >>>>>>           /* resync has finished, collect result */
 > > > >>>>>>           mddev_unlock(mddev);
 > > > >>>>>>           md_unregister_thread(&mddev->sync_thread);
 > > > >>>>>>           mddev_lock(mddev);
 > > > >>>>> As above: While we wait for the sync thread to terminate, wouldn't it
 > > > >>>>> be a problem, if another user space operation takes the mutex?
 > > > >>>> I don't think other places can be blocked while hold mutex, otherwise
 > > > >>>> these places can cause potential deadlock. Please try above two lines
 > > > >>>> change. And perhaps others have better idea.
 > > > >>> Yes, this works. No deadlock after >11000 seconds,
 > > > >>>
 > > > >>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
 > > > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
 > > > >> Great. I will send a formal patch with your reported-by and tested-by.
 > > > >>
 > > > >> Thanks,
 > > > >> Guoqing
 > > > > I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
 > > > > of the patches that supposedly resolve this were applied to the stable
 > > > > kernels, however, one was omitted due to a regression:
 > > > > md: don't unregister sync_thread with reconfig_mutex held (upstream
 > > > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
 > > > >
 > > > > I don't see any follow-up on the thread from June 8th 2022 asking for
 > > > > this patch to be dropped from all stable kernels since it caused a
 > > > > regression.
 > > > >
 > > > > The patch doesn't appear to be present in the current mainline kernel
 > > > > (6.3-rc2) either. So I assume this issue is still present there, or it
 > > > > was resolved differently and I just can't find the commit/patch.
 > > >
 > > > It should be fixed by commit 9dfbdafda3b3 "md: unlock mddev before reap
 > > > sync_thread in action_store".
 > >
 > > Okay, let me try applying that patch... it does not appear to be
 > > present in my 5.4.229 kernel source. Thanks.
 >
 > Yes, applying this '9dfbdafda3b3 "md: unlock mddev before reap
 > sync_thread in action_store"' patch on top of vanilla 5.4.229 source
 > appears to fix the problem for me -- I can't reproduce the issue with
 > the script, and it's been running for >24 hours now. (Previously I was
 > able to induce the issue within a matter of minutes.)

Hi Marc,

Could you please run your reproducer on the md-tmp branch?

https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-tmp

This contains a different version of the fix by Yu Kuai.

Thanks,
Song

Hi Song, I can easily reproduce this issue on 5.10.133 and 5.10.53. The change
9dfbdafda3b3 "md: unlock mddev before reap sync_thread in action_store" does
not fix the issue for me.

But I did pull the changes from the md-tmp branch you are referring to:
https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-tmp

I was not totally clear on which change exactly to pull, but I pulled
the following changes:
Date        Commit message                                                       Author   Files  Lines
2023-03-28  md: enhance checking in md_check_recovery() [md-tmp]                 Yu Kuai  1      -7/+15
2023-03-28  md: wake up 'resync_wait' at last in md_reap_sync_thread()           Yu Kuai  1      -1/+1
2023-03-28  md: refactor idle/frozen_sync_thread()                               Yu Kuai  2      -4/+22
2023-03-28  md: add a mutex to synchronize idle and frozen in action_store()     Yu Kuai  2      -0/+8
2023-03-28  md: refactor action_store() for 'idle' and 'frozen'                  Yu Kuai  1      -16/+45

I used to be able to reproduce the lockup within minutes, but with those
changes the test system has been running for more than 120 hours.

When you said a "different fix", can you confirm that I grabbed the right
changes and that I need all 5 of them?

My second question: has this fix been submitted upstream yet?
If so, in which kernel version?

Thank you
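
For readers following the thread, the locking idea quoted above (drop
reconfig_mutex, wait for the sync thread to stop, then retake the mutex) can
be sketched as a user-space analogy with pthreads. This is only an
illustration under assumptions, not the kernel code: the names
`sync_worker`, `stop_worker_unlocked`, `run_demo` and `progress` are
hypothetical, and the worker here takes the mutex directly, whereas the
dependency chain in the md driver is more indirect.

```c
#include <pthread.h>

/* Stand-in for mddev->reconfig_mutex (user-space analogy only). */
static pthread_mutex_t reconfig_mutex = PTHREAD_MUTEX_INITIALIZER;
static int progress;

/* Hypothetical worker: it cannot finish without taking the mutex once. */
static void *sync_worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&reconfig_mutex);
    progress = 1;
    pthread_mutex_unlock(&reconfig_mutex);
    return NULL;
}

/*
 * Called with reconfig_mutex held. Joining the worker while still holding
 * the mutex would hang forever, because the worker blocks on the same
 * mutex. Dropping the mutex around the wait mirrors the quoted
 * mddev_unlock() / md_unregister_thread() / mddev_lock() sequence.
 */
static int stop_worker_unlocked(pthread_t worker)
{
    int rc;

    pthread_mutex_unlock(&reconfig_mutex);
    rc = pthread_join(worker, NULL);
    pthread_mutex_lock(&reconfig_mutex);
    return rc;
}

/* Returns 1 if the worker made progress and was reaped, -1 on error. */
int run_demo(void)
{
    pthread_t worker;
    int result;

    progress = 0;
    pthread_mutex_lock(&reconfig_mutex);
    if (pthread_create(&worker, NULL, sync_worker, NULL) != 0) {
        pthread_mutex_unlock(&reconfig_mutex);
        return -1;
    }
    if (stop_worker_unlocked(worker) != 0) {
        pthread_mutex_unlock(&reconfig_mutex);
        return -1;
    }
    result = progress;  /* read under the mutex, which is held again here */
    pthread_mutex_unlock(&reconfig_mutex);
    return result;
}
```

Built with -pthread, `run_demo()` is expected to return 1. Replacing the body
of `stop_worker_unlocked()` with a bare `pthread_join()` reproduces the hang
in this toy model. The concern raised earlier in the thread also applies to
the sketch: while the mutex is dropped, another caller may take it.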
