Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition


From: Guoqing Jiang <hidden>
Date: 2020-12-03 01:56:20
Also in: lkml

Hi Donald,

On 12/2/20 18:28, Donald Buczek wrote:

Dear Guoqing,

unfortunately the patch didn't fix the problem (unless I messed it up 
with my logging). This is what I used:

     --- a/drivers/md/md.c
     +++ b/drivers/md/md.c
     @@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev *mddev)
                             clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
                             goto unlock;
                     }

I think you can add the MD_RECOVERY_CHECK test to the block above instead 
of adding a new block.

     +               if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
     +                   (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
     +                    test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
     +                       /* resync/recovery still happening */
     +                       pr_info("md: XXX BUGFIX applied\n");
     +                       clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
     +                       goto unlock;
     +               }
                     if (mddev->sync_thread) {
                             md_reap_sync_thread(mddev);
                             goto unlock;

With pausing and continuing the check four times an hour, I could 
trigger the problem after about 48 hours. This time, the other device 
(md0) has locked up on 
`echo idle > /sys/devices/virtual/block/md0/md/sync_action`, while the 
check of md1 is still ongoing:

Without the patch, md0 was fine while md1 was locked up. So the patch 
swaps the state of the two arrays, which is a little weird ...

What is the stack of the process? I guess it is the same as the stack of 
23333 in your previous mail, but just to confirm.

     Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
     md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
           [=>...................]  check =  8.5% (666852112/7813894144) finish=1271.2min speed=93701K/sec
           bitmap: 0/59 pages [0KB], 65536KB chunk
     md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
           [>....................]  check =  0.2% (19510348/7813894144) finish=253779.6min speed=511K/sec
           bitmap: 0/59 pages [0KB], 65536KB chunk

after 1 minute:

     Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
     md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
           [=>...................]  check =  8.6% (674914560/7813894144) finish=941.1min speed=126418K/sec
           bitmap: 0/59 pages [0KB], 65536KB chunk
     md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
           [>....................]  check =  0.2% (19510348/7813894144) finish=256805.0min speed=505K/sec
           bitmap: 0/59 pages [0KB], 65536KB chunk
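
In the two snapshots, md0's check position stays at 19510348 while md1 advances from 666852112 to 674914560. A minimal sketch for pulling that position out of mdstat-style text, so two samples can be compared, might look like the following. The function name `check_positions` is made up for illustration, and it assumes the unwrapped line layout that /proc/mdstat normally has.

```shell
#!/bin/sh
# Hypothetical helper: read mdstat-style text on stdin and print
# "<array> <current check position>" for every array with a check running.
check_positions() {
    awk '
        # Remember which array the following lines belong to.
        /^md[0-9]+ :/ { array = $1 }
        # A progress line looks like: check =  8.5% (666852112/7813894144) ...
        /check =/ {
            if (match($0, /\([0-9]+\/[0-9]+\)/)) {
                pos = substr($0, RSTART + 1, RLENGTH - 2)
                split(pos, parts, "/")
                print array, parts[1]
            }
        }
    '
}
```

Running `check_positions < /proc/mdstat` twice, a minute apart, and comparing the two outputs would show which array has stopped making progress.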

One data point I didn't mention in my previous mail is that the 
mdX_resync thread is not gone when the problem occurs:

     buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ ps -Af|fgrep [md
     root       134     2  0 Nov30 ?        00:00:00 [md]
     root      1321     2 27 Nov30 ?        12:57:14 [md0_raid6]
     root      1454     2 26 Nov30 ?        12:37:23 [md1_raid6]
     root      5845     2  0 16:20 ?        00:00:30 [md0_resync]
     root      5855     2 13 16:20 ?        00:14:11 [md1_resync]
     buczek    9880  9072  0 18:05 pts/0    00:00:00 grep -F [md
     buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ sudo cat /proc/5845/stack
     [<0>] md_bitmap_cond_end_sync+0x12d/0x170
     [<0>] raid5_sync_request+0x24b/0x390
     [<0>] md_do_sync+0xb41/0x1030
     [<0>] md_thread+0x122/0x160
     [<0>] kthread+0x118/0x130
     [<0>] ret_from_fork+0x1f/0x30

I guess md_bitmap_cond_end_sync+0x12d is the 
`wait_event(bitmap->mddev->recovery_wait, atomic_read(&bitmap->mddev->recovery_active) == 0);` 
in md-bitmap.c.

Could be. If so, I think md_done_sync was not triggered by the path 
md0_raid6 -> ... -> handle_stripe.

I'd suggest comparing the stacks of md0 and md1 to find the 
difference.

Thanks,
Guoqing
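
The suggested comparison of the md0 and md1 stacks could be scripted roughly as below. This is only a sketch: the function name `md_stacks` and the `PROC_DIR` override are hypothetical (the override exists so the function can be tried against a fake directory tree), and the thread names are assumed to follow the `mdX_raid6` / `mdX_resync` pattern seen in the `ps` output above. Reading `/proc/<pid>/stack` normally requires root.

```shell
#!/bin/sh
# Hypothetical helper: print the kernel stack of the raid6 and resync
# kernel threads belonging to each array named on the command line.
md_stacks() {
    proc="${PROC_DIR:-/proc}"   # PROC_DIR is an illustrative override
    for array in "$@"; do
        for comm in "$proc"/[0-9]*/comm; do
            [ -r "$comm" ] || continue
            # The thread may exit between the glob and the read.
            name=$(cat "$comm" 2>/dev/null) || continue
            case "$name" in
                "${array}_raid6"|"${array}_resync")
                    dir=${comm%/comm}
                    echo "== $name (pid ${dir##*/}) =="
                    cat "$dir/stack"
                    ;;
            esac
        done
    done
}
```

Saving the output of `md_stacks md0` and `md_stacks md1` to two files and running `diff` on them would show where the stuck array's threads differ from the healthy one's.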
