Thread (5 messages) 5 messages, 2 authors, 2023-05-08

Re: mdadm grow raid 5 to 6 failure (crash)

From: David Gilmour <hidden>
Date: 2023-05-08 05:58:07

I'm not sure what I'm looking for here but here is the output of the
inflight file immediately after the mdadm assemble hangs. Does this
indicate something accessing the array?

#cat /sys/block/md127/inflight
       1        0

Also attached is an strace of my mdadm command that hung in case that
reveals something relevant:
strace mdadm --assemble --verbose
--backup-file=/root/mdadm5-6_backup_md127 --invalid-backup /dev/md127
/dev/sda /dev/sdh /dev/sdg /dev/sdc /dev/sde /dev/sdf --force 2>&1 |
tee mdadm_strace_output.txt

On Sun, May 7, 2023 at 7:23 PM Yu Kuai [off-list ref] wrote:
Hi,

在 2023/05/06 21:19, David Gilmour 写道:
quoted
quoted
From what I can tell it does look very similar. I stopped the
systemd-udevd service and renamed it to systemd-udevd.bak. My system
still hung on the assemble command. I'm not savvy enough to decode the
details here but does the "mddev_suspend.part.0+0xdf/0x150" line in
the process stack output suggest the same i/o block the other post
indicates?

× systemd-udevd.service - Rule-based Manager for Device Events and Files
      Loaded: loaded (/usr/lib/systemd/system/systemd-udevd.service; static)
      Active: failed (Result: exit-code) since Sat 2023-05-06 06:59:11
MDT; 1min 27s ago
    Duration: 1d 20h 16min 29.633s
TriggeredBy: × systemd-udevd-kernel.socket
              × systemd-udevd-control.socket
        Docs: man:systemd-udevd.service(8)
              man:udev(7)
     Process: 27440 ExecStart=/usr/lib/systemd/systemd-udevd
(code=exited, status=203/EXEC)
    Main PID: 27440 (code=exited, status=203/EXEC)
         CPU: 5ms

----------------------
#mdadm --assemble --verbose --backup-file=/root/mdadm5-6_backup_md127
--invalid-backup /dev/md127 /dev/sda /dev/sdh /dev/sdg /dev/sdc
/dev/sdb /dev/sdf --force
mdadm: looking for devices for /dev/md127
mdadm: /dev/sda is identified as a member of /dev/md127, slot 0.
mdadm: /dev/sdh is identified as a member of /dev/md127, slot 1.
mdadm: /dev/sdg is identified as a member of /dev/md127, slot 2.
mdadm: /dev/sdc is identified as a member of /dev/md127, slot 3.
mdadm: /dev/sdb is identified as a member of /dev/md127, slot 4.
mdadm: /dev/sdf is identified as a member of /dev/md127, slot 5.
mdadm: /dev/md127 has an active reshape - checking if critical section
needs to be restored
mdadm: No backup metadata on /root/mdadm5-6_backup_md127
mdadm: Failed to find backup of critical section
mdadm: continuing without restoring backup
mdadm: added /dev/sdh to /dev/md127 as 1
mdadm: added /dev/sdg to /dev/md127 as 2
mdadm: added /dev/sdc to /dev/md127 as 3
mdadm: added /dev/sdb to /dev/md127 as 4
mdadm: added /dev/sdf to /dev/md127 as 5 (possibly out of date)
mdadm: added /dev/sda to /dev/md127 as 0

#hangs indefinitely at this point in the output

------------------------------------------


root       27454  0.0  0.0   3812  2656 pts/1    D+   07:00   0:00
mdadm --assemble --verbose --backup-file=/root/mdadm5-6_backup_md127
--invalid-backup /dev/md127 /dev/sda /dev/sdh /dev/sdg /dev/sdc
/dev/sdb /dev/sdf --force
root       27457  0.0  0.0      0     0 ?        S    07:00   0:00 [md127_raid6]

#cat /proc/27454/stack
[<0>] mddev_suspend.part.0+0xdf/0x150
[<0>] suspend_lo_store+0xc5/0xf0
[<0>] md_attr_store+0x83/0xf0
[<0>] kernfs_fop_write_iter+0x124/0x1b0
[<0>] new_sync_write+0xff/0x190
[<0>] vfs_write+0x1ef/0x280
[<0>] ksys_write+0x5f/0xe0
[<0>] do_syscall_64+0x5c/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

#cat /proc/27457/stack
[<0>] md_thread+0x122/0x160
[<0>] kthread+0xe0/0x100
[<0>] ret_from_fork+0x22/0x30
Is there any thread stuck at raid5_make_request? something like below:

Apr 23 19:17:22 atom kernel: task:systemd-udevd   state:D stack:    0
pid: 8121 ppid:   706 flags:0x00000006
Apr 23 19:17:22 atom kernel: Call Trace:
Apr 23 19:17:22 atom kernel:  <TASK>
Apr 23 19:17:22 atom kernel:  __schedule+0x20a/0x550
Apr 23 19:17:22 atom kernel:  schedule+0x5a/0xc0
Apr 23 19:17:22 atom kernel:  schedule_timeout+0x11f/0x160
Apr 23 19:17:22 atom kernel:  ? make_stripe_request+0x284/0x490 [raid456]
Apr 23 19:17:22 atom kernel:  wait_woken+0x50/0x70
Apr 23 19:17:22 atom kernel:  raid5_make_request+0x2cb/0x3e0 [raid456]
Apr 23 19:17:22 atom kernel:  ? sched_show_numa+0xf0/0xf0
Apr 23 19:17:22 atom kernel:  md_handle_request+0x132/0x1e0
Apr 23 19:17:22 atom kernel:  ? do_mpage_readpage+0x282/0x6b0
Apr 23 19:17:22 atom kernel:  __submit_bio+0x86/0x130
Apr 23 19:17:22 atom kernel:  __submit_bio_noacct+0x81/0x1f0
Apr 23 19:17:22 atom kernel:  mpage_readahead+0x15c/0x1d0
Apr 23 19:17:22 atom kernel:  ? blkdev_write_begin+0x20/0x20
Apr 23 19:17:22 atom kernel:  read_pages+0x58/0x2f0
Apr 23 19:17:22 atom kernel:  page_cache_ra_unbounded+0x137/0x180
Apr 23 19:17:22 atom kernel:  force_page_cache_ra+0xc5/0xf0
Apr 23 19:17:22 atom kernel:  filemap_get_pages+0xe4/0x350
Apr 23 19:17:22 atom kernel:  filemap_read+0xbe/0x3c0
Apr 23 19:17:22 atom kernel:  ? make_kgid+0x13/0x20
Apr 23 19:17:22 atom kernel:  ? deactivate_locked_super+0x90/0xa0
Apr 23 19:17:22 atom kernel:  blkdev_read_iter+0xaf/0x170
Apr 23 19:17:22 atom kernel:  new_sync_read+0xf9/0x180
Apr 23 19:17:22 atom kernel:  vfs_read+0x13c/0x190
Apr 23 19:17:22 atom kernel:  ksys_read+0x5f/0xe0
Apr 23 19:17:22 atom kernel:  do_syscall_64+0x59/0x90

By the way, cat /sys/block/mdxx/inflight can prove this as well.

If this is the case, can you find out who is accessing the array?

Thanks,
Kuai

Attachments

Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help