Re: [bug report] INFO: task mdX_resync:42168 blocked for more than 122 seconds
From: Yu Kuai <hidden>
Date: 2024-05-17 02:49:28
Also in:
dm-devel, linux-block
Hi, 在 2024/05/17 10:25, Changhui Zhong 写道:
On Thu, May 16, 2024 at 7:42 PM Yu Kuai [off-list ref] wrote:quoted
Hi, 在 2024/05/16 19:21, Ming Lei 写道:quoted
Cc raid and dm list. On Thu, May 16, 2024 at 06:24:18PM +0800, Changhui Zhong wrote:quoted
Hello, when create lvm raid1, the command hang on for a long time. please help check it and let me know if you need any info/testing for it, thanks.Is this a new test, or a new problem?it is a new problem, I am not hit this issue on 6.9.0-rc4+
There is just one patch for raid1 applied since v6.9-rc4, and I think it's not related. Perhaps can you try to bisect?
quoted
quoted
quoted
repo:https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git branch:for-next commit: 59ef8180748269837975c9656b586daa16bb9def reproducer: dd if=/dev/zero bs=1M count=2000 of=file0.img dd if=/dev/zero bs=1M count=2000 of=file1.img dd if=/dev/zero bs=1M count=2000 of=file2.img dd if=/dev/zero bs=1M count=2000 of=file4.img losetup -fP --show file0.img losetup -fP --show file1.img losetup -fP --show file2.img losetup -fP --show file3.imgabove dd creat file4, here is file3.yeah,this is my spelling mistake, I created 4 files, file0/1/2/3quoted
quoted
quoted
pvcreate -y /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3 vgcreate black_bird /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3 lvcreate --type raid1 -m 3 -n non_synced_primary_raid_3legs_1 -L 1G black_bird /dev/loop0:0-300 /dev/loop1:0-300 /dev/loop2:0-300 /dev/loop3:0-300I don't understand what /dev/loopx:0-300 means, and I remove them, fix the above file4 typo, test on a xfs filesystem, and I can't reporduce the problem.I want to specify the space from disk blocks 0 to 300 of the loop device to create raid1,not all space of loop device, follow reproducer setps I can reproduced it 100%
Okay, I add the 0-300 and I still can't reporduce it, have no clue yet.
quoted
quoted
quoted
console log: May 21 21:57:41 dell-per640-04 journal: Create raid1 May 21 21:57:41 dell-per640-04 kernel: device-mapper: raid: Superblocks created for new raid set May 21 21:57:42 dell-per640-04 kernel: md/raid1:mdX: not clean -- starting background reconstruction May 21 21:57:42 dell-per640-04 kernel: md/raid1:mdX: active with 4 out of 4 mirrors May 21 21:57:42 dell-per640-04 kernel: mdX: bitmap file is out of date, doing full recovery May 21 21:57:42 dell-per640-04 kernel: md: resync of RAID array mdX May 21 21:57:42 dell-per640-04 systemd[1]: Started Device-mapper event daemon. May 21 21:57:42 dell-per640-04 dmeventd[42170]: dmeventd ready for processing. May 21 21:57:42 dell-per640-04 dmeventd[42170]: Monitoring RAID device black_bird-non_synced_primary_raid_3legs_1 for events. May 21 21:57:45 dell-per640-04 restraintd[1446]: *** Current Time: Tue May 21 21:57:45 2024 Localwatchdog at: Tue May 21 22:56:45 2024 May 21 21:58:45 dell-per640-04 restraintd[1446]: *** Current Time: Tue May 21 21:58:45 2024 Localwatchdog at: Tue May 21 22:56:45 2024 May 21 21:59:45 dell-per640-04 restraintd[1446]: *** Current Time: Tue May 21 21:59:45 2024 Localwatchdog at: Tue May 21 22:56:45 2024 May 21 21:59:53 dell-per640-04 kernel: INFO: task mdX_resync:42168 blocked for more than 122 seconds. May 21 21:59:53 dell-per640-04 kernel: Not tainted 6.9.0+ #1 May 21 21:59:53 dell-per640-04 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. May 21 21:59:53 dell-per640-04 kernel: task:mdX_resync state:D stack:0 pid:42168 tgid:42168 ppid:2 flags:0x00004000 May 21 21:59:53 dell-per640-04 kernel: Call Trace: May 21 21:59:53 dell-per640-04 kernel: <TASK> May 21 21:59:53 dell-per640-04 kernel: __schedule+0x222/0x670 May 21 21:59:53 dell-per640-04 kernel: ? blk_mq_flush_plug_list+0x5/0x20 May 21 21:59:53 dell-per640-04 kernel: schedule+0x2c/0xb0 May 21 21:59:53 dell-per640-04 kernel: raise_barrier+0x107/0x200 [raid1]Unless this is a deadlock, raise_barrier() should be waiting for normal IO that is issued to underlying disk to return. If you can reporduce the problem, can you check IO from underlying loop disks? cat /sys/block/loopx/inflightwhen this issue was triggered, the log I collected: [root@storageqe-103 ~]# cat /sys/block/loop0/inflight 0 0 [root@storageqe-103 ~]# cat /sys/block/loop1/inflight 0 0 [root@storageqe-103 ~]# cat /sys/block/loop2/inflight 0 0 [root@storageqe-103 ~]# cat /sys/block/loop3/inflight 0 0 [root@storageqe-103 ~]#
Thanks for the test, this do look like a deadlock, beside raise_barrier(), is there any other victim? I can't reporduce this, and I have no clue yet. The possible next step might be bisect to locate the blame commit first. Maybe related to dm-raid1. Thanks, Kuai
and the command "lvs" hang on also, [root@storageqe-103 ~]# lvs ^C Interrupted... Giving up waiting for lock. Can't get lock for black_bird. Cannot process volume group black_bird LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert home rhel_storageqe-103 -wi-ao---- <368.43g root rhel_storageqe-103 -wi-ao---- 70.00g swap rhel_storageqe-103 -wi-ao---- 7.70g [root@storageqe-103 ~]# [ 1352.761630] INFO: task mdX_resync:1547 blocked for more than 1105 seconds. [ 1352.769336] Not tainted 6.9.0+ #1 [ 1352.773629] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1352.782372] task:mdX_resync state:D stack:0 pid:1547 tgid:1547 ppid:2 flags:0x00004000 [ 1352.782380] Call Trace: [ 1352.782382] <TASK> [ 1352.782386] __schedule+0x222/0x670 [ 1352.782396] schedule+0x2c/0xb0 [ 1352.782402] raise_barrier+0x107/0x200 [raid1] [ 1352.782415] ? __pfx_autoremove_wake_function+0x10/0x10 [ 1352.782423] raid1_sync_request+0x12d/0xa50 [raid1] [ 1352.782435] ? prepare_to_wait_event+0x5f/0x190 [ 1352.782442] md_do_sync+0x660/0x1040 [ 1352.782449] ? __pfx_autoremove_wake_function+0x10/0x10 [ 1352.782457] md_thread+0xad/0x160 [ 1352.782462] ? __pfx_md_thread+0x10/0x10 [ 1352.782465] kthread+0xdc/0x110 [ 1352.782470] ? __pfx_kthread+0x10/0x10 [ 1352.782474] ret_from_fork+0x2d/0x50 [ 1352.782481] ? __pfx_kthread+0x10/0x10 [ 1352.782485] ret_from_fork_asm+0x1a/0x30 [ 1352.782491] </TASK> Thanks, Changhuiquoted
Thanks, Kuaiquoted
quoted
May 21 21:59:53 dell-per640-04 kernel: ? __pfx_autoremove_wake_function+0x10/0x10 May 21 21:59:53 dell-per640-04 kernel: raid1_sync_request+0x12d/0xa50 [raid1] May 21 21:59:53 dell-per640-04 kernel: ? __pfx_raid1_sync_request+0x10/0x10 [raid1] May 21 21:59:53 dell-per640-04 kernel: md_do_sync+0x660/0x1040 May 21 21:59:53 dell-per640-04 kernel: ? __pfx_autoremove_wake_function+0x10/0x10 May 21 21:59:53 dell-per640-04 kernel: md_thread+0xad/0x160 May 21 21:59:53 dell-per640-04 kernel: ? __pfx_md_thread+0x10/0x10 May 21 21:59:53 dell-per640-04 kernel: kthread+0xdc/0x110 May 21 21:59:53 dell-per640-04 kernel: ? __pfx_kthread+0x10/0x10 May 21 21:59:53 dell-per640-04 kernel: ret_from_fork+0x2d/0x50 May 21 21:59:53 dell-per640-04 kernel: ? __pfx_kthread+0x10/0x10 May 21 21:59:53 dell-per640-04 kernel: ret_from_fork_asm+0x1a/0x30 May 21 21:59:53 dell-per640-04 kernel: </TASK> -- Best Regards, Changhui.