Thread (11 messages) 11 messages, 2 authors, 2021-10-26

Re: Errors after successful disk replace

From: Qu Wenruo <hidden>
Date: 2021-10-26 12:17:54


On 2021/10/26 20:16, Emil Heimpel wrote:
After reapplying the thermal paste and adding a fan to my storage controller (LSI SAS2008) I haven't encountered any abnormal behavior of my drives. Scrub is running (disk by disk) and everything looks fine so far.

I only got one task blocked message in dmesg, is that anything I should worry about?
Sometimes we can also hit such blocked message for certain test cases.

As long as the fs is still submitting IO, and you find proper CPU usage,
it should be fine.

Thanks,
Qu

Sun Oct 24 18:22:11 2021] BTRFS info (device sdg1): scrub: started on devid 5
[Mon Oct 25 15:31:23 2021] BTRFS info (device sdg1): scrub: finished on devid 5 with status: 0
[Mon Oct 25 15:31:26 2021] BTRFS info (device sdg1): scrub: started on devid 3
[Mon Oct 25 21:12:34 2021] perf: interrupt took too long (2501 > 2500), lowering kernel.perf_event_max_sample_rate to 79800
[Mon Oct 25 23:01:43 2021] hrtimer: interrupt took 13811 ns
[Tue Oct 26 02:58:17 2021] BTRFS info (device sdg1): scrub: finished on devid 3 with status: 0
[Tue Oct 26 02:58:21 2021] BTRFS info (device sdg1): scrub: started on devid 6
[Tue Oct 26 12:23:36 2021] INFO: task btrfs:341674 blocked for more than 122 seconds.
[Tue Oct 26 12:23:36 2021]       Tainted: G           OE     5.14.14-arch1-1 #1
[Tue Oct 26 12:23:36 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Oct 26 12:23:36 2021] task:btrfs           state:D stack:    0 pid:341674 ppid:341658 flags:0x00004000
[Tue Oct 26 12:23:36 2021] Call Trace:
[Tue Oct 26 12:23:36 2021]  __schedule+0x333/0x1530
[Tue Oct 26 12:23:36 2021]  ? psi_task_switch+0xc2/0x1f0
[Tue Oct 26 12:23:36 2021]  ? autoremove_wake_function+0x2c/0x50
[Tue Oct 26 12:23:36 2021]  schedule+0x59/0xc0
[Tue Oct 26 12:23:36 2021]  __scrub_blocked_if_needed+0xa0/0xf0 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:23:36 2021]  ? do_wait_intr_irq+0xa0/0xa0
[Tue Oct 26 12:23:36 2021]  scrub_pause_off+0x21/0x50 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:23:36 2021]  scrub_stripe+0x452/0x1580 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:23:36 2021]  ? do_wait_intr_irq+0xa0/0xa0
[Tue Oct 26 12:23:36 2021]  ? __btrfs_end_transaction+0xf6/0x210 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:23:36 2021]  ? scrub_chunk+0xcd/0x130 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:23:36 2021]  scrub_chunk+0xcd/0x130 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:23:36 2021]  scrub_enumerate_chunks+0x354/0x790 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:23:36 2021]  ? do_wait_intr_irq+0xa0/0xa0
[Tue Oct 26 12:23:36 2021]  btrfs_scrub_dev+0x23d/0x570 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:23:36 2021]  btrfs_ioctl+0x1410/0x2df0 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:23:36 2021]  ? __x64_sys_ioctl+0x82/0xb0
[Tue Oct 26 12:23:36 2021]  __x64_sys_ioctl+0x82/0xb0
[Tue Oct 26 12:23:36 2021]  do_syscall_64+0x5c/0x80
[Tue Oct 26 12:23:36 2021]  ? create_task_io_context+0xc7/0x110
[Tue Oct 26 12:23:36 2021]  ? get_task_io_context+0x48/0x80
[Tue Oct 26 12:23:36 2021]  ? set_task_ioprio+0x97/0xa0
[Tue Oct 26 12:23:36 2021]  ? __do_sys_ioprio_set+0x5e/0x300
[Tue Oct 26 12:23:36 2021]  ? syscall_exit_to_user_mode+0x23/0x40
[Tue Oct 26 12:23:36 2021]  ? syscall_exit_to_user_mode+0x23/0x40
[Tue Oct 26 12:23:36 2021]  ? do_syscall_64+0x69/0x80
[Tue Oct 26 12:23:36 2021]  ? exit_to_user_mode_prepare+0x77/0x170
[Tue Oct 26 12:23:36 2021]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[Tue Oct 26 12:23:36 2021] RIP: 0033:0x7fba22e2559b
[Tue Oct 26 12:23:36 2021] RSP: 002b:00007fba22cf7c98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Tue Oct 26 12:23:36 2021] RAX: ffffffffffffffda RBX: 000056511ecc6bc0 RCX: 00007fba22e2559b
[Tue Oct 26 12:23:36 2021] RDX: 000056511ecc6bc0 RSI: 00000000c400941b RDI: 0000000000000003
[Tue Oct 26 12:23:36 2021] RBP: 0000000000000000 R08: 00007fba22cf8640 R09: 0000000000000000
[Tue Oct 26 12:23:36 2021] R10: 00007fba22cf8640 R11: 0000000000000246 R12: 00007ffec33d061e
[Tue Oct 26 12:23:36 2021] R13: 00007ffec33d061f R14: 0000000000000000 R15: 00007fba22cf8640
[Tue Oct 26 12:27:42 2021] INFO: task btrfs:341674 blocked for more than 122 seconds.
[Tue Oct 26 12:27:42 2021]       Tainted: G           OE     5.14.14-arch1-1 #1
[Tue Oct 26 12:27:42 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Oct 26 12:27:42 2021] task:btrfs           state:D stack:    0 pid:341674 ppid:341658 flags:0x00004000
[Tue Oct 26 12:27:42 2021] Call Trace:
[Tue Oct 26 12:27:42 2021]  __schedule+0x333/0x1530
[Tue Oct 26 12:27:42 2021]  ? autoremove_wake_function+0x2c/0x50
[Tue Oct 26 12:27:42 2021]  schedule+0x59/0xc0
[Tue Oct 26 12:27:42 2021]  __scrub_blocked_if_needed+0xa0/0xf0 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:27:42 2021]  ? do_wait_intr_irq+0xa0/0xa0
[Tue Oct 26 12:27:42 2021]  scrub_pause_off+0x21/0x50 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:27:42 2021]  scrub_stripe+0x452/0x1580 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:27:42 2021]  ? __update_load_avg_cfs_rq+0x27d/0x2e0
[Tue Oct 26 12:27:42 2021]  ? __btrfs_end_transaction+0xf6/0x210 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:27:42 2021]  ? kmem_cache_free+0x107/0x410
[Tue Oct 26 12:27:42 2021]  ? scrub_chunk+0xcd/0x130 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:27:42 2021]  scrub_chunk+0xcd/0x130 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:27:42 2021]  scrub_enumerate_chunks+0x354/0x790 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:27:42 2021]  ? do_wait_intr_irq+0xa0/0xa0
[Tue Oct 26 12:27:42 2021]  btrfs_scrub_dev+0x23d/0x570 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:27:42 2021]  btrfs_ioctl+0x1410/0x2df0 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:27:42 2021]  ? __x64_sys_ioctl+0x82/0xb0
[Tue Oct 26 12:27:42 2021]  __x64_sys_ioctl+0x82/0xb0
[Tue Oct 26 12:27:42 2021]  do_syscall_64+0x5c/0x80
[Tue Oct 26 12:27:42 2021]  ? create_task_io_context+0xc7/0x110
[Tue Oct 26 12:27:42 2021]  ? get_task_io_context+0x48/0x80
[Tue Oct 26 12:27:42 2021]  ? set_task_ioprio+0x97/0xa0
[Tue Oct 26 12:27:42 2021]  ? __do_sys_ioprio_set+0x5e/0x300
[Tue Oct 26 12:27:42 2021]  ? syscall_exit_to_user_mode+0x23/0x40
[Tue Oct 26 12:27:42 2021]  ? syscall_exit_to_user_mode+0x23/0x40
[Tue Oct 26 12:27:42 2021]  ? do_syscall_64+0x69/0x80
[Tue Oct 26 12:27:42 2021]  ? exit_to_user_mode_prepare+0x77/0x170
[Tue Oct 26 12:27:42 2021]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[Tue Oct 26 12:27:42 2021] RIP: 0033:0x7fba22e2559b
[Tue Oct 26 12:27:42 2021] RSP: 002b:00007fba22cf7c98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Tue Oct 26 12:27:42 2021] RAX: ffffffffffffffda RBX: 000056511ecc6bc0 RCX: 00007fba22e2559b
[Tue Oct 26 12:27:42 2021] RDX: 000056511ecc6bc0 RSI: 00000000c400941b RDI: 0000000000000003
[Tue Oct 26 12:27:42 2021] RBP: 0000000000000000 R08: 00007fba22cf8640 R09: 0000000000000000
[Tue Oct 26 12:27:42 2021] R10: 00007fba22cf8640 R11: 0000000000000246 R12: 00007ffec33d061e
[Tue Oct 26 12:27:42 2021] R13: 00007ffec33d061f R14: 0000000000000000 R15: 00007fba22cf8640

Thanks,
Emil

Oct 19, 2021 14:46:46 Qu Wenruo [off-list ref]:
quoted

On 2021/10/19 20:38, Emil Heimpel wrote:
quoted
So it finished after 2 minutes?

[Tue Oct 19 14:13:51 2021] BTRFS info (device sde1): continuing dev_replace from <missing disk> (devid 1) to target /dev/sde1 @74%
[Tue Oct 19 14:15:39 2021] BTRFS info (device sde1): dev_replace from <missing disk> (devid 1) to /dev/sde1 finished
Then this means it previously only have the dev replace status not
cleaned up.

Thus it ends pretty quick.
quoted

Now I at least have an expected filesystem show:

Label: 'BlueButter'  uuid: 7e3378e6-da46-4a60-b9b8-1bcc306986e3
         Total devices 6 FS bytes used 20.96TiB
         devid    1 size 7.28TiB used 5.46TiB path /dev/sde1
         devid    2 size 7.28TiB used 5.46TiB path /dev/sdb1
         devid    3 size 2.73TiB used 2.73TiB path /dev/sdg1
         devid    4 size 2.73TiB used 2.73TiB path /dev/sdd1
         devid    5 size 7.28TiB used 4.81TiB path /dev/sdf1
         devid    6 size 7.28TiB used 5.33TiB path /dev/sdc1

And a nondegraded remount worked too.
And it's time for a full fs scrub to find out how consistent the fs is.

For btrfs RAID56, every time a unexpected shutdown is hit, a scrub is
strongly recommended.

And even routine scrub would be a plus for btrfs RAID56.

Thanks,
Qu
quoted
Thanks,
Emil

Oct 19, 2021 14:20:21 Qu Wenruo [off-list ref]:
quoted

On 2021/10/19 20:16, Emil Heimpel wrote:
quoted
Color me suprised:


[74713.072745] BTRFS info (device sde1): flagging fs with big metadata feature
[74713.072755] BTRFS info (device sde1): allowing degraded mounts
[74713.072758] BTRFS info (device sde1): using free space tree
[74713.072760] BTRFS info (device sde1): has skinny extents
[74713.104297] BTRFS warning (device sde1): devid 1 uuid 51645efd-bf95-458d-b5ae-b31623533abb is missing
[74714.675001] BTRFS info (device sde1): bdev (efault) errs: wr 52950, rd 8161, flush 0, corrupt 1221, gen 0
[74714.675015] BTRFS info (device sde1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 228, gen 0
[74714.675025] BTRFS info (device sde1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 140, gen 0
[74751.033383] BTRFS info (device sde1): continuing dev_replace from <missing disk> (devid 1) to target /dev/sde1 @74%
[bluemond@BlueQ ~]$ sudo btrfs replace status  -1 /mnt/btrfsrepair/
74.9% done, 0 write errs, 0 uncorr. read errs

I guess I just wait?
Yep, wait and stay alert, better to also keep an eye on the dmesg.

But this also means, previous replace didn't really finish, which may
mean the replace ioctl is not reporting the proper status, and can be a
possible bug.

Thanks,
Qu
quoted
Oct 19, 2021 13:37:09 Qu Wenruo [off-list ref]:
quoted

On 2021/10/19 18:49, Emil Heimpel wrote:
quoted
*failed*...
quoted
And ATA commands failure.

I don't believe the replace finished without problem, and the involved
device is /dev/sdd.
quoted
You won't want to see this message at all.

This means, you're running RAID56, as btrfs has write-hole problem,
which will degrade the robust of RAID56 byte by byte for each unclean
shutdown.

I guess the write hole problem has already make the repair failed for
the replace.

Thus after a successful mount, scrub and manually file checking is
almost a must.
quoted
This doesn't sound correct.

If a device is properly replaced, it should have the same devid number.

I guess you have tried to add a new device before, and then tried to
replace the missing device, right?


Anyway, have you tried to mount it degraded and then remove the missing
device?

Since you're using RAID56, I guess degrade mount should work.

Thanks,
Qu
quoted
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help