Thread (11 messages), 2 authors, 2021-10-26

Re: Errors after successful disk replace

From: Qu Wenruo <hidden>
Date: 2021-10-19 12:20:24


On 2021/10/19 20:16, Emil Heimpel wrote:
Color me surprised:


[74713.072745] BTRFS info (device sde1): flagging fs with big metadata feature
[74713.072755] BTRFS info (device sde1): allowing degraded mounts
[74713.072758] BTRFS info (device sde1): using free space tree
[74713.072760] BTRFS info (device sde1): has skinny extents
[74713.104297] BTRFS warning (device sde1): devid 1 uuid 51645efd-bf95-458d-b5ae-b31623533abb is missing
[74714.675001] BTRFS info (device sde1): bdev (efault) errs: wr 52950, rd 8161, flush 0, corrupt 1221, gen 0
[74714.675015] BTRFS info (device sde1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 228, gen 0
[74714.675025] BTRFS info (device sde1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 140, gen 0
[74751.033383] BTRFS info (device sde1): continuing dev_replace from <missing disk> (devid 1) to target /dev/sde1 @74%
[bluemond@BlueQ ~]$ sudo btrfs replace status  -1 /mnt/btrfsrepair/
74.9% done, 0 write errs, 0 uncorr. read errs

I guess I just wait?
Yep, wait and stay alert. It's also better to keep an eye on dmesg.

But this also means the previous replace didn't really finish, which may
mean the replace ioctl is not reporting the proper status. That could be
a bug.
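
A minimal sketch of how one might watch the replace progress and dmesg together, assuming the mount point /mnt/btrfsrepair/ from this thread. The helper names replace_progress and watch_replace are made up for illustration and are not part of btrfs-progs; the 60-second interval is arbitrary.

```sh
#!/bin/sh
# Hypothetical helpers, not part of btrfs-progs.

# Extract the percentage from a "btrfs replace status -1" line such as
# "74.9% done, 0 write errs, 0 uncorr. read errs" read on stdin.
replace_progress() {
    sed -n 's/^\([0-9][0-9.]*\)% done.*/\1/p'
}

# Print the progress and the latest BTRFS kernel messages once a minute.
# The loop ends as soon as the status line no longer reports a percentage
# (for example once the replace has finished).
watch_replace() {
    mnt=${1:-/mnt/btrfsrepair/}
    while pct=$(sudo btrfs replace status -1 "$mnt" | replace_progress) && [ -n "$pct" ]; do
        echo "replace at ${pct}%"
        sudo dmesg | grep 'BTRFS' | tail -n 5
        sleep 60
    done
}
```

Running watch_replace /mnt/btrfsrepair/ in a spare terminal would then show new "failed to rebuild" or SCSI errors close to the time they happen, rather than only after the replace claims to be done.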

Thanks,
Qu
Oct 19, 2021 13:37:09 Qu Wenruo [off-list ref]:

On 2021/10/19 18:49, Emil Heimpel wrote:
Oct 19, 2021 07:35:54 Qu Wenruo [off-list ref]:

On 2021/10/19 11:54, Emil Heimpel wrote:
Hi all,

One of my drives of a raid 5 btrfs array failed (was dead completely) so I installed an identical replacement drive. The dead drive was devid 1 and the new drive /dev/sde. I used the following to replace the missing drive:

sudo btrfs replace start -B 1 /dev/sde1 /mnt/btrfsrepair/

and it completed successfully without any reported errors (took around 2 weeks though...).

I then tried to see my array with filesystem show, but it hung (or took longer than I wanted to wait), so I did a reboot.
Any dmesg from that time?
Nothing after the replace finished:

1634463961.245751 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663044222976 for dev (efault)
1634463961.255819 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663045795840 for dev (efault)
1634463961.275815 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663046582272 for dev (efault)
1634463961.275922 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663047368704 for dev (efault)
1634463961.339074 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663048155136 for dev (efault)
1634463961.339248 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663048941568 for dev (efault)
*failed*...
1634475910.611261 BlueQ kernel: sd 9:0:2:0: attempting task abort!scmd(0x0000000046fead3f), outstanding for 7120 ms & timeout 7000 ms
1634475910.615126 BlueQ kernel: sd 9:0:2:0: [sdd] tag#840 CDB: ATA command pass through(16) 85 08 2e 00 00 00 01 00 00 00 00 00 00 00 ec 00
1634475910.615429 BlueQ kernel: scsi target9:0:2: handle(0x000b), sas_address(0x4433221105000000), phy(5)
1634475910.615691 BlueQ kernel: scsi target9:0:2: enclosure logical id(0x590b11c022f3fb00), slot(6)
And ATA command failures.

I don't believe the replace finished without problems, and the involved
device is /dev/sdd.
1634475910.787911 BlueQ kernel: sd 9:0:2:0: task abort: SUCCESS scmd(0x0000000046fead3f)
1634475910.807083 BlueQ kernel: sd 9:0:2:0: Power-on or device reset occurred
1634475949.877998 BlueQ kernel: sd 9:0:2:0: Power-on or device reset occurred
1634525944.213931 BlueQ kernel: perf: interrupt took too long (3138 > 3137), lowering kernel.perf_event_max_sample_rate to 63600
1634533791.168760 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 22996545634304 for dev (efault)
1634552685.203559 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 23816815706112 for dev (efault)
You don't want to see this message at all.

This means you're running RAID56. Btrfs RAID56 has the write-hole
problem, which degrades the robustness of RAID56 byte by byte with each
unclean shutdown.

I guess the write-hole problem has already made the repair fail during
the replace.

Thus after a successful mount, a scrub and manual file checking are
almost a must.
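
A rough sketch of how the affected addresses could be collected for that manual check, assuming the kernel messages have the same wording as the ones quoted above. The helper name failed_logicals is hypothetical; the scrub and logical-resolve invocations in the comments use the mount point from this thread, and logical-resolve may not find a file for every address (for example when the address is not a data extent).

```sh
#!/bin/sh
# Hypothetical helper: print each distinct logical address found in
# "failed to rebuild valid logical <N>" kernel messages read on stdin,
# sorted numerically.
failed_logicals() {
    sed -n 's/.*failed to rebuild valid logical \([0-9][0-9]*\) for dev.*/\1/p' | sort -n -u
}

# Possible usage after a successful mount (paths taken from the thread):
#   sudo btrfs scrub start -B -d /mnt/btrfsrepair/
#   sudo dmesg | failed_logicals | while read -r logical; do
#       sudo btrfs inspect-internal logical-resolve "$logical" /mnt/btrfsrepair/
#   done
```

The files that resolve this way are the first candidates to compare against a backup or a known checksum.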
1634558977.979621 BlueQ kernel: BTRFS info (device sdb1): dev_replace from <missing disk> (devid 1) to /dev/sde1 finished
1634560793.132731 BlueQ kernel: zram0: detected capacity change from 32610864 to 0
1634560793.169379 BlueQ kernel: zram: Removed device: zram0
1634560883.549481 BlueQ kernel: watchdog: watchdog0: watchdog did not stop!
1634560883.556038 BlueQ systemd-shutdown[1]: Syncing filesystems and block devices.
1634560883.572840 BlueQ systemd-shutdown[1]: Sending SIGTERM to remaining processes...



It showed up after a reboot as follows:

Label: 'BlueButter'  uuid: 7e3378e6-da46-4a60-b9b8-1bcc306986e3
         Total devices 6 FS bytes used 20.96TiB
         devid    0 size 7.28TiB used 5.46TiB path /dev/sde1
         devid    2 size 7.28TiB used 5.46TiB path /dev/sdb1
         devid    3 size 2.73TiB used 2.73TiB path /dev/sdg1
         devid    4 size 2.73TiB used 2.73TiB path /dev/sdd1
         devid    5 size 7.28TiB used 4.81TiB path /dev/sdf1
         devid    6 size 7.28TiB used 5.33TiB path /dev/sdc1

I then tried to mount it, but it failed, so I ran a read-only check and it reported the following problem:
And dmesg for the failed mount?
Oops, I must have missed that it failed because of missing devid 1 too...

1634562944.145383 BlueQ kernel: BTRFS info (device sde1): flagging fs with big metadata feature
1634562944.145529 BlueQ kernel: BTRFS info (device sde1): force zstd compression, level 2
1634562944.145650 BlueQ kernel: BTRFS info (device sde1): using free space tree
1634562944.145697 BlueQ kernel: BTRFS info (device sde1): has skinny extents
1634562944.148709 BlueQ kernel: BTRFS error (device sde1): devid 1 uuid 51645efd-bf95-458d-b5ae-b31623533abb is missing
1634562944.148764 BlueQ kernel: BTRFS error (device sde1): failed to read chunk tree: -2
1634562944.185369 BlueQ kernel: BTRFS error (device sde1): open_ctree failed
This doesn't sound correct.

If a device is properly replaced, it should have the same devid number.

I guess you have tried to add a new device before, and then tried to
replace the missing device, right?
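
One way to check this from the "btrfs filesystem show" output is sketched here; has_devid is a made-up helper name, not a btrfs-progs command. For the listing quoted above it would report that devid 1 is absent while /dev/sde1 sits at devid 0, which, as far as I know, is the id the kernel reserves for a replace target while the replace is still in progress.

```sh
#!/bin/sh
# Hypothetical helper: succeed if the "btrfs filesystem show" output on
# stdin contains a device line with the given devid, fail otherwise.
has_devid() {
    awk -v want="$1" '$1 == "devid" && $2 == want { found = 1 } END { exit !found }'
}

# Possible usage (label taken from the thread):
#   sudo btrfs filesystem show BlueButter | has_devid 1 \
#       || echo "devid 1 still missing, replace probably not finished"
```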


Anyway, have you tried to mount it degraded and then remove the missing
device?

Since you're using RAID56, I guess a degraded mount should work.
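
The two steps could look roughly like this, assuming the device and mount point used elsewhere in this thread. The run wrapper and the function name degraded_remove_missing are hypothetical; by default the sketch only prints the commands, so nothing touches the array until DRY_RUN=0 is set deliberately.

```sh
#!/bin/sh
# Hypothetical dry-run wrapper: print the command unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Mount the filesystem degraded, then ask btrfs to drop the missing device.
# $1 is any present member device, $2 is the mount point.
degraded_remove_missing() {
    dev=$1
    mnt=$2
    run sudo mount -o degraded "$dev" "$mnt" &&
    run sudo btrfs device remove missing "$mnt"
}
```

Calling degraded_remove_missing /dev/sde1 /mnt/btrfsrepair/ prints the two commands for review. Removing the missing device makes btrfs rebuild its data onto the remaining devices, so it is worth watching dmesg while it runs.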

Thanks,
Qu
Thanks,
Qu
[...]
[2/7] checking extents
ERROR: super total bytes 38007432437760 smaller than real device(s) size 46008994590720
ERROR: mounting this fs may fail for newer kernels
ERROR: this can be fixed by 'btrfs rescue fix-device-size'
[3/7] checking free space tree
[...]

So I followed that advice but got the following error:

sudo btrfs rescue fix-device-size /dev/sde1
ERROR: devid 1 is missing or not writeable
ERROR: fixing device size needs all device(s) to be present and writeable

So it seems something went wrong or didn't complete fully.
What can I do to fix this problem?

uname -a
Linux BlueQ 5.14.12-arch1-1 #1 SMP PREEMPT Wed, 13 Oct 2021 16:58:16 +0000 x86_64 GNU/Linux

btrfs --version
btrfs-progs v5.14.2

Regards,
Emil

P.S.: Yes, I know, raid5 isn't stable but it works well enough for me ;)
Metadata is raid1 btw...