Thread (11 messages), 2 authors, 2021-10-26

Re: Errors after successful disk replace

From: Emil Heimpel <hidden>
Date: 2021-10-19 12:16:38

Color me surprised:


[74713.072745] BTRFS info (device sde1): flagging fs with big metadata feature
[74713.072755] BTRFS info (device sde1): allowing degraded mounts
[74713.072758] BTRFS info (device sde1): using free space tree
[74713.072760] BTRFS info (device sde1): has skinny extents
[74713.104297] BTRFS warning (device sde1): devid 1 uuid 51645efd-bf95-458d-b5ae-b31623533abb is missing
[74714.675001] BTRFS info (device sde1): bdev (efault) errs: wr 52950, rd 8161, flush 0, corrupt 1221, gen 0
[74714.675015] BTRFS info (device sde1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 228, gen 0
[74714.675025] BTRFS info (device sde1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 140, gen 0
[74751.033383] BTRFS info (device sde1): continuing dev_replace from <missing disk> (devid 1) to target /dev/sde1 @74%
[bluemond@BlueQ ~]$ sudo btrfs replace status  -1 /mnt/btrfsrepair/
74.9% done, 0 write errs, 0 uncorr. read errs

I guess I just wait?
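
For readers comparing the per-device counters in the mount log above, here is a minimal sketch that pulls them out of the "bdev ... errs:" lines. The helper name parse_dev_errs and the regular expression are illustrative assumptions based only on the line format shown in this thread; the same counters should also be readable with "btrfs device stats" on a mounted filesystem.

```python
import re

# Matches the per-device error counters printed at mount time,
# in the format seen in this thread's dmesg output.
ERRS_RE = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

def parse_dev_errs(lines):
    """Return {device: {counter: value}} for every matching log line."""
    stats = {}
    for line in lines:
        m = ERRS_RE.search(line)
        if m:
            fields = m.groupdict()
            dev = fields.pop("dev")
            stats[dev] = {name: int(value) for name, value in fields.items()}
    return stats

# Sample input copied from the log above.
log = [
    "[74714.675001] BTRFS info (device sde1): bdev (efault) errs: wr 52950, rd 8161, flush 0, corrupt 1221, gen 0",
    "[74714.675015] BTRFS info (device sde1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 228, gen 0",
    "[74714.675025] BTRFS info (device sde1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 140, gen 0",
]

stats = parse_dev_errs(log)
for dev, counters in stats.items():
    print(dev, counters)
```

The device printed as "(efault)" is kept verbatim as a key, since that is how the kernel named the missing disk in this log.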

Oct 19, 2021 13:37:09 Qu Wenruo [off-list ref]:

On 2021/10/19 18:49, Emil Heimpel wrote:
quoted
Oct 19, 2021 07:35:54 Qu Wenruo [off-list ref]:
quoted

On 2021/10/19 11:54, Emil Heimpel wrote:
quoted
Hi all,

One of the drives in my btrfs raid5 array failed (it was completely dead), so I installed an identical replacement drive. The dead drive was devid 1 and the new drive is /dev/sde. I used the following to replace the missing drive:

sudo btrfs replace start -B 1 /dev/sde1 /mnt/btrfsrepair/

and it completed successfully without any reported errors (took around 2 weeks though...).

I then tried to see my array with filesystem show, but it hung (or took longer than I wanted to wait), so I did a reboot.
Any dmesg from that time?
Nothing after the replace finished:

1634463961.245751 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663044222976 for dev (efault)
1634463961.255819 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663045795840 for dev (efault)
1634463961.275815 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663046582272 for dev (efault)
1634463961.275922 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663047368704 for dev (efault)
1634463961.339074 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663048155136 for dev (efault)
1634463961.339248 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663048941568 for dev (efault)
*failed*...
quoted
1634475910.611261 BlueQ kernel: sd 9:0:2:0: attempting task abort!scmd(0x0000000046fead3f), outstanding for 7120 ms & timeout 7000 ms
1634475910.615126 BlueQ kernel: sd 9:0:2:0: [sdd] tag#840 CDB: ATA command pass through(16) 85 08 2e 00 00 00 01 00 00 00 00 00 00 00 ec 00
1634475910.615429 BlueQ kernel: scsi target9:0:2: handle(0x000b), sas_address(0x4433221105000000), phy(5)
1634475910.615691 BlueQ kernel: scsi target9:0:2: enclosure logical id(0x590b11c022f3fb00), slot(6)
And ATA command failures.

I don't believe the replace finished without problems, and the device
involved is /dev/sdd.
quoted
1634475910.787911 BlueQ kernel: sd 9:0:2:0: task abort: SUCCESS scmd(0x0000000046fead3f)
1634475910.807083 BlueQ kernel: sd 9:0:2:0: Power-on or device reset occurred
1634475949.877998 BlueQ kernel: sd 9:0:2:0: Power-on or device reset occurred
1634525944.213931 BlueQ kernel: perf: interrupt took too long (3138 > 3137), lowering kernel.perf_event_max_sample_rate to 63600
1634533791.168760 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 22996545634304 for dev (efault)
1634552685.203559 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 23816815706112 for dev (efault)
You won't want to see this message at all.

This means you're running RAID56, and btrfs RAID56 has the write-hole
problem, which degrades the robustness of RAID56 byte by byte with each
unclean shutdown.

I guess the write-hole problem has already made the rebuild fail during
the replace.

Thus, after a successful mount, a scrub and manual file checking are
almost a must.
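
As a starting point for the manual file checking suggested above, here is a minimal sketch that collects the logical addresses from the "failed to rebuild valid logical" messages. The helper name failed_logicals is hypothetical and the pattern is taken only from the log lines in this thread. On a mounted filesystem, "btrfs inspect-internal logical-resolve <logical> <path>" can then be tried on each address to see which files, if any, reference it.

```python
import re

# Pattern for the rebuild failures as they appear in this thread's journal.
REBUILD_RE = re.compile(r"failed to rebuild valid logical (\d+) for dev")

def failed_logicals(lines):
    """Return the distinct logical addresses, in order of first appearance."""
    seen = []
    for line in lines:
        m = REBUILD_RE.search(line)
        if m:
            addr = int(m.group(1))
            if addr not in seen:
                seen.append(addr)
    return seen

# Sample lines copied from the journal output quoted in this thread.
log = [
    "1634463961.245751 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663044222976 for dev (efault)",
    "1634533791.168760 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 22996545634304 for dev (efault)",
    "1634552685.203559 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 23816815706112 for dev (efault)",
]

addrs = failed_logicals(log)
for addr in addrs:
    print(addr)
```

In practice the input would be the full kernel log for the duration of the replace rather than three sample lines.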
quoted
1634558977.979621 BlueQ kernel: BTRFS info (device sdb1): dev_replace from <missing disk> (devid 1) to /dev/sde1 finished
1634560793.132731 BlueQ kernel: zram0: detected capacity change from 32610864 to 0
1634560793.169379 BlueQ kernel: zram: Removed device: zram0
1634560883.549481 BlueQ kernel: watchdog: watchdog0: watchdog did not stop!
1634560883.556038 BlueQ systemd-shutdown[1]: Syncing filesystems and block devices.
1634560883.572840 BlueQ systemd-shutdown[1]: Sending SIGTERM to remaining processes...



quoted
quoted
It showed up after a reboot as follows:

Label: 'BlueButter'  uuid: 7e3378e6-da46-4a60-b9b8-1bcc306986e3
        Total devices 6 FS bytes used 20.96TiB
        devid    0 size 7.28TiB used 5.46TiB path /dev/sde1
        devid    2 size 7.28TiB used 5.46TiB path /dev/sdb1
        devid    3 size 2.73TiB used 2.73TiB path /dev/sdg1
        devid    4 size 2.73TiB used 2.73TiB path /dev/sdd1
        devid    5 size 7.28TiB used 4.81TiB path /dev/sdf1
        devid    6 size 7.28TiB used 5.33TiB path /dev/sdc1

I then tried to mount it, but it failed, so I ran a read-only check and it reported the following problem:
And dmesg for the failed mount?
Oops, I must have missed that it failed because of missing devid 1 too...

1634562944.145383 BlueQ kernel: BTRFS info (device sde1): flagging fs with big metadata feature
1634562944.145529 BlueQ kernel: BTRFS info (device sde1): force zstd compression, level 2
1634562944.145650 BlueQ kernel: BTRFS info (device sde1): using free space tree
1634562944.145697 BlueQ kernel: BTRFS info (device sde1): has skinny extents
1634562944.148709 BlueQ kernel: BTRFS error (device sde1): devid 1 uuid 51645efd-bf95-458d-b5ae-b31623533abb is missing
1634562944.148764 BlueQ kernel: BTRFS error (device sde1): failed to read chunk tree: -2
1634562944.185369 BlueQ kernel: BTRFS error (device sde1): open_ctree failed
This doesn't sound correct.

If a device is properly replaced, it should have the same devid number.

I guess you have tried to add a new device before, and then tried to
replace the missing device, right?


Anyway, have you tried to mount it degraded and then remove the missing
device?

Since you're using RAID56, I guess a degraded mount should work.

Thanks,
Qu
quoted
quoted
Thanks,
Qu
quoted
[...]
[2/7] checking extents
ERROR: super total bytes 38007432437760 smaller than real device(s) size 46008994590720
ERROR: mounting this fs may fail for newer kernels
ERROR: this can be fixed by 'btrfs rescue fix-device-size'
[3/7] checking free space tree
[...]

So I followed that advice but got the following error:

sudo btrfs rescue fix-device-size /dev/sde1
ERROR: devid 1 is missing or not writeable
ERROR: fixing device size needs all device(s) to be present and writeable

So it seems something went wrong or didn't complete fully.
What can I do to fix this problem?
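
The two byte counts in the check output can be compared with a little arithmetic. This is only a sketch of that calculation, using the numbers from the error message; the variable names are illustrative, and the interpretation is an observation about the numbers, not a confirmed diagnosis.

```python
TIB = 2**40  # bytes per TiB, the unit used by "btrfs filesystem show"

# Values copied from the "btrfs check" error message above.
super_total = 38007432437760    # total bytes recorded in the superblock
real_devices = 46008994590720   # summed device sizes as seen by the check

diff = real_devices - super_total

print("super total:  %.2f TiB" % (super_total / TIB))
print("real devices: %.2f TiB" % (real_devices / TIB))
print("difference:   %d bytes (%.2f TiB)" % (diff, diff / TIB))
```

The difference is about the size of one of the 7.28TiB devices in the listing, which would be consistent with both the missing devid 1 and the replace target (devid 0) being counted, although the output alone does not prove that.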

uname -a
Linux BlueQ 5.14.12-arch1-1 #1 SMP PREEMPT Wed, 13 Oct 2021 16:58:16 +0000 x86_64 GNU/Linux

btrfs --version
btrfs-progs v5.14.2

Regards,
Emil

P.S.: Yes, I know, raid5 isn't stable, but it works well enough for me ;)
Metadata is raid1 btw...