Re: Errors after successful disk replace
From: Emil Heimpel <hidden>
Date: 2021-10-19 12:16:38
Color me surprised:

[74713.072745] BTRFS info (device sde1): flagging fs with big metadata feature
[74713.072755] BTRFS info (device sde1): allowing degraded mounts
[74713.072758] BTRFS info (device sde1): using free space tree
[74713.072760] BTRFS info (device sde1): has skinny extents
[74713.104297] BTRFS warning (device sde1): devid 1 uuid 51645efd-bf95-458d-b5ae-b31623533abb is missing
[74714.675001] BTRFS info (device sde1): bdev (efault) errs: wr 52950, rd 8161, flush 0, corrupt 1221, gen 0
[74714.675015] BTRFS info (device sde1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 228, gen 0
[74714.675025] BTRFS info (device sde1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 140, gen 0
[74751.033383] BTRFS info (device sde1): continuing dev_replace from <missing disk> (devid 1) to target /dev/sde1 @74%

[bluemond@BlueQ ~]$ sudo btrfs replace status -1 /mnt/btrfsrepair/
74.9% done, 0 write errs, 0 uncorr. read errs

I guess I just wait?

Oct 19, 2021 13:37:09 Qu Wenruo [off-list ref]:
On 2021/10/19 18:49, Emil Heimpel wrote:
Oct 19, 2021 07:35:54 Qu Wenruo [off-list ref]:
On 2021/10/19 11:54, Emil Heimpel wrote:
Hi all,

One of the drives in my raid 5 btrfs array failed (it was completely dead), so I installed an identical replacement drive. The dead drive was devid 1 and the new drive is /dev/sde. I used the following to replace the missing drive:

sudo btrfs replace start -B 1 /dev/sde1 /mnt/btrfsrepair/

and it completed successfully without any reported errors (took around 2 weeks though...). I then tried to see my array with filesystem show, but it hung (or took longer than I wanted to wait), so I did a reboot.

Any dmesg of that time?

Nothing after the replace finished:

1634463961.245751 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663044222976 for dev (efault)
1634463961.255819 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663045795840 for dev (efault)
1634463961.275815 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663046582272 for dev (efault)
1634463961.275922 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663047368704 for dev (efault)
1634463961.339074 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663048155136 for dev (efault)
1634463961.339248 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663048941568 for dev (efault)

*failed*...
1634475910.611261 BlueQ kernel: sd 9:0:2:0: attempting task abort!scmd(0x0000000046fead3f), outstanding for 7120 ms & timeout 7000 ms
1634475910.615126 BlueQ kernel: sd 9:0:2:0: [sdd] tag#840 CDB: ATA command pass through(16) 85 08 2e 00 00 00 01 00 00 00 00 00 00 00 ec 00
1634475910.615429 BlueQ kernel: scsi target9:0:2: handle(0x000b), sas_address(0x4433221105000000), phy(5)
1634475910.615691 BlueQ kernel: scsi target9:0:2: enclosure logical id(0x590b11c022f3fb00), slot(6)

And ATA command failures. I don't believe the replace finished without problems, and the device involved is /dev/sdd.
1634475910.787911 BlueQ kernel: sd 9:0:2:0: task abort: SUCCESS scmd(0x0000000046fead3f)
1634475910.807083 BlueQ kernel: sd 9:0:2:0: Power-on or device reset occurred
1634475949.877998 BlueQ kernel: sd 9:0:2:0: Power-on or device reset occurred
1634525944.213931 BlueQ kernel: perf: interrupt took too long (3138 > 3137), lowering kernel.perf_event_max_sample_rate to 63600
1634533791.168760 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 22996545634304 for dev (efault)
1634552685.203559 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 23816815706112 for dev (efault)

You don't want to see this message at all. It means you're running RAID56, and btrfs RAID56 has the write-hole problem, which degrades the robustness of RAID56 byte by byte with each unclean shutdown.

I guess the write-hole problem has already made the repair fail during the replace.

Thus, after a successful mount, a scrub and manual file checking are almost a must.
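
A minimal sketch of the follow-up checks suggested above, assuming the filesystem is mounted at /mnt/btrfsrepair as elsewhere in this thread. The `run` helper and the `DRY_RUN` switch are hypothetical additions for illustration: by default the script only prints the commands. Manual file checking is not covered here.

```shell
#!/bin/sh
# Sketch only. MNT is the mount point used in this thread; DRY_RUN and run()
# are illustrative helpers, not part of btrfs-progs.
MNT="${MNT:-/mnt/btrfsrepair}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

post_replace_checks() {
    # Scrub in the foreground (-B) and print per-device statistics (-d).
    run btrfs scrub start -B -d "$MNT"
    # Show the per-device error counters (write, read, flush, corruption, generation).
    run btrfs device stats "$MNT"
}

post_replace_checks
```

Run it with `DRY_RUN=0` as root to actually execute the commands. A scrub on an array of this size can take a long time.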
1634558977.979621 BlueQ kernel: BTRFS info (device sdb1): dev_replace from <missing disk> (devid 1) to /dev/sde1 finished
1634560793.132731 BlueQ kernel: zram0: detected capacity change from 32610864 to 0
1634560793.169379 BlueQ kernel: zram: Removed device: zram0
1634560883.549481 BlueQ kernel: watchdog: watchdog0: watchdog did not stop!
1634560883.556038 BlueQ systemd-shutdown[1]: Syncing filesystems and block devices.
1634560883.572840 BlueQ systemd-shutdown[1]: Sending SIGTERM to remaining processes...
It showed up after a reboot as follows:

Label: 'BlueButter'  uuid: 7e3378e6-da46-4a60-b9b8-1bcc306986e3
    Total devices 6 FS bytes used 20.96TiB
    devid 0 size 7.28TiB used 5.46TiB path /dev/sde1
    devid 2 size 7.28TiB used 5.46TiB path /dev/sdb1
    devid 3 size 2.73TiB used 2.73TiB path /dev/sdg1
    devid 4 size 2.73TiB used 2.73TiB path /dev/sdd1
    devid 5 size 7.28TiB used 4.81TiB path /dev/sdf1
    devid 6 size 7.28TiB used 5.33TiB path /dev/sdc1

I then tried to mount it, but it failed, so I ran a read-only check and it reported the following problem:

And dmesg for the failed mount?

Oops, I must have missed that it failed because of missing devid 1 too...

1634562944.145383 BlueQ kernel: BTRFS info (device sde1): flagging fs with big metadata feature
1634562944.145529 BlueQ kernel: BTRFS info (device sde1): force zstd compression, level 2
1634562944.145650 BlueQ kernel: BTRFS info (device sde1): using free space tree
1634562944.145697 BlueQ kernel: BTRFS info (device sde1): has skinny extents
1634562944.148709 BlueQ kernel: BTRFS error (device sde1): devid 1 uuid 51645efd-bf95-458d-b5ae-b31623533abb is missing
1634562944.148764 BlueQ kernel: BTRFS error (device sde1): failed to read chunk tree: -2
1634562944.185369 BlueQ kernel: BTRFS error (device sde1): open_ctree failed

This doesn't sound correct. If a device is properly replaced, it should have the same devid number.

I guess you have tried to add a new device before, and then tried to replace the missing device, right?

Anyway, have you tried to mount it degraded and then remove the missing device? Since you're using RAID56, I guess a degraded mount should work.

Thanks,
Qu
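
A minimal sketch of the degraded-mount suggestion above, assuming the device and mount point named in this thread (/dev/sde1 and /mnt/btrfsrepair). The `run` helper and `DRY_RUN` switch are hypothetical and only print the commands by default. The replace status is checked first because the kernel log at the top of this thread shows an unfinished dev_replace resuming on a degraded mount; removing the missing device is kept as a separate step for the case where no replace is in progress.

```shell
#!/bin/sh
# Sketch only. DEV and MNT are taken from this thread; DRY_RUN and run()
# are illustrative helpers, not part of btrfs-progs.
DEV="${DEV:-/dev/sde1}"
MNT="${MNT:-/mnt/btrfsrepair}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

mount_degraded() {
    # Allow the mount to proceed although devid 1 is missing.
    run mount -o degraded "$DEV" "$MNT"
    # Print the replace status once (-1) to see whether a replace is still running.
    run btrfs replace status -1 "$MNT"
}

remove_missing() {
    # Only relevant when no replace is in progress: drop the missing device.
    run btrfs device remove missing "$MNT"
}

mount_degraded
```

Call `remove_missing` by hand only after `btrfs replace status` reports that no replace is running.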
Thanks,
Qu
[...]
[2/7] checking extents
ERROR: super total bytes 38007432437760 smaller than real device(s) size 46008994590720
ERROR: mounting this fs may fail for newer kernels
ERROR: this can be fixed by 'btrfs rescue fix-device-size'
[3/7] checking free space tree
[...]

So I followed that advice but got the following error:

sudo btrfs rescue fix-device-size /dev/sde1
ERROR: devid 1 is missing or not writeable
ERROR: fixing device size needs all device(s) to be present and writeable

So it seems something went wrong or didn't complete fully. What can I do to fix this problem?

uname -a
Linux BlueQ 5.14.12-arch1-1 #1 SMP PREEMPT Wed, 13 Oct 2021 16:58:16 +0000 x86_64 GNU/Linux

btrfs --version
btrfs-progs v5.14.2

Regards,
Emil

P.S.: Yes, I know, raid5 isn't stable but it works well enough for me ;) Metadata is raid1 btw...
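
As a side note on the two byte counts in the check output above, here is a small worked calculation of the gap between them. The variable names are illustrative only. The result matches the 7.28TiB size listed for the larger drives in the filesystem show output, which would be consistent with one extra device of that size being counted (for example, both the missing devid 1 and the replace target devid 0). That reading is an assumption and is not confirmed anywhere in this thread.

```shell
#!/bin/sh
# Byte counts copied from the "checking extents" errors in this thread.
super_total=38007432437760
device_total=46008994590720

# Gap between what the superblock records and what the devices add up to.
diff=$((device_total - super_total))
echo "difference: $diff bytes"   # prints "difference: 8001562152960 bytes"

# Convert the gap to TiB (1 TiB = 1024^4 bytes), rounded to two decimals.
tib=$(awk -v b="$diff" 'BEGIN { printf "%.2f", b / (1024 ^ 4) }')
echo "difference: $tib TiB"      # prints "difference: 7.28 TiB"
```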