Re: Help recovering filesystem (if possible)
From: Matthew Dawson <hidden>
Date: 2021-11-18 02:57:48
On Monday, November 15, 2021 5:46:43 A.M. EST Kai Krakow wrote:
> On Mon, 15 Nov 2021 at 02:55, Matthew Dawson [off-list ref] wrote:
> > I recently upgraded one of my machines to the 5.15.2 kernel. On the first reboot, I had a kernel fault during the initialization (I didn't get to capture the printed stack trace, but I'm 99% sure it did not have BTRFS related calls). I then rebooted the machine back to a 5.14 kernel, but the BCache (writeback) cache was corrupted. I then force started the underlying disks, but now my BTRFS filesystem will no longer mount. I realize there may be missing/corrupted data, but I would ideally like to get any data I can off the disks.
>
> I had a similar issue lately where the system didn't reboot cleanly (there's some issue in the BIOS or with the SSD firmware where it would disconnect the SSD from SATA a few seconds after boot, forcing bcache into detaching dirty caches). Since you are seeing transaction IDs lagging behind expectations, I think you've lost dirty writeback data from bcache. To avoid this in the future, you should use bcache only in writearound or writethrough mode.
Considering I started the bcache devices without the cache, I don't doubt I've lost writeback data and I have no doubts there will be issues. At this point I'm just in data recovery, trying to get what I can.
> > This system involves 10 8TB disks; some are doing BCache -> LUKS -> BTRFS, some are doing LUKS -> BTRFS.
>
> No LUKS here, and all my btrfs pool members are attached to a single SSD as caching frontend.
> > When I try to mount the filesystem, I get the following in dmesg:
> > [117632.798339] BTRFS info (device dm-0): flagging fs with big metadata feature
> > [117632.798344] BTRFS info (device dm-0): disk space caching is enabled
> > [117632.798346] BTRFS info (device dm-0): has skinny extents
> > [117632.873186] BTRFS error (device dm-0): parent transid verify failed on 132806584614912 wanted 3240123 found 3240119
>
> I had luck with the following steps:
> * ensure that all members are attached to bcache as they should
> * ensure bcache is running in writearound mode for each member
> * ensure that btrfs did scan for all members
>
> Next, I started `btrfs check` for each member disk, eventually one would contain the needed disk structures and only showed a few errors. I was then able to mount btrfs through that device node, open ctree didn't fail this time. I don't remember if I used "usebackuproot" for mount or a similar switch for "btrfs check". I then ran `btrfs scrub` which fixed the broken metadata. Luckily, I had only metadata corruption on the disks which had dirty writeback cleared, and metadata runs in RAID-1 mode for me. "btrfs check" then didn't find any errors. Reboot worked fine.
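For comparing members, the wanted/found transids can be pulled out of the "parent transid verify failed" line quoted above with a few lines of Python. This is only a rough sketch: the regular expression is an assumption based on that single dmesg line, and the name `transid_gap` is made up for illustration.

```python
import re

# Pattern derived from the one error line quoted in this thread; it is an
# assumption, not a documented message format.
TRANSID_RE = re.compile(
    r"parent transid verify failed on (\d+) wanted (\d+) found (\d+)"
)

def transid_gap(line):
    """Return (bytenr, wanted, found, gap), or None if the line does not match."""
    m = TRANSID_RE.search(line)
    if m is None:
        return None
    bytenr, wanted, found = (int(g) for g in m.groups())
    # gap = how many transactions the block on disk is behind what was expected
    return bytenr, wanted, found, wanted - found

line = ("[117632.873186] BTRFS error (device dm-0): parent transid verify "
        "failed on 132806584614912 wanted 3240123 found 3240119")
print(transid_gap(line))  # (132806584614912, 3240123, 3240119, 4)
```

For the line above the gap is 4 transactions. Running the same thing over the errors from each member would show whether any device is closer to the expected transid than the others.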
Thanks for the suggestion. Unfortunately, all my disks report basically the same errors, so I wasn't able to recover my system this way.
[...]
> > Is there any hope of recovering this data? Or should I give up on it at this point and reformat? Most of the data is backed up (or are backups themselves), but I'd like to get what I can.
>
> Well, I'm doing daily backups with borg - to a different technology (no btrfs, no bcache, different system). I don't think backing up btrfs to btrfs is a brilliant idea, especially not when both are mounted to the same system.
I'm not quite that redundant, but the backups of things I really care about are actually to an off-site system. But accessing data through a backup can be painful compared to hopefully just getting it out. Also the local backups on the system would be nice to have, for historical purposes.
> You may try my steps above. If you've found a member device which shows fewer errors, you COULD try to repair it if mount still fails (or try one of the recovery mount options). But you may want to ask the experts again here.
I did try, thanks. Unfortunately, as noted above, it wasn't helpful. Hopefully someone has a different idea? I am posting here because I feel any further progress is going to require more dangerous options, and those usually say to ask the mailing list first.
> Depending on how much dirty writeback you've lost in bcache, chances may be good that one of the members has enough metadata to successfully mount or repair the filesystem. Or at least, it's a good start for "btrfs restore" then.
>
> What do we learn from this?
> * probably do not use bcache in writeback mode if you can avoid it
> * switch bcache to writearound mode before kernel upgrades, wait for writeback to finish
> * success mounting btrfs may depend a lot on which member device you actually mount
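The "switch to writearound and wait for writeback to finish" check could be scripted before a kernel upgrade. Below is a minimal read-only sketch, assuming the sysfs layout described in the kernel's bcache documentation (`/sys/block/bcacheN/bcache/cache_mode` with the active mode in brackets, and `state` reading `clean`, `dirty`, `no cache` or `inconsistent`). The function names and the `sysfs_root` parameter are hypothetical, and it has not been run against a real bcache setup.

```python
from pathlib import Path

def bcache_status(sysfs_root="/sys"):
    """Map each bcache device name to (active cache mode, state) from sysfs."""
    status = {}
    for dev in sorted(Path(sysfs_root, "block").glob("bcache*")):
        attrs = dev / "bcache"
        # cache_mode lists every mode and brackets the active one, e.g.
        # "writethrough [writeback] writearound none" (assumed layout).
        modes = (attrs / "cache_mode").read_text().split()
        active = next((m.strip("[]") for m in modes if m.startswith("[")), None)
        state = (attrs / "state").read_text().strip()
        status[dev.name] = (active, state)
    return status

def safe_to_upgrade(status):
    """True if no device is in writeback mode and none still holds dirty data."""
    return all(
        mode != "writeback" and state in ("clean", "no cache")
        for mode, state in status.values()
    )

if __name__ == "__main__":
    current = bcache_status()
    for name, (mode, state) in current.items():
        print(f"{name}: mode={mode} state={state}")
    print("safe to upgrade:", safe_to_upgrade(current))
```

The sketch only reads sysfs and never changes the cache mode, so it should be harmless to run. Any device still reported as `writeback` or `dirty` is one to switch and wait on before rebooting.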
Thanks,
--
Matthew