Re: Corruption suspiciously soon after upgrade to 5.14.1; filesystem less... | linux-btrfs

Re: Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old

From: Qu Wenruo <hidden>
Date: 2021-09-11 01:05:59


On 2021/9/11 6:34 AM, Sam Edwards wrote:

On Fri, Sep 10, 2021 at 2:31 AM Qu Wenruo [off-list ref] wrote:

quoted


If you have hit a read-only problem, then it's definitely btrfs's problem,
and we're very interested to see what the cause is.

I'm in full agreement that RO is a sign of a malfunction; but I was
saying that the malfunction may be deeper in the stack, meaning the RO
is not necessarily due to a design/implementation error in btrfs, but
is rather the only sensible course of action when faced with certain
circumstances outside of its control.

quoted

Checksum error doesn't sound correct. Can be another indication of
missing writes.

But this also means, the corruption is even older.

Here is the checksum error (the very first indication of something
amiss on the day in question):
BTRFS warning (device dm-0): checksum verify failed on 1065332064256
wanted 0x04ca393a found 0xd5f0b823 level 0
BTRFS error (device dm-0): parent transid verify failed on
1065332064256 wanted 66552 found 66543

This is even weirder.

Firstly, both messages are for the same block, which means the first
copy is not correct, but it's not completely garbage.

If it were completely garbage, its logical bytenr would not match (btrfs
checks the very basic things like logical bytenr/fsid, then checks the csum).

Thus it looks like either the data is not correct, or the bytenr check
just passed by pure coincidence (which I don't believe, as it also passed
the fsid check).

Then the 2nd copy passed all other checks except transid.

This looks very weird.
If it's btrfs causing the problem, both copies should have the same
problem, not just one copy with a csum mismatch and another with a transid
mismatch.

Since it's coupled directly with a transid mismatch, I think this just
means the csum tree is current while the node is not.

For metadata, the csum is inlined into the header, not in csum tree.
Csum tree is only for data.
Thus it's not possible for csum to mismatch with its data.

That is,
0x04ca393a is the correct checksum for generation 66552 of leaf
1065332064256, but that generation has gone missing and instead we
find generation 66543, which has checksum 0xd5f0b823.

Nope, the csum is inside the tree block (along with its bytenr and fsid).
It looks more like the latter part of the tree block mirror 1 got
overwritten or corrupted.
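
To make the check order above concrete, here is a minimal Python sketch, not the kernel code. It assumes the default crc32c checksum type and the standard on-disk tree block header layout; `check_tree_block` and its return strings are illustrative names, and the CRC is implemented in pure Python only to keep the example self-contained.

```python
import struct

# On-disk tree block header, little-endian, 101 bytes:
# csum[32], fsid[16], bytenr, flags, chunk_tree_uuid[16],
# generation, owner, nritems, level
HEADER = struct.Struct("<32s16sQQ16sQQIB")


def _make_table():
    """Build the lookup table for CRC-32C (reflected polynomial 0x82F63B78)."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data):
    """Standard CRC-32C over a bytes-like object."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def check_tree_block(block, want_bytenr, want_fsid, want_transid):
    """Hypothetical helper: basic identity checks first, then csum, then transid."""
    csum, fsid, bytenr, _flags, _uuid, generation, _owner, _nritems, _level = (
        HEADER.unpack_from(block)
    )
    if bytenr != want_bytenr:
        return "bad tree block start"
    if fsid != want_fsid:
        return "bad fsid"
    # The csum lives inside the block itself and covers everything after
    # the 32-byte csum field; crc32c only uses the first 4 bytes of it.
    if int.from_bytes(csum[:4], "little") != crc32c(block[32:]):
        return "checksum verify failed"
    if generation != want_transid:
        return "parent transid verify failed"
    return "ok"
```

In this sketch, a block whose header is intact but whose tail was damaged fails at the csum step, while a stale but internally consistent block passes the csum and fails only at the transid step, which is the pattern seen on the two mirrors.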

quoted

As long as you don't do forced shutdown, nor btrfs check --repair, the
v1 cache should not mismatch.

I have never even heard of btrfs check --repair until this incident (a
testament to btrfs's general durability).

I checked and both shutdowns immediately before the "has wrong amount
of free space" warnings were clean. On the 10th, there was an unclean
shutdown a little earlier in the day - there may have been some
leftover issues from that.

I guess all the problems happened at that unclean shutdown.

quoted

This is too crazy that I can't even imagine what could survive.

[...]

But to me, this is really too crazy...

It is a contrived idea, yes. :)

But the subject it explores is relevant: how btrfs would react to a
slice of an active partition spontaneously reverting to the data it
held several minutes prior.

That means the disks are not respecting FLUSH.

A flush command means the disk should only return after all the data in
its volatile cache has been written to disk or to non-volatile cache.

If the disks (including the dm layer) just return without really writing
back all the data to non-volatile storage, and then a power loss happens,
that's exactly how the transid mismatch would happen.
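
The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyDisk`, `commit`, and `crash_after_commit` are hypothetical names, the "disk" is two dictionaries, and the commit sequence is reduced to "write the tree block, flush, write the superblock, flush".

```python
class ToyDisk:
    """Toy block device with a volatile write cache (not a real driver)."""

    def __init__(self, honors_flush):
        self.honors_flush = honors_flush
        self.durable = {}  # survives power loss
        self.cache = {}    # volatile, lost on power loss

    def write(self, addr, value):
        self.cache[addr] = value

    def flush(self):
        # A well-behaved device persists the whole cache before returning.
        # A misbehaving one reports success without persisting anything.
        if self.honors_flush:
            self.durable.update(self.cache)
            self.cache.clear()

    def writeback(self, addr):
        # The device may persist any cached sector on its own schedule.
        if addr in self.cache:
            self.durable[addr] = self.cache.pop(addr)

    def power_loss(self):
        self.cache.clear()


def commit(disk, block_addr, super_addr, generation):
    """Simplified commit: the tree block must be durable before the superblock."""
    disk.write(block_addr, generation)
    disk.flush()
    disk.write(super_addr, generation)
    disk.flush()


def crash_after_commit(honors_flush):
    disk = ToyDisk(honors_flush)
    disk.durable = {"block": 66543, "super": 66543}  # older, consistent state
    commit(disk, "block", "super", 66552)
    disk.writeback("super")  # device happens to persist only the superblock
    disk.power_loss()
    return disk.durable["super"], disk.durable["block"]
```

With `honors_flush=True` both values come back as 66552. With `honors_flush=False` the superblock points at generation 66552 while the tree block is still at 66543, which is the shape of a "parent transid verify failed" report.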

That is, a "missing writes" problem where the writes don't go missing
until a few minutes *after* they succeed.

quoted

The final protection is the logical bytenr, where btrfs can map its
chunks at any logical bytenr, thus even at the same physical location,
they can have different logical bytenr.

THIS is an interesting lead. Until this point I had been interpreting
bytenr as a physical partition offset. Now that I've learned about the
chunk tree, I found that all missing writes were to chunk
1065173909504.

That can be caused by the fact that all the newer metadata writes were
just allocated inside chunk 1065173909504.

That chunk has a physical offset of 999675658240. So, there is exactly
a 61 GiB difference between bytenr and partition offset. (This seems
to be true of the neighboring chunks as well, and as that's a nice
round number, I think this is the correct offset.)

However, if the physical offset were to change momentarily (i.e. for a
few minutes), then writes to that chunk would end up diverted to some
other location on disk. Once the physical offset changes back, the
chunk will appear to revert back to the same data it held a few
minutes prior. In effect, causing the "retroactive missing writes"
phenomenon I'm seeing.
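
The offset arithmetic above can be checked with a short Python sketch. It assumes a single-device, non-striped chunk, so the mapping is a plain linear offset; the 1 GiB chunk length is an assumption for the bounds check (it is not stated in this thread), and `logical_to_physical` is a hypothetical helper name.

```python
GIB = 1 << 30

CHUNK_LOGICAL = 1065173909504   # chunk start (logical bytenr), from the thread
CHUNK_PHYSICAL = 999675658240   # chunk start (partition offset), from the thread
CHUNK_LENGTH = 1 * GIB          # ASSUMPTION: not given in the thread


def logical_to_physical(logical,
                        chunk_logical=CHUNK_LOGICAL,
                        chunk_physical=CHUNK_PHYSICAL,
                        chunk_length=CHUNK_LENGTH):
    """Map a logical bytenr to a partition offset within one linear chunk."""
    offset = logical - chunk_logical
    if not 0 <= offset < chunk_length:
        raise ValueError("bytenr is outside this chunk")
    return chunk_physical + offset


# The logical/physical difference is exactly 61 GiB.
delta = CHUNK_LOGICAL - CHUNK_PHYSICAL
print(delta, delta == 61 * GIB)             # 65498251264 True

# The leaf from the kernel log, 1065332064256, maps to this partition offset.
print(logical_to_physical(1065332064256))   # 999833812992
```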

I don't think there is anything related to a sudden physical offset
change, or the kernel would report things like "bad tree block start, want
%llu have %llu".

Thus I still think there is something between btrfs and the disks that
causes FLUSH commands to be incorrectly executed.

Thanks,
Qu

This would also leave behind evidence, in that the missing writes
would have to have gone *somewhere* on disk, and as long as they
weren't overwritten, I can track them down by scanning the whole disk
for tree nodes with the proper bytenr/transid. I think I'll spend some
time today trying to do that, as that would confirm this idea.
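
A scan like the one described above could be sketched in Python as follows. The assumptions are: the image is of the decrypted (dm-crypt opened) device, tree blocks start on 4 KiB boundaries, and the header uses the standard on-disk layout. `find_tree_block_copies` is a hypothetical helper, not an existing tool.

```python
import struct

# On-disk tree block header, little-endian:
# csum[32], fsid[16], bytenr, flags, chunk_tree_uuid[16],
# generation, owner, nritems, level
HEADER = struct.Struct("<32s16sQQ16sQQIB")


def find_tree_block_copies(image, fsid, bytenr, step=4096):
    """Return (offset, generation, level) for every header in `image`
    whose fsid and logical bytenr match, wherever it physically sits."""
    hits = []
    for off in range(0, len(image) - HEADER.size + 1, step):
        (_csum, blk_fsid, blk_bytenr, _flags, _uuid,
         generation, _owner, _nritems, level) = HEADER.unpack_from(image, off)
        if blk_fsid == fsid and blk_bytenr == bytenr:
            hits.append((off, generation, level))
    return hits
```

`image` can be any bytes-like object; for a multi-hundred-gigabyte image, passing an `mmap.mmap` of the file avoids reading it all into memory. A hit with generation 66552 at an unexpected offset would support the diverted-write idea, while finding only generation 66543 would point back toward lost writes.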

The only remaining question is why the physical offset would have
changed for only a few minutes. I didn't do a rebalance, although I
think I was running low on available space around the time, so maybe
btrfs was making some last-minute adjustments to the chunk tree to
compensate for that? The transid of the chunk tree node describing
this chunk is 58325, which is well before the problems started
happening. Perhaps the chunk tree was updated in-memory, used for
physical writes, but then reverted? Does this sound like something
btrfs might do?

Or maybe a cosmic ray flipped a bit in the in-memory copy of the
physical offset of the chunk. Unlikely, but possible. :)

quoted

I'd say, there is no way to repair.
Only data salvage is possible for generic transid mismatch.

Bah. Well, not a problem -- but that will take me a fair amount of
time. I'll want to investigate this and figure out what went wrong
*before* I go through the trouble of recreating my filesystem. I don't
want to spend a day restoring backups only to have the same problem
happen again a week later.

quoted

Or is the whole thing FUBAR at this point and I should just zero the
partition and restore from backups?

I guess so.

Repair for a transid mismatch is never guaranteed to be safe, as a core
btrfs mechanism is already broken.

Thanks,
Qu

quoted

Thank you for your time,
Sam

quoted

Thanks,
Qu

quoted

Cheers,
Sam


On Wed, Sep 8, 2021 at 6:47 PM Sam Edwards [off-list ref] wrote:

quoted

Hello list,

First, I should say that there's no urgency here on my part.
Everything important is very well backed up, and even the
"unimportant" files (various configs) seem readable. I imaged the
partition without even attempting a repair. Normally, my inclination
would be to shrug this off and recreate the filesystem.

However, I'd like to help investigate the root cause, because:
1. This happened suspiciously soon (see my timeline in the link below)
after upgrading to kernel 5.14.1, so may be a serious regression.
2. The filesystem was created less than 5 weeks ago, so the possible
causes are relatively few.
3. My last successful btrfs scrub was just before upgrading to 5.14.1,
hopefully narrowing possible root causes even more.
4. I have imaged the partition and am thus willing to attempt risky
experimental repairs. (Mostly for the sake of reporting if they work.)

Disk setup: NVMe SSD, GPT partition, dm-crypt, btrfs as root fs (no LVM)
OS: Gentoo
Earliest kernel ever used: 5.10.52-gentoo
First kernel version used for "real" usage: 5.13.8
Relevant information: See my Gist,
https://gist.github.com/CFSworks/650280371fc266b2712d02aa2f4c24e8
Misc. notes: I have run "fstrim /" on occasion, but don't have
discards enabled automatically. I doubt TRIM is the culprit, but I
can't rule it out.

My primary hypothesis is that there's some write bug in Linux 5.14.1.
I installed some package updates right before btrfs detected the
problem, and most of the files in the `btrfs check` output look like
they were created as part of those updates.

My secondary hypothesis is that creating and/or using the swapfile
caused some kind of silent corruption that didn't become a detectable
issue until several further writes later.

Let me know if there's anything else I should try/provide!

Regards,
Sam
