Re: Corruption suspiciously soon after upgrade to 5.14.1; filesystem less... | linux-btrfs

Re: Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old

From: Sam Edwards <hidden>
Date: 2021-09-14 23:59:12

On Sun, Sep 12, 2021 at 5:07 PM Zygo Blaxell wrote:

There is less than 256MB distance from the first to the last, but they
occupy two separate 256MB-aligned regions (f800000000 and f810000000).
If there is dup metadata then these blocks occupy four separate 256MB
regions, as there is some space between duplicate regions (in logical
address space).

Okay, I see. You were looking at alignment, not range. I didn't think
you were trying to determine LBA alignment because you were working
from offsets within the LUKS space, without the benefit of the
LUKS+partition offsets. (Adding in those offsets doesn't change the
alignment situation much, by the way.)
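
To make the alignment-versus-range distinction concrete, here is a
minimal Python sketch. The two region bases (f800000000 and
f810000000) come from the quoted text; the `first` and `last`
addresses are hypothetical values picked only to show that two blocks
much closer together than 256MiB can still land in different
256MiB-aligned regions.

```python
REGION = 256 * 1024 * 1024  # 256 MiB, the alignment under discussion


def region_base(addr: int) -> int:
    """Round a logical address down to its 256 MiB-aligned region base."""
    return addr - (addr % REGION)


def region_index(addr: int) -> int:
    """Index of the 256 MiB-aligned region containing addr."""
    return addr // REGION


# Hypothetical block addresses (NOT taken from the actual error list).
first = 0xF80F000000
last = 0xF810100000

span = last - first  # range: distance from first to last block

print(f"span          = {span // (1024 * 1024)} MiB")
print(f"first region  = {region_base(first):#x}")
print(f"last region   = {region_base(last):#x}")
print(f"same region?  = {region_index(first) == region_index(last)}")
```

The span here is far below 256MiB, yet the two addresses fall on either
side of the f810000000 boundary, so they occupy two aligned regions.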

Oh, for what it's worth: I don't think there was dup metadata. I
didn't pass any flags to mkfs.btrfs when I built the filesystem (so it
should have used the SSD default of single), and when I manually
scanned the LUKS volume for duplicate metadata blocks to test my
misdirected-writes guess, I didn't find any two blocks with the same
bytenr. I still have the filesystem image in case there's any doubt.
But it also may not be pertinent at this point.

256MB is far too large for a plausible erase block size, 1GB is even
less likely.

Agreed - orders of magnitude too large, if you ask me. But I'm also
not discounting the possibility that the erase block size is big
enough to cover 256MiB worth of (SSD, not btrfs) metadata.

That's not how write caches work--they are not giant FIFO queues.

Oh no, I think I didn't make myself clear. I wasn't suggesting
anything to do with the order of the write operations, nor trying to
indicate that I was confused about what I was seeing.

I feel bad now; you provided a good explanation of this stuff, and I
appreciate the time taken to type all of that out, but I also didn't
benefit from the explanation as I'm nowhere near as green to memory
consistency as I am to btrfs internals. (Maybe someone reading this
email thread in the future will find this info useful?)

Hopefully I can rephrase what I was saying more clearly:
During the incident, there was some x% chance of writes not taking
effect. (And since there are a few metadata checksum errors indicating
torn writes, I think I can safely say that x<100. Probably by a wide
margin, given there are only 9000 unique error items, which is much
less than your expectation of 60K.)
The part that first caught my eye was that, for writes *not* to chunk
1065173909504, x=0; i.e., the probability of loss was conditional on
LBA. Note that I am only using the chunk as a spatial grouping, not
saying that the chunk itself is to blame.
What then made me find this interesting (a better word is, perhaps,
"distinctive"), was that the pattern of writes lost during
transactions 66552, 66591, 66655, and 66684 all followed these very
same statistics.

That eliminates a whole class of bugs (because certain cache
replacement policies and/or data structures cannot follow this
pattern, so if the SSD uses one of those for its write cache, it means
the culprit can't be the write cache) while making others more likely
(e.g. if the SSD is not respecting flush and the write cache is
keeping pages in a hash table bucketed by the first N bits of LBA,
then this could easily be a failure of the write cache's hash table).
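
As a rough illustration of that last hypothesis, the Python sketch
below models a write cache bucketed by the high bits of the LBA, with
one bucket that drops some of its writes. Everything in it is an
assumption made for the example (the `BUCKET_SHIFT` width, the number
of buckets, the 50% loss probability, the function names); nothing is
known about how this SSD's firmware is actually organized.

```python
import random
from collections import Counter

BUCKET_SHIFT = 16  # hypothetical: bucket key = LBA with the low 16 bits dropped


def bucket_of(lba: int) -> int:
    """Bucket key taken from the high bits of the LBA."""
    return lba >> BUCKET_SHIFT


def simulate_losses(lbas, bad_bucket, loss_prob, seed=0):
    """Return the writes that are lost: only writes that hash into
    bad_bucket can be dropped, each with probability loss_prob."""
    rng = random.Random(seed)
    return [lba for lba in lbas
            if bucket_of(lba) == bad_bucket and rng.random() < loss_prob]


def loss_rate_by_bucket(lbas, lost):
    """Fraction of writes lost, grouped by bucket."""
    written = Counter(bucket_of(lba) for lba in lbas)
    dropped = Counter(bucket_of(lba) for lba in lost)
    return {b: dropped[b] / n for b, n in written.items()}


# 8000 writes spread at random over 8 hypothetical buckets.
gen = random.Random(1)
lbas = [gen.randrange(0, 8 << BUCKET_SHIFT) for _ in range(8000)]

lost = simulate_losses(lbas, bad_bucket=3, loss_prob=0.5)
rates = loss_rate_by_bucket(lbas, lost)

for b in sorted(rates):
    print(f"bucket {b}: {rates[b]:.0%} of writes lost")
```

In this toy model the loss rate is nonzero but below 100% inside one
LBA range and exactly zero everywhere else, which is the shape of the
statistics described above. A cache whose replacement policy ignores
the LBA would not concentrate losses this way.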

And if nothing else, it gives us a way to recognize this particular
problem, in case someone else shows up with similar errors.

That often happens to drives as they fail.  The firmware reboots due to
a bug in error-handling code or a hardware fault, and doesn't remember
what was in its write cache.

Also if there is a transport failure and the host resets the bus, the
drive firmware might have its write cache forcibly erased before being
able to write it.  The spec says that doesn't happen, but some vendors
are demonstrably unable to follow spec.

If the problem is in the command queue block, then the writes are
being lost before the write cache is even involved.

That's an SSD-specific restatement of #1 (failure to persist data before
reporting successfully completed write to the host, and returning previous
versions of data on later reads of the same address).

But then here the writes are being lost *after* the write cache is
involved. In case #1, I could just opt out of the write cache to get
my SSD working reliably again, but I can't opt out of LBA remapping
since it's required for the longevity of the drive.

SSDs don't necessarily erase old blocks immediately--a large, empty or
frequently discarded SSD might not erase old blocks for months.

The blocks weren't erased even after a few days of analysis. But case
#3 is really unlikely either way, because LBA mapper bugs should take
a *bunch* of stuff down with them.

All of the above fit into the general category of "drive drops some
writes, out of order, when some triggering failure occurs."

Yep -- although I would include the nearby LBAs condition in the
symptoms list. That's statistically significant, especially for SSD!
:)

If you have
access to the drive's firmware on github, you could check out the code,
determine which bug is occurring, and send a pull request with the fix.
If you don't, usually the practical solution is to choose a different
drive vendor, unless you're ordering enough units to cause drive
manufacturer shareholders to panic when you stop.

Oh man, there's sometimes drive firmware on GitHub?! I would be
pleasantly surprised to find my SSD's firmware there. Alas, a cursory
search suggests it isn't.

I didn't choose the drive; it's a whitelabel OEM unit. If I can't get
in touch with the vendor "officially" (which I probably can't), I'll
try getting the attention of the employee/contractor responsible for
maintaining the firmware. I've had moderate success with that in the
past. That's a large part of my motivation for nailing down exactly
*what* in the SSD is misbehaving (if indeed the SSD is misbehaving),
to maximize my chances that such a bug report is deemed "worth their
time." :)

Yeah, if something horrible happened in the Linux 5.14 baremetal NVMe
hardware drivers or PCIe subsystem in general, then it could produce
symptoms like these.  It wouldn't be the first time a regression in
other parts of Linux was detected by a flood of btrfs errors.

Device resets might trigger write cache losses and then all the above
"firmware" symptoms.  In that case the firmware is not at fault, since
it is being disrupted by the host (unless you are a stickler for the
letter of the spec, which says the write cache must be immune to host
action).

On the other hand, a problem that surfaces only with a new version of
Linux isn't _necessarily_ a regression in Linux. My working hypothesis
currently is that there's a corner-case in the SSD firmware, where
certain (perfectly cromulent) sequences of NVMe commands would trigger
this bug, and the reason it has never been detected in testing is that
the first NVMe driver in the world to hit the corner-case is the
version shipped in Linux 5.14, after the SSD was released. That'd be a
nice outcome since Linux can spare other users a similar fate with the
addition of a device quirk that avoids the corner-case.

But I haven't ruled out what you've said about a full-blown Linux
regression, or even just a problem that happens every few weeks (due
to some internal counter in the SSD overflowing, perhaps -- the
problem did manifest awfully soon after the 2^16th transaction, after
all). All possibilities remain open. :)
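
For reference, the arithmetic behind "awfully soon after the 2^16th
transaction" is below, using the four transaction numbers listed
earlier in this message. Whether any counter inside the SSD tracks
something that wraps at 2^16 is pure speculation; the snippet only
shows how far past 65536 the lossy transactions were.

```python
WRAP = 2 ** 16  # 65536; a hypothetical 16-bit counter would wrap here

# Transactions during which writes were lost (from this message).
lossy_transactions = [66552, 66591, 66655, 66684]

# How many transactions past the 2^16 mark each one was.
past_wrap = [t - WRAP for t in lossy_transactions]

for t, d in zip(lossy_transactions, past_wrap):
    print(f"transaction {t}: {d} past 2^16")
```

So the first lossy transaction came 1016 transactions after the 2^16
mark, and all four fall within 1148 of it.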

Linux 5.14 btrfs on VMs seems OK.  I run tests continuously on new Linux
kernels to detect btrfs and lvm regressions early, and nothing like this
has happened on my humble fleet.  My test coverage is limited--it
won't detect a baremetal NVMe transport issue, as that's handled by the
host kernel not the VM guest.

I wonder if there'd be some value in setting up PCIe passthrough
straight into the VM for some/all of your NVMe devices? Does your VM
host have an IOMMU to allow that? Any interest in me donating to you
my SSD (which is M.2) if I decide to replace it?

Sound methodology.

Wish me luck,

Good luck!

Thank you! And: start the clock. I've erased+rebuilt the filesystem
today and am continuing to use it as I was before. I'll be making a
second attempt at installing those package updates later as well.

If there are any other questions, I'll happily answer. Otherwise I
won't be following up unless/until I encounter more instances of this
bug. So, to anyone reading this list archive well in the future: if
this is the last message from me, it means the problem was one-off.
Sorry.

Signing off,
Sam
