Re: BTRFS critical (device sda2): corrupt leaf, bad key order: block=293438636032, root=1, slot=11
From: Eric Wolf <hidden>
Date: 2017-09-01 13:39:08
Okay, I have a hex editor open. Now what? Your instructions seem straightforward, but I have no idea what I'm doing.

---
Eric Wolf
(201) 316-6098
19wolf@gmail.com

On Thu, Aug 31, 2017 at 4:11 PM, Hugo Mills [off-list ref] wrote:
On Thu, Aug 31, 2017 at 03:21:07PM -0400, Eric Wolf wrote:
I've previously confirmed it's a bad ram module which I have already submitted an RMA for. Any advice for manually fixing the bits?

What I'd do... use a hex editor and the contents of ctree.h as documentation to find the byte in question, change it back to what it should be, mount the FS, try reading the directory again, look up the csum failure in dmesg, edit the block again to fix up the csum, and it's done. (Yes, I've done this before, and I'm a massive nerd).

It's also possible to use Hans van Kranenberg's btrfs-python to fix up this kind of thing, but I've not done it myself. There should be a couple of talk-throughs from Hans in various archives -- both this list (find it on, say, http://www.spinics.net/lists/linux-btrfs/), and on the IRC archives (http://logs.tvrrug.org.uk/logs/%23btrfs/latest.html).
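For the "find the byte in question" step, here is a minimal Python sketch of the arithmetic. It assumes the leaf layout as I read it from ctree.h: a 101-byte `struct btrfs_header`, followed by an array of 25-byte `struct btrfs_item` entries whose first field is the little-endian 64-bit objectid. The helper names (`objectid_offset`, `read_objectid`, `patch_objectid`) and the 16 KiB buffer are made up for illustration, and the demo runs on a synthetic zero-filled buffer, not on a real leaf.

```python
import struct

# Sizes as I read them from ctree.h; double-check against your kernel source.
HEADER_SIZE = 101  # struct btrfs_header: csum[32] fsid[16] bytenr(8) flags(8)
                   # chunk_tree_uuid[16] generation(8) owner(8) nritems(4) level(1)
ITEM_SIZE = 25     # struct btrfs_item: 17-byte key + u32 offset + u32 size


def objectid_offset(slot):
    """Offset, from the start of the leaf, of the objectid of item `slot`."""
    return HEADER_SIZE + slot * ITEM_SIZE


def read_objectid(leaf, slot):
    """Decode the little-endian u64 objectid of item `slot`."""
    return struct.unpack_from("<Q", leaf, objectid_offset(slot))[0]


def patch_objectid(leaf, slot, expected_old, new):
    """Return a patched copy of `leaf`; refuse if the old value is unexpected."""
    found = read_objectid(leaf, slot)
    if found != expected_old:
        raise ValueError(f"slot {slot} holds {found}, expected {expected_old}")
    patched = bytearray(leaf)
    struct.pack_into("<Q", patched, objectid_offset(slot), new)
    return bytes(patched)


# Demonstration on a synthetic, zero-filled buffer (NOT a real leaf).
leaf = bytearray(16384)  # 16 KiB nodesize is an assumption
struct.pack_into("<Q", leaf, objectid_offset(12), 856762)

fixed = patch_objectid(bytes(leaf), 12, 856762, 890554)
changed = [(i, leaf[i], fixed[i]) for i in range(len(leaf)) if leaf[i] != fixed[i]]
for offset, old, new in changed:
    print(f"leaf offset {offset}: {old:#04x} -> {new:#04x}")
```

Under those assumptions, only one byte changes: offset 402 within the leaf, from 0x12 to 0x96. On the real filesystem, block=293438636032 is a logical address, so it has to be translated to an offset on /dev/sda2 before that leaf-relative offset means anything (btrfs-progs ships a btrfs-map-logical tool that may help with this). The sketch also does not touch the checksum, so the csum fix-up step from the procedure above still applies.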
Sorry for top leveling, not sure how mailing lists work (again sorry if this message is top leveled, how do I ensure it's not?)

Just write your answers _after_ the quoted text that you're replying to, not before. It's a convention, rather than a technical thing...

Hugo.
---
Eric Wolf
(201) 316-6098
19wolf@gmail.com

On Thu, Aug 31, 2017 at 2:59 PM, Hugo Mills [off-list ref] wrote:
(Please don't top-post; edited for conversation flow)

On Thu, Aug 31, 2017 at 02:44:39PM -0400, Eric Wolf wrote:
On Thu, Aug 31, 2017 at 2:33 PM, Hugo Mills [off-list ref] wrote:
On Thu, Aug 31, 2017 at 01:53:58PM -0400, Eric Wolf wrote:
I'm having issues with a bad block(?) on my root ssd. dmesg is consistently outputting

"BTRFS critical (device sda2): corrupt leaf, bad key order: block=293438636032, root=1, slot=11"

"btrfs scrub stat /" outputs

"scrub status for b2c9ff7b-[snip]-48a02cc4f508
    scrub started at Wed Aug 30 11:51:49 2017 and finished after 00:02:55
    total bytes scrubbed: 53.41GiB with 2 errors
    error details: verify=2
    corrected errors: 0, uncorrectable errors: 2, unverified errors: 0"

Running "btrfs check --repair /dev/sda2" from a live system stalls after telling me corrupt leaf etc etc then "11 12". CPU usage hits 100% and disk activity remains at 0.

This error is usually attributable to bad hardware. Typically RAM, but might also be marginal power regulation (blown capacitor somewhere) or a slightly broken CPU.

Can you show us the output of "btrfs-debug-tree -b 293438636032 /dev/sda2"?

Here's the culprit:

[snip]
    item 10 key (890553 EXTENT_DATA 0) itemoff 14685 itemsize 269
        inline extent data size 248 ram 248 compress 0
    item 11 key (890554 INODE_ITEM 0) itemoff 14525 itemsize 160
        inode generation 5386763 transid 5386764 size 135 nbytes 135 block group 0 mode 100644 links 1 uid 100000 gid 100000 rdev 0 flags 0x0
    item 12 key (856762 INODE_REF 31762) itemoff 14496 itemsize 29
        inode ref index 2745 namelen 19 name: dpkg.statoverride.0
    item 13 key (890554 EXTENT_DATA 0) itemoff 14340 itemsize 156
        inline extent data size 135 ram 135 compress 0
[snip]

Note the objectid field -- the first number in the brackets after "key" for each item. This sequence of values should be non-decreasing. Thus, item 12 should have an objectid of 890554 to match the items either side of it, and instead it has 856762.

In hex, these are:
>>> hex(890554)
'0xd96ba'
>>> hex(856762)
'0xd12ba'

Which means you've had two bitflips close together:

>>> hex(856762 ^ 890554)
'0x8400'

Given that everything else is OK, and it's just one byte affected in the middle of a load of data that's really quite sensitive to errors, it's very unlikely that it's the result of a misplaced pointer in the kernel, or some other subsystem accidentally walking over that piece of RAM. It is, therefore, almost certainly your hardware that's at fault.

I would strongly suggest running memtest86 on your machine -- I'd usually say a minimum of 8 hours, or longer if you possibly can (24 hours), or until you have errors reported. If you get errors reported in the same place on multiple passes, then it's the RAM. If you have errors scattered around seemingly at random, then it's probably your power regulation (PSU or motherboard).

Sadly, btrfs check on its own won't be able to fix this, as it's two bits flipped. (It can cope with one bit flipped in the key, most of the time, but not two). It can be fixed manually, if you're familiar with a hex editor and the on-disk data structures.

Hugo.

-- 
Hugo Mills             | "There's a Martian war machine outside -- they want
hugo@... carfax.org.uk | to talk to you about a cure for the common cold."
http://carfax.org.uk/  |
PGP: E2AB1DE4          | Stephen Franklin, Babylon 5
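The key-order and bit-flip reasoning in the analysis above can be reproduced with a short Python sketch. It assumes btrfs keys compare as (objectid, type, offset) tuples and uses the key type numbers from ctree.h (INODE_ITEM = 1, INODE_REF = 12, EXTENT_DATA = 108). The helper names `first_bad_slot` and `flipped_bits` are made up for illustration and are not part of any btrfs tool.

```python
# Key type numbers as defined in the btrfs on-disk format (ctree.h).
INODE_ITEM, INODE_REF, EXTENT_DATA = 1, 12, 108

# (objectid, type, offset) for items 10..13, copied from the dump above.
FIRST_SLOT = 10
keys = [
    (890553, EXTENT_DATA, 0),
    (890554, INODE_ITEM, 0),
    (856762, INODE_REF, 31762),
    (890554, EXTENT_DATA, 0),
]


def first_bad_slot(keys, first_slot=0):
    """Return the first slot whose key sorts after its successor, or None."""
    for i in range(len(keys) - 1):
        if keys[i] > keys[i + 1]:
            return first_slot + i
    return None


def flipped_bits(a, b):
    """Return the bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [n for n in range(diff.bit_length()) if diff >> n & 1]


bad = first_bad_slot(keys, FIRST_SLOT)
bits = flipped_bits(856762, 890554)
print(bad, hex(856762 ^ 890554), bits)  # 11 0x8400 [10, 15]

# With the objectid of item 12 restored, the four keys are in order again.
repaired = list(keys)
repaired[2] = (890554, INODE_REF, 31762)
print(first_bad_slot(repaired, FIRST_SLOT))  # None
```

The first out-of-order pair is found at slot 11, which is consistent with slot=11 in the dmesg message. The two differing bits are bits 10 and 15, both inside the second-lowest byte of the objectid, which matches the "just one byte affected" observation.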