Re: Filesystem corruption after unreachable storage

From: Jean-Louis Dupond <hidden>
Date: 2020-02-20 09:14:29

The dumpe2fs attachment seems to have been blocked.
I uploaded it here: http://dupondje.be/dumpe2fs.txt

On 20/02/2020 10:08, Jean-Louis Dupond wrote:

As the mail seems to have been trashed somewhere, I'll retry :)

Thanks
Jean-Louis


On 24/01/2020 21:37, Theodore Y. Ts'o wrote:

quoted

On Fri, Jan 24, 2020 at 11:57:10AM +0100, Jean-Louis Dupond wrote:

quoted

There was a short disruption of the SAN, which caused it to be
unavailable for 20-25 minutes for the ESXi.

20-25 minutes is "short"? I guess it depends on your definition /
POV. :-)

Well, the recovery (due to the manual fsck) caused more downtime than
the storage outage itself :)

quoted

What worries me is that almost all of the VMs (out of 500) were
showing the same error.

So that's a bit surprising...

Indeed, that's where I thought something went wrong!
I've tried to simulate it, and was able to reproduce the same error
when we let the SAN recover BEFORE the VM is shut down.
If I power off the VM and then recover the SAN, it does an automatic
fsck without problems.
So it really seems to break at the moment the VM can write to the SAN again.

quoted

And even some (+-10) were completely corrupt.

What do you mean by "completely corrupt"? Can you send an e2fsck
transcript of file systems that were "completely corrupt"?

Well, it was moving tons of files to lost+found etc. So that was
really broken.
I'll see if I can recover a backup of one in the broken state.
Anyway, this was only a very small percentage, so it worries me less than
the rest :)

quoted

Is there, for example, a chance that the filesystem gets corrupted the
moment the SAN storage becomes accessible again?

Hmm... the one possibility I can think of off the top of my head is
that in order to mark the file system as containing an error, we need
to write to the superblock. The head of the linked list of orphan
inodes is also in the superblock. If that had gotten modified in the
intervening 20-25 minutes, it's possible that this would result in
orphaned inodes not on the linked list, causing that error.

It doesn't explain the more severe cases of corruption, though.
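
The superblock fields mentioned above can be read from `dumpe2fs -h` output. The helper below is only a sketch for pulling one field out of that text; the function name `sb_field` is made up for illustration, and the label "First orphan inode" is, as far as I know, only printed when the orphan list head is non-zero.

```shell
# Print the value of one "Label:   value" field from `dumpe2fs -h` output on stdin.
# sb_field is a hypothetical helper name, not part of e2fsprogs.
sb_field() {
    awk -v key="$1" 'index($0, key ":") == 1 {
        sub(/^[^:]*:[ \t]*/, "")   # drop the label and the padding after it
        print
        exit
    }'
}

# Example (needs root; the device path is the one from the e2fsck transcript):
#   dumpe2fs -h /dev/mapper/vg01-root 2>/dev/null | sb_field "First orphan inode"
#   dumpe2fs -h /dev/mapper/vg01-root 2>/dev/null | sb_field "Filesystem state"
```

Comparing these values between a snapshot taken before and after the storage comes back might show whether the superblock was rewritten in between.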

If fixing that would have left us with only 10 corrupt disks instead
of 500, that would be a big win :)

quoted

I also have a snapshot of a corrupted disk available if additional
debugging info is required.

Before e2fsck was run? Can you send me a copy of the output of
dumpe2fs run on that disk, and then transcript of e2fsck -fy run on a
copy of that snapshot?

Sure:
dumpe2fs -> see attachment

Fsck:
# e2fsck -fy /dev/mapper/vg01-root
e2fsck 1.44.5 (15-Dec-2018)
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found. Fix? yes

Inode 165708 was part of the orphaned inode list.  FIXED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -(863328--863355)
Fix? yes

Free blocks count wrong for group #26 (3485, counted=3513).
Fix? yes

Free blocks count wrong (1151169, counted=1151144).
Fix? yes

Inode bitmap differences:  -4401 -165708
Fix? yes

Free inodes count wrong for group #0 (2489, counted=2490).
Fix? yes

Free inodes count wrong for group #20 (1298, counted=1299).
Fix? yes

Free inodes count wrong (395115, counted=395098).
Fix? yes


/dev/mapper/vg01-root: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/vg01-root: 113942/509040 files (0.2% non-contiguous), 882520/2033664 blocks

quoted

It would be great to gather some feedback on how to improve the
situation (besides, of course, not having a SAN outage :)).

Something that you could consider is setting up your system to trigger
a panic/reboot on a hung task timeout, or when ext4 detects an error
(see the man page of tune2fs and mke2fs and the -e option for those
programs).

There are tradeoffs with this, but if you've lost the SAN for 15-30
minutes, the file systems are going to need to be checked anyway, and
the machine will certainly not be serving. So forcing a reboot might
be the best thing to do.
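
A rough sketch of what that setup could look like follows. It only prints the commands unless APPLY=1 is set. The `run` wrapper, the 120-second hung-task timeout and the 10-second reboot delay are illustrative assumptions, and the device name is simply the one from the transcript above.

```shell
#!/bin/sh
# Sketch only: prints the commands by default; set APPLY=1 to run them as root.
DEV="${DEV:-/dev/mapper/vg01-root}"   # device name taken from the e2fsck transcript

# Hypothetical dry-run wrapper: execute the command only when APPLY=1.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Make ext4 panic when it detects an error
# (tune2fs -e accepts continue, remount-ro or panic).
run tune2fs -e panic "$DEV"

# Panic when a task stays blocked longer than the hung-task timeout.
run sysctl -w kernel.hung_task_panic=1
run sysctl -w kernel.hung_task_timeout_secs=120   # example value

# Reboot automatically some seconds after a panic.
run sysctl -w kernel.panic=10                     # example value
```

The sysctl settings do not survive a reboot unless they are also placed in the sysctl configuration, while the tune2fs setting is stored in the superblock.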

Going to look into that! Thanks for the info.

quoted

On KVM, for example, there is an unlimited timeout (afaik) until the
storage is back, and the VM just continues running after storage recovery.

Well, you can adjust the SCSI timeout, if you want to give that a 
try....
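
For reference, the per-device SCSI command timeout is exposed in sysfs and can be read or changed roughly as sketched below. The function names and the `SYSFS` override are made up for this example, and "sda" and 180 seconds are placeholder values, not recommendations.

```shell
# Root of sysfs; overridable so the functions can be tried on a scratch directory.
SYSFS="${SYSFS:-/sys}"

# Show the current SCSI command timeout (in seconds) for a disk such as "sda".
get_scsi_timeout() {
    cat "$SYSFS/block/$1/device/timeout"
}

# Set a new timeout in seconds. Needs root and is not persistent across reboots.
set_scsi_timeout() {
    f="$SYSFS/block/$1/device/timeout"
    [ -w "$f" ] || { echo "cannot write $f" >&2; return 1; }
    echo "$2" > "$f"
}

# Example (as root, placeholder values):
#   get_scsi_timeout sda
#   set_scsi_timeout sda 180
```

To make such a setting stick it would have to be reapplied at boot, for instance from a udev rule.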

Does it have any other disadvantages? Or is it quite safe to increase the
SCSI timeout?

quoted

Cheers,

- Ted
