Re: Filesystem corruption after unreachable storage
From: Jean-Louis Dupond <hidden>
Date: 2020-02-20 09:14:29
The dumpe2fs attachment seems to get blocked. I uploaded it here: http://dupondje.be/dumpe2fs.txt

On 20/02/2020 10:08, Jean-Louis Dupond wrote:
> As the mail seems to have been trashed somewhere, I'll retry :)
>
> Thanks
> Jean-Louis
>
> On 24/01/2020 21:37, Theodore Y. Ts'o wrote:
>> On Fri, Jan 24, 2020 at 11:57:10AM +0100, Jean-Louis Dupond wrote:
>>> There was a short disruption of the SAN, which caused it to be unavailable for 20-25 minutes for the ESXi.
>>
>> 20-25 minutes is "short"? I guess it depends on your definition / POV. :-)
>
> Well, the recovery (due to the manual fsck) caused more downtime than the time the storage was down :)
>>> What worries me is that almost all of the VMs (out of 500) were showing the same error.
>>
>> So that's a bit surprising...
>
> Indeed, that's where I thought something went wrong here! I've tried to simulate it, and was able to reproduce the same error when we let the SAN recover BEFORE the VM is shut down. If I power off the VM and then recover the SAN, it does an automatic fsck without problems. So it really seems it breaks when the VM can write to the SAN again.
>>> And even some (+-10) were completely corrupt.
>>
>> What do you mean by "completely corrupt"? Can you send an e2fsck transcript of file systems that were "completely corrupt"?
>
> Well, it was moving tons of files to lost+found etc., so that was really broken. I'll see if I can recover a backup of one in a broken state. Anyway, this was only a very small percentage, so it worries me less than the rest :)
>>> Is there for example a chance that the filesystem gets corrupted the moment the SAN storage was back accessible?
>>
>> Hmm... the one possibility I can think of off the top of my head is that in order to mark the file system as containing an error, we need to write to the superblock. The head of the linked list of orphan inodes is also in the superblock. If that had gotten modified in the intervening 20-25 minutes, it's possible that this would result in orphaned inodes not on the linked list, causing that error. It doesn't explain the more severe cases of corruption, though.
>
> If fixing that would have left us with only 10 corrupt disks instead of 500, that would be a big win :)
>>> I also have some snapshot available of a corrupted disk if some additional debugging info is required.
>>
>> Before e2fsck was run? Can you send me a copy of the output of dumpe2fs run on that disk, and then transcript of e2fsck -fy run on a copy of that snapshot?
>
> Sure:
> dumpe2fs -> see attachment
>
> Fsck:
> # e2fsck -fy /dev/mapper/vg01-root
> e2fsck 1.44.5 (15-Dec-2018)
> Pass 1: Checking inodes, blocks, and sizes
> Inodes that were part of a corrupted orphan linked list found. Fix? yes
>
> Inode 165708 was part of the orphaned inode list. FIXED.
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> Block bitmap differences: -(863328--863355)
> Fix? yes
>
> Free blocks count wrong for group #26 (3485, counted=3513).
> Fix? yes
>
> Free blocks count wrong (1151169, counted=1151144).
> Fix? yes
>
> Inode bitmap differences: -4401 -165708
> Fix? yes
>
> Free inodes count wrong for group #0 (2489, counted=2490).
> Fix? yes
>
> Free inodes count wrong for group #20 (1298, counted=1299).
> Fix? yes
>
> Free inodes count wrong (395115, counted=395098).
> Fix? yes
>
> /dev/mapper/vg01-root: ***** FILE SYSTEM WAS MODIFIED *****
> /dev/mapper/vg01-root: 113942/509040 files (0.2% non-contiguous), 882520/2033664 blocks
>>> It would be great to gather some feedback on how to improve the situation (besides, of course, having no SAN outage :)).
>>
>> Something that you could consider is setting up your system to trigger a panic/reboot on a hung task timeout, or when ext4 detects an error (see the man page of tune2fs and mke2fs and the -e option for those programs). There are tradeoffs with this, but if you've lost the SAN for 15-30 minutes, the file systems are going to need to be checked anyway, and the machine will certainly not be serving. So forcing a reboot might be the best thing to do.
>
> Going to look into that! Thanks for the info.
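For reference, a rough sketch of what the panic-on-error setup suggested above might look like. The function name is hypothetical, the device path is the one from the e2fsck transcript, and the timeout values are only illustrative, not recommendations from this thread. It needs root, and `sysctl -w` settings do not survive a reboot unless they are also placed in a file under /etc/sysctl.d.

```shell
# Hypothetical helper: make the kernel panic (and then reboot) instead of
# carrying on when ext4 detects an error or a task hangs on lost storage.
# Usage (as root): panic_on_storage_errors /dev/mapper/vg01-root
panic_on_storage_errors() {
    dev="$1"
    # ext4: panic when the file system detects an error (tune2fs -e option).
    tune2fs -e panic "$dev"
    # Kernel: panic when a task stays blocked longer than the hung task timeout.
    sysctl -w kernel.hung_task_timeout_secs=120   # illustrative value
    sysctl -w kernel.hung_task_panic=1
    # Reboot automatically 10 seconds after a panic (illustrative value).
    sysctl -w kernel.panic=10
}
```

The current setting can be checked in the "Errors behavior" line of `dumpe2fs -h` output for the device.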
>>> On KVM for example there is an unlimited timeout (afaik) until the storage is back, and the VM just continues running after storage recovery.
>>
>> Well, you can adjust the SCSI timeout, if you want to give that a try....
>
> Does it have any other disadvantages? Or is it quite safe to increase the SCSI timeout?
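For reference, a minimal sketch of inspecting the per-disk SCSI command timeout through sysfs. The function name and the optional directory argument are hypothetical (the argument exists only so the sketch can be pointed at a test directory), and the 180 in the comment is an illustrative value, not one taken from this thread.

```shell
# Hypothetical helper: print "<disk> <timeout in seconds>" for each sd* disk.
# The optional first argument overrides the sysfs block directory.
show_scsi_timeouts() {
    root="${1:-/sys/block}"
    for f in "$root"/sd*/device/timeout; do
        [ -r "$f" ] || continue      # skip when no sd* disk matches
        dev=${f#"$root"/}            # strip the leading directory
        dev=${dev%%/*}               # keep only the disk name, e.g. sda
        printf '%s %s\n' "$dev" "$(cat "$f")"
    done
}

# Raising the timeout for one disk (as root, not persistent across reboots):
#   echo 180 > /sys/block/sda/device/timeout
```

Whether a larger value is safe for a given setup is exactly the open question above; the sketch only shows where the knob lives.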
>> Cheers,
>>
>> - Ted