Re: [PATCH 2/3] ext4: introduce ext4_error_remove_page

From: Theodore Ts'o <tytso@mit.edu>
Date: 2012-10-26 18:47:01
Also in: linux-mm, lkml

On Fri, Oct 26, 2012 at 04:55:01PM +0000, Luck, Tony wrote:

I think that we know that the file *is* corrupted, not just "potentially".
We probably know the location of the corruption to cache-line granularity.
Perhaps better on systems where we have access to ecc syndrome bits,
perhaps worse ... we do have some errors where the low bits of the address
are not known.

Well, it's at least *possible* that it was only the ECC bits that got
flipped.  :-) Not likely, I'll grant!  (Or does the motherboard zero
out the entire cache-line on a hard ECC failure?)

I'm in total agreement that forcing a reboot or fsck is unhelpful here.

But what should we do?  We don't want to let the error be propagated. That
could cause a cascade of more failures as applications make bad decisions
based on the corrupted data.

Perhaps we could ask the filesystem to move the file to a top-level
"corrupted" directory (analogous to "lost+found") with some attached
metadata to help recovery tools know where the file came from, and the
range of corrupted bytes in the file? We'd also need to invalidate existing
open file descriptors (or less damaging - flag them to avoid the corrupted
area??). Whatever we do, it needs to be persistent across a reboot ... the
lost bits are not going to magically heal themselves.

Well, we could set a new attribute bit on the file which indicates
that the file has been corrupted, and this could cause any attempts to
open the file to return some error until the bit has been cleared.
This would persist across reboots.  The only problem is that system
administrators might get very confused (at least at first, when they
first run a kernel or a distribution which has this feature enabled).
Application programs could also get very confused when any attempt to
open or read from a file suddenly returned some new error code (EIO,
or should we designate a new errno code for this purpose, so there is
a better indication of what the heck was going on?)

Also, if we just log the message in dmesg, if the system administrator
doesn't find the "this file is corrupted" bit right away, they might
not be able to determine which part of the file was corrupted.  How
important is this?  If the file system supports extended attributes,
should we attempt to attach a new extended attribute with information
about the ECC failure?

I'm not sure it's worth it to go to these extents, but I could imagine
some customers wanting to have this sort of information.  Do we know
what their "nice to have" / "must have" requirements might be?

     	       	       	       	    	 - Ted

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help