Re: recovering RAID5 from multiple disk failures

From: Phil Turmel <hidden>
Date: 2013-02-03 00:23:32

On 02/02/2013 06:08 PM, Chris Murphy wrote:

On Feb 2, 2013, at 2:56 PM, Michael Ritzert [off-list ref] wrote:

Chris Murphy [off-list ref] wrote:

Nevertheless, over an hour and a half is a long time if the file 
system were being updated at all. There'd definitely be 
data/parity mismatches for disk1.

After disk1 failed, the only write access should have been
metadata update when the filesystem was mounted.

This is significant.

Was it mounted ro?

I only read data from the filesystem thereafter. So only atime 
changes are to be expected, there, and only for a small number of 
files that I could capture before disk3 failed. I know which files 
are affected, and could leave them alone.

Even for a small number of files there could be dozens or hundreds
of chunks altered. I think conservatively you have to consider disk
1 out and mount in degraded mode.

If disk 1 is assumed to be useless, meaning you force-assemble the
array in degraded mode, then a URE or Linux SCSI layer timeout must
be avoided or the array as a whole fails. Every sector is
needed. So what do you think about raising the Linux SCSI layer
timeout to maybe 2 minutes, and leaving the remaining drives'
SCT ERC alone so that they don't time out sooner, but rather go
into whatever deep recovery they have, in the hope that those bad
sectors can be read?

echo 120 >/sys/block/sdX/device/timeout

I just tried that, but I couldn't see any effect. The error rate 
coming in is much higher than 1 every two minutes.

This timeout is not about error rate. And what the value should be
depends on context. In normal operation you want the disk's error
recovery to be short, so that the disk produces a bona fide URE, not a
SCSI layer timeout error. That way md will correct the bad sector.
That's what probably wasn't happening in your case, which allowed
bad sectors to accumulate until it was a real problem.

If you try to recover from the degraded array, this is the correct approach.

But now, for the cloning process, you want the disk error timeout to
be long (or disabled) so that the disk has as long as possible to use
ECC to recover each of these problematic sectors. But this also
means setting the SCSI layer timeout at least 1 second longer
than the longest recovery time for the drives, so that the SCSI layer
timeout doesn't stop sector recovery during cloning. Now maybe the
disk still won't be able to recover all data from these bad sectors,
but it's your best shot IMO.
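
For the cloning phase described above, a minimal sketch could look like
the following. It assumes smartmontools is installed and that the drives
support SCT ERC. The helper name set_scsi_timeout, the SYSFS_ROOT
variable, the 180-second value and the device names are illustrative
assumptions, not taken from this thread.

```shell
#!/bin/sh
# Sketch only: raise the kernel's SCSI command timeout for drives being cloned.
# SYSFS_ROOT is a hypothetical override so the helper can be tried against
# a scratch directory instead of the real /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

set_scsi_timeout() {
    # $1 = timeout in seconds, remaining args = kernel device names (e.g. sdb)
    secs="$1"; shift
    for dev in "$@"; do
        f="$SYSFS_ROOT/block/$dev/device/timeout"
        if [ -w "$f" ]; then
            echo "$secs" > "$f"
            echo "$dev: timeout now $(cat "$f")s"
        else
            echo "$dev: cannot write $f" >&2
            return 1
        fi
    done
}

# Example (as root); sdb and sdc are placeholders:
#   smartctl -l scterc,0,0 /dev/sdb   # disable SCT ERC: drive retries as long as it can
#   set_scsi_timeout 180 sdb sdc      # kernel waits longer than the drive's recovery
```

The kernel-side value only matters relative to the drive's own recovery
time: if it is shorter, the SCSI layer gives up before the drive does.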

That is for the array assembled degraded (disk1 left out).

When I assemble the array, I will have all new disks (with good
SMART self-tests...), so I wouldn't expect timeouts. Instead, junk
data will be returned from the sectors in question¹. How will md
react to that?

Well yeah, the new drives won't report UREs. So with any mismatch
between data and parity chunks there's an ambiguity as to which is
correct. Without a URE, md doesn't know whether the data chunk
is right or wrong with RAID 5.

Bingo.  Working from the copies guarantees you won't have correct data
where the UREs are.  (The copies are very good to have, of course.)

Phil may disagree, and I have to defer to his experience in this,
but I think the most conservative and best shot you have at getting
the 20GB you want off the array is this:

I do disagree.

The above, combined with:

I do know where the bad sectors are from the ddrescue report. We are
talking about less than 50 kB of bad data on disk1. Unfortunately, disk3
is worse, but there is no sector that is bad on both disks.

Leads me to recommend "mdadm --create --assume-clean" using the original
drives, taking care to specify the devices in the proper order (per
their "Raid Device" number in the --examine reports).  I still haven't
seen any data that definitively links specific serial numbers to
specific raid device numbers.  Please do that.
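
Because device order is what makes or breaks the re-create, it can help
to generate the command from explicit role:device pairs and read it back
before running anything. The sketch below only prints a command. The
helper name build_create_cmd and all device names are hypothetical. It
also leaves out options such as --metadata, --chunk and --layout, which
would have to match the original array as shown in the --examine reports.

```shell
#!/bin/sh
# Sketch only: print (do not run) an "mdadm --create --assume-clean" command
# with the members ordered by raid device number.
# Arguments after the array name and level are ROLE:DEVICE pairs; a member
# that should be left out can be given as ROLE:missing.
build_create_cmd() {
    md="$1"; level="$2"; shift 2
    n=$#
    # Sort numerically on the role, then keep only the device part.
    ordered=$(printf '%s\n' "$@" | sort -t: -k1,1n | cut -d: -f2- | tr '\n' ' ')
    echo "mdadm --create $md --assume-clean --level=$level --raid-devices=$n ${ordered% }"
}

# Placeholder values; the roles come from each member's --examine report.
build_create_cmd /dev/md0 5 2:/dev/sdc1 0:/dev/sda1 3:/dev/sdd1 1:/dev/sdb1
```

Recording each drive's serial number next to its role before building
the pairs gives the serial-to-role mapping asked for above.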

After re-creating the array, and setting all the drive timeouts to 7.0
seconds, issue a "check" scrub:

echo "check" >/sys/block/md0/md/sync_action

This should clean up the few pending sectors on disk #1 by
reconstruction from the others, and may very well do the same for disk #3.
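
The scrub step above could be wrapped roughly as follows. This is a
sketch: it assumes the array is md0 and that the drives accept SCT ERC
settings through smartctl. The function names and the SYSFS_ROOT
variable are made up for illustration.

```shell
#!/bin/sh
# Sketch only: start an md "check" scrub and report its state.
# SYSFS_ROOT is a hypothetical override for trying this on a scratch directory.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

start_check() {
    # $1 = md device name, e.g. md0
    echo check > "$SYSFS_ROOT/block/$1/md/sync_action"
}

scrub_status() {
    # Prints the current sync action and the mismatch counter for array $1.
    d="$SYSFS_ROOT/block/$1/md"
    echo "action=$(cat "$d/sync_action") mismatch_cnt=$(cat "$d/mismatch_cnt")"
}

# Example (as root); device names are placeholders:
#   for dev in sda sdb sdc sdd; do
#       smartctl -l scterc,70,70 "/dev/$dev"   # units of 100 ms, so 70 = 7.0 s
#   done
#   start_check md0
#   scrub_status md0    # repeat until action=idle; /proc/mdstat shows progress
```

Watching the kernel log while the scrub runs shows whether read errors
are being corrected or a member is being kicked out.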

If disk #3 gets kicked out at this point, assemble in degraded mode with
disk #2, #4, and a fresh copy of disk #1 (picking up the new superblock
and any fixes during the partial scrub).  Then "--add" a spare (wiped)
disk and let the array rebuild.
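
The fallback path can be written down as a dry run before touching the
disks. The sketch below prints the commands rather than executing them.
The helper name plan_degraded_rebuild and every device name are
placeholders, and stopping the array first is an assumption about the
state it would be left in after disk #3 is kicked.

```shell
#!/bin/sh
# Sketch only: print (do not run) the commands for the degraded fallback.
# $1 = array, $2 = wiped spare, remaining args = the three members to assemble
# (disk #2, disk #4 and the fresh copy of disk #1).
plan_degraded_rebuild() {
    md="$1"; spare="$2"; shift 2
    echo "mdadm --stop $md"
    # --run starts the array even though one member is absent.
    # Assumption: --force may additionally be needed if event counts differ.
    echo "mdadm --assemble --run $md $*"
    echo "mdadm $md --add $spare"
}

# Placeholder device names only.
plan_degraded_rebuild /dev/md0 /dev/sde1 /dev/sdb1 /dev/sdd1 /dev/sdf1
```

Rebuild progress after the --add is visible in /proc/mdstat.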

And grab your data.

Phil.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
