Re: Recovering from an URE on a RAID5 rebuild/resize

From: Phil Turmel <hidden>
Date: 2013-01-26 23:40:54

On 01/25/2013 06:14 AM, Roman Mamedov wrote:

Hello,

Recently there has been some talk on this list about the probability of seeing
an URE during a RAID5 rebuild on modern large (e.g. 2TB) drives.
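
For a rough sense of scale, the usual back-of-the-envelope estimate can be
sketched as below. The error rate of 1 per 10^14 bits read is an assumption (a
commonly quoted consumer-drive spec figure, not something stated in this
thread), as are the 4-drive array and the idea that bit errors are
independent. Real drives tend to fail in clusters, so treat the result as
illustrative only.

```python
import math

def p_at_least_one_ure(bytes_read, ber=1e-14):
    """Chance of one or more UREs while reading bytes_read bytes.

    ber is the assumed unrecoverable error rate per bit read; errors are
    modelled as independent, which is a simplification.
    """
    bits = bytes_read * 8
    # 1 - (1 - ber)**bits, written with log1p/expm1 to avoid rounding loss
    return -math.expm1(bits * math.log1p(-ber))

# Hypothetical case: 4 x 2TB RAID5, so a rebuild reads 3 surviving members fully.
surviving_members = 3
drive_bytes = 2e12
p = p_at_least_one_ure(surviving_members * drive_bytes)
print(f"{p:.1%}")  # about 38% under these assumptions
```

With a 1 per 10^15 figure instead (ber=1e-15), the same call gives a much
smaller number, so the answer depends heavily on which spec value is believed.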

I would like to ask for some advice on the best way to proceed when such an
URE is encountered. This is mostly theoretical; there is no real situation at
hand at the moment.

As I understand, a RAID5 that is being resized or rebuilt, has no redundancy;
it is essentially as reliable as a RAID0 of total members-1, or even less.

So on an unreadable sector that mdadm needs to read (because it has no
redundancy to recover it from), mdadm will:

  - mark the corresponding array member as "failed";
  - mark the one that was being rebuilt/resized onto as "spare";
  - and the whole array as down and "not enough members to start the array".

No.  On modern kernels, you have to experience multiple read errors in a
short time (a compile-time constant of 20, if I recall correctly) before
the device is marked failed.  So for a single unrecoverable sector, or a
small number of them, the error will be passed to the filesystem, and
possibly on to the application.

Let's assume only a couple of sectors on that member were unreadable, and then
their readability was restored (either by drive replacement or by overwriting
them to make the drive remap), and I would be okay with losing data that was
in those sectors.

If you are in this situation, rewriting the files that contain the bad
sectors is an option, if the sectors are in a file at all.  If they hold
filesystem metadata, you might lose more.

What would be the best way to proceed from there?

1) With the array stopped, dd_rescue the array members onto new drives.
Allow bad sectors to be replaced with zeroes, possibly keep a record of
the bad sector locations.  Set the original drives aside for later
forensics, if needed.

2) Start up with the new members.  Add another new drive and allow the
rebuild to finish.  fsck the filesystem and assess the corruption,
possibly rewriting files identified with the bad block data from (1).

3) Take one of the original drives, zero its superblock, add it to the
array, and reshape to raid6.

4) Use regular "check" scrubs with raid6 to never be in the situation again.
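
To make the idea in step (1) concrete, here is a minimal sketch of a copy that
zero-fills unreadable blocks and records where they were. It is not a
replacement for dd_rescue, which also retries with smaller block sizes; the
function name rescue_copy, the 4096-byte block size, and the read_block hook
(there only so a failure can be simulated) are all hypothetical.

```python
import os

def rescue_copy(src_path, dst_path, block_size=4096, read_block=None):
    """Copy src to dst block by block, writing zeroes for unreadable blocks.

    Returns the byte offsets of the blocks that could not be read.
    """
    if read_block is None:
        # Default reader: positional read straight from the file descriptor.
        def read_block(fd, offset, length):
            return os.pread(fd, length, offset)

    bad_offsets = []
    src = os.open(src_path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size for regular files and block devices.
        size = os.lseek(src, 0, os.SEEK_END)
        with open(dst_path, "wb") as dst:
            for offset in range(0, size, block_size):
                length = min(block_size, size - offset)
                try:
                    data = read_block(src, offset, length)
                except OSError:
                    data = b""
                if len(data) != length:
                    # Unreadable (or short) block: record it and zero-fill.
                    bad_offsets.append(offset)
                    data = b"\0" * length
                dst.write(data)
    finally:
        os.close(src)
    return bad_offsets
```

The returned offsets are relative to the start of the member device, so they
still have to be translated to array and filesystem offsets before they can be
matched to files in step (2).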

HTH,

Phil
