Re: best common practice in case of degraded array with read errors

From: Robin Hill <hidden>
Date: 2009-11-16 22:07:22

On Mon Nov 16, 2009 at 10:32:30PM +0100, Mikael Abrahamsson wrote:

Hello.

I have a 6 drive raid5. One of the drives failed on me (totally), and when 
I replaced it (--add a new working drive) several sectors on another 
drive gave me UNC errors, which made md kick that drive as well and left 
me with a non-working array (with only 4 drives).

Are you running any regular array checks?  These should verify the
readability of the drives (and the consistency of the parity).  This type
of failure is also why I've switched to RAID6 for most of my arrays.
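
For reference, a minimal sketch of how such a check can be requested on
Linux md: writing "check" to the array's sync_action file in sysfs makes md
read every stripe, and mismatch_cnt holds the mismatch count afterwards.
The array name md0, the helper names and the sysfs_root parameter (there
only so the functions can be tried against a scratch directory) are
assumptions for illustration, not anything from the original report.

```python
from pathlib import Path


def md_attr(array: str, attr: str, sysfs_root: str = "/sys") -> Path:
    """Path of an md attribute file, e.g. /sys/block/md0/md/sync_action."""
    return Path(sysfs_root) / "block" / array / "md" / attr


def start_check(array: str = "md0", sysfs_root: str = "/sys") -> bool:
    """Request a read-only consistency check; needs root on a real system.

    Returns False without writing anything if the array is already busy
    (resync, recovery or another check in progress).
    """
    action = md_attr(array, "sync_action", sysfs_root)
    if action.read_text().strip() != "idle":
        return False
    action.write_text("check\n")
    return True


def mismatch_count(array: str = "md0", sysfs_root: str = "/sys") -> int:
    """Mismatch count reported by md for the most recent check."""
    return int(md_attr(array, "mismatch_cnt", sysfs_root).read_text())
```

Some distributions already schedule a check like this periodically, so it
is worth looking for an existing job before adding one.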

What is the best common practice to handle this scenario? Right now I'm 
copying the drive with read errors to a (hopefully) working drive with 
dd_rescue, and then I plan to --assemble --force the array to get 5 working 
drives (with a few zeroed sectors where I guess I'll have corrupted 
files, hopefully no important metadata), and then I plan to --add a 6th 
drive and have everything sync up and be back to "normal".

Is there a better way? I don't really understand why kicking drives out of 
the array when there aren't enough of them to keep going makes sense, is 
there some rationale I'm missing?

Technically, the best practice is probably to recreate the array from
scratch (replacing any failed drives) and restore from backup.  Short of
that, your approach would seem to be the best option.  I've done this in
the past, though I ended up restoring pretty much everything from backup
anyway (as I had no other way of verifying the integrity of the data).
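
To illustrate why the integrity is hard to verify afterwards, here is a
small sketch of RAID5 parity arithmetic for a single stripe.  It is a
simplification (one stripe, 4-byte chunks, parity always on the last
drive, all values made up), not a model of md's actual on-disk layout.
It shows that once an unreadable chunk has been imaged as zeros, the
rebuild of the failed drive's chunk in that stripe silently comes out
wrong as well.

```python
from functools import reduce


def xor_chunks(chunks):
    """Byte-wise XOR of equally sized chunks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))


# One stripe of a 6-drive RAID5: five data chunks plus their parity.
data = [bytes([i + 1] * 4) for i in range(5)]
parity = xor_chunks(data)

# Drive 0 has failed completely; its chunk is rebuilt from the other five.
survivors = data[1:] + [parity]
rebuilt_ok = xor_chunks(survivors)          # equals data[0]

# Same rebuild, but drive 1's chunk was unreadable and imaged as zeros.
survivors_zeroed = [bytes(4)] + data[2:] + [parity]
rebuilt_bad = xor_chunks(survivors_zeroed)  # differs from data[0]

print(rebuilt_ok == data[0], rebuilt_bad == data[0])
```

In the second case both the zeroed chunk and the rebuilt chunk are wrong,
and the freshly synced stripe is self-consistent, so a later array check
has nothing to complain about.  Only a comparison against a backup or
file-level checksums would catch it.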

I've also heard recommendations to write to the bad sectors on the 
existing drive, but that scares me as well in case I write to the wrong 
place, which is why I went the dd_rescue route (I'm also hoping that it'll 
retry a bit more and might be able to read the bad blocks...)

I'd leave that to later.  Once you've imaged the disk you can try
SMART tests, read/write tests, etc. to verify whether there's actually a
physical problem or not (and how much of one - a bad block or two might
be acceptable, but a lot of them would point to a failing disk).  Until
then you're better off putting as little trust in the disk as possible.
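
As a rough idea of what a read test amounts to, here is a minimal
read-only scan that walks a device or image file in fixed-size pieces and
records the offsets where the read fails.  The function name and chunk
size are arbitrary choices for this sketch.  On a real disk the page cache
can hide errors, so dedicated tools (a long SMART self-test, badblocks)
are the better choice; this only shows the principle.

```python
import os


def scan_readable(path, chunk_size=1 << 20):
    """Read `path` sequentially; return (bytes_read, failed_offsets)."""
    bad = []
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size of a file or block device.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(chunk_size, size - offset)
            try:
                got = len(os.pread(fd, want, offset))
            except OSError:
                # Typically EIO when a sector cannot be read.
                bad.append(offset)
                got = 0
            total += got
            # Always step by the requested size so a failure is skipped.
            offset += want
    finally:
        os.close(fd)
    return total, bad
```

Called as scan_readable("/dev/sdX") (placeholder device name, root
required), an empty second element means every piece was readable.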

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        [off-list ref] |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |
