Re: RAID5 with 2 drive failure at the same time

From: Robin Hill <hidden>
Date: 2013-01-31 22:10:07

On Thu Jan 31, 2013 at 10:46:17 -0700, Chris Murphy wrote:

On Jan 31, 2013, at 6:15 AM, Christoph Nelles [off-list ref] wrote:

quoted

All drives are available again. And the second failed device reports
UREs. I will run badblocks on that device before continuing.
I attached the kernel logs of the first error and of the second error. I
hope I filtered them reasonably.

This looks like a write error, resulting in md immediately booting the
drive. There's little point in using this drive again.

Jan 28 00:23:36 router kernel: Write(16): 8a 00 00 00 00 01 36 b2 55 50 00 00 00 30 00 00
Jan 28 00:23:36 router kernel: end_request: I/O error, dev sdg, sector 5212624208
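
The CDB in the log line above can be decoded by hand as a cross-check
against the reported sector. A minimal sketch, assuming the standard
WRITE(16) layout (byte 0 is the opcode 0x8a, bytes 2-9 hold the
big-endian LBA, bytes 10-13 the transfer length in blocks);
decode_write16 is a hypothetical helper name, not an existing tool:

```shell
# Decode the LBA and transfer length from a 16-byte WRITE(16) CDB.
# Arguments: the 16 CDB bytes as two-digit hex strings, in log order.
decode_write16() {
    # $1 = opcode, $2 = flags, $3..$10 = LBA, $11..$14 = transfer length
    lba_hex="$3$4$5$6$7$8$9${10}"
    len_hex="${11}${12}${13}${14}"
    # printf accepts C-style hex constants for %d
    printf 'lba=%d blocks=%d\n' "0x$lba_hex" "0x$len_hex"
}

# Bytes copied from the Write(16) line in the kernel log
decode_write16 8a 00 00 00 00 01 36 b2 55 50 00 00 00 30 00 00
```

The decoded LBA is 5212624208, the same sector named in the
end_request line, and the failed request covered 48 blocks.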

It's definitely a write error, yes. If there's nothing further back in
the log (e.g. a read error that's caused a rewrite to take place) then
this would definitely warn against the drive, but could just be a
transient error (or a controller problem). If there is a read error
further back then I'd blame it on timeout issues, with the drive still
trying to complete the read operation while the kernel's timed out and
trying to send a write.

What does smartctl -a return for this drive?

quoted

Exactly. I am running badblocks on that device. SMART reports one
"Pending Sector Count" :(

I'm unclear on the efficacy of badblocks for testing. I'd use smartctl
-t long and then -a to see if there are sector problems and at what
LBA; and for removing bad blocks (force a remap) I'd use either dd
zeros with e.g. bs=1M, or I'd use ATA Secure Erase which is faster.

I don't usually bother with read tests - as you say, they're not
terribly useful. If the data's useful then just use ddrescue to get what
you can, otherwise just write-test it. I usually do a full destructive
badblocks test (I've found cases where zeros write fine but other
patterns fail), followed by a long SMART test.
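
The write-test sequence described above could be scripted roughly as
follows. This is only a sketch under assumptions: /dev/sdX is a
placeholder device, burn_in and run are hypothetical helper names, and
the flag choices (-w destructive write mode, -s progress, -v verbose)
are one reasonable selection rather than the exact commands used here.
It prints the commands by default instead of running them, because
badblocks -w destroys all data on the device:

```shell
# Dry run by default: commands are echoed, not executed.
# Set RUN=1 only after double-checking the device name.
run() {
    if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "+ $*"; fi
}

burn_in() {
    dev="$1"
    # Destructive test: writes several patterns to every block and reads them back
    run badblocks -wsv "$dev"
    # Start the drive's extended self-test (it runs on the drive in the background)
    run smartctl -t long "$dev"
    # After the self-test has finished: attributes and self-test log
    run smartctl -a "$dev"
}

burn_in /dev/sdX
```

Note that smartctl -t long returns immediately, so in a real run the
final smartctl -a is only meaningful once the self-test has completed.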

If you use the badblocks map when formatting a drive, e.g. using
mkfs.ext4 -c, then it would allow you to use this disk but not in
RAID. On top of raid, md gets the write error before the file system
does, and boots the drive out of the array. Or on read error attempts
to correct it. And even as a standalone drive do you really want to
use a drive that can't remap future bad sectors?

Not a chance I'd use it if it's actually failing to remap bad sectors,
no. Only had that with one drive so far though (out of several hundred),
most get failed out after getting more than a handful of remapped
sectors.

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        [off-list ref] |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |

