Re: RAID5 with 2 drive failure at the same time

From: Robin Hill <hidden>
Date: 2013-02-01 13:34:55

On Thu Jan 31, 2013 at 03:40:00PM -0700, Chris Murphy wrote:

On Jan 31, 2013, at 3:10 PM, Robin Hill [off-list ref] wrote:

If there is a read error
further back then I'd blame it on timeout issues, with the drive still
trying to complete the read operation while the kernel's timed out and
trying to send a write.
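
A rough way to check for that kind of timeout mismatch is sketched below. It assumes the kernel's per-device command timeout is read from /sys/block/<dev>/device/timeout (in seconds) and that the drive's SCT ERC read limit has already been obtained separately (for example from `smartctl -l scterc`, which reports it in tenths of a second). The function names are made up for illustration, and treating a disabled ERC as "may outlast the kernel" is an assumption, not something the logs confirm.

```python
from pathlib import Path
from typing import Optional


def kernel_timeout_seconds(dev: str, sysfs_root: str = "/sys/block") -> int:
    """Read the kernel's command timeout for a block device, in seconds."""
    return int(Path(sysfs_root, dev, "device", "timeout").read_text().strip())


def drive_may_outlast_kernel(erc_deciseconds: Optional[int],
                             kernel_timeout_s: int) -> bool:
    """Return True if drive-internal error recovery could run at least as
    long as the kernel is willing to wait for the command.

    erc_deciseconds: SCT ERC read limit in tenths of a second, or None if
    ERC is disabled or unsupported (assumed here to mean recovery time is
    not bounded by the drive).
    """
    if erc_deciseconds is None:
        return True
    return erc_deciseconds / 10 >= kernel_timeout_s


if __name__ == "__main__":
    # Hypothetical device name; substitute the member being investigated.
    dev = "sdg"
    try:
        timeout = kernel_timeout_seconds(dev)
    except (OSError, ValueError):
        print(f"could not read kernel timeout for {dev}")
    else:
        # None stands in for "ERC disabled"; use the real value if known.
        print(dev, "at risk:", drive_may_outlast_kernel(None, timeout))
```

If the check comes back True, the usual options are to lower the drive's ERC limit or to raise the kernel timeout for that device.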

I think we need the whole log for the time before the start of the
error1.txt file provided previously. I'd also like to know which
/dev/ device was the first to have a problem and so instigated the
rebuild, whether the file system was mounted rw during the rebuild,
and whether any writes were done at all. If so, that probably nixes
--assume-clean. If it was rebuilding and not written to from the file
system, the disk being rebuilt shouldn't actually be out of sync with
the array state.

The timestamps on the logs show that sdg was the first to have a
problem. It'd also be useful to know whether sdg has been rewritten at
all since then (i.e. whether the testing was destructive or not), and
whether or not the array was written to at all since the failure of sdg.

The disk that needs spot sector repairs is the one with UREs, I think
that's sdj1. If that disk is dd'd to another disk, the new disk won't
produce UREs for sectors missing data, and the chunks comprised of
those sectors won't get rebuilt by md.
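
To illustrate the point, here is a toy model of one RAID5 stripe. It is not md's actual code. It assumes the copy fills unreadable sectors with zeros (as `dd conv=noerror,sync` does), and the chunk contents and helper names are invented for the example. A read error tells the array to reconstruct the chunk from the other members; a zero-filled chunk reads back "fine" and is passed through untouched.

```python
from functools import reduce
from typing import List, Optional


def xor_chunks(chunks: List[bytes]) -> bytes:
    """XOR equal-length chunks together byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))


def read_stripe(members: List[Optional[bytes]]) -> List[bytes]:
    """Read one stripe (data chunks plus parity). None marks a read error.

    A single failed chunk is rebuilt from the XOR of the others; chunks
    that read without error are returned as-is, whatever they contain.
    """
    failed = [i for i, chunk in enumerate(members) if chunk is None]
    if len(failed) > 1:
        raise IOError("more than one unreadable chunk in stripe")
    result = list(members)
    if failed:
        result[failed[0]] = xor_chunks([c for c in members if c is not None])
    return result


# Made-up stripe: three data chunks and their parity.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_chunks([d0, d1, d2])

# Original disk: the URE is reported, so d1 is reconstructed correctly.
with_ure = read_stripe([d0, None, d2, parity])

# Copied disk: the bad sectors are now readable zeros, so nothing triggers
# a rebuild and the wrong data is returned silently.
after_copy = read_stripe([d0, bytes(4), d2, parity])

print(with_ure[1] == d1)    # reconstructed chunk matches the original
print(after_copy[1] == d1)  # zero-filled chunk does not
```

A parity check on the copied stripe would still show a mismatch, but on its own that doesn't say which chunk is the bad one.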

So the disk to possibly dd to another is the one with the write error,
sdg1. But only if the idea is to not use --assume-clean. That way a
reassemble can rebuild, and not encounter another write error on that
drive.

Yes, if sdg still contains valid array data (and the array wasn't
written since then) then it would definitely make more sense to recreate
the array using it, leaving sdj out for now. Doing so avoids the issues
with the unreadable blocks on sdj, though it will require more work
checking mdadm versions and data offset values first.

Not a chance I'd use it if it's actually failing to remap bad sectors,
no. Only had that with one drive so far though (out of several hundred),
most get failed out after getting more than a handful of remapped
sectors.

I think I see a use case for badblocks destructive writes if the disk
doesn't support enhanced secure erase (which writes a pattern not just
zeros). Or on laptops where it's not possible to get a disk to reset
on sleep, allowing it to be unfrozen for the purposes of using secure
erase. But if available, secure erase is faster and wipes all sectors
even those without LBAs. For sure with SSDs it's what should be used.

I prefer badblocks myself - I can see exactly what it's doing and what
errors are seen. With secure erase you're dependent on the firmware
internals to tell you what's actually going on (and, depending on the
nature of the errors you're getting, this may already be suspect).

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        [off-list ref] |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |
