Thread: 9 messages, 3 authors, 2014-01-15

Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock

From: Phil Turmel <hidden>
Date: 2014-01-15 13:35:22

On 01/15/2014 07:50 AM, Wilson Jonathan wrote:
> On Tue, 2014-01-14 at 13:43 -0500, Phil Turmel wrote:
>> On 01/14/2014 12:47 PM, Wilson Jonathan wrote:
>>
>> [trim /]
>>> I understand the issue of "timeout" on drives that might perform long
>>> error checking, which then causes mdadm, via the device (block?) driver
>>> issuing a timeout, to kick the drive. In this instance you allow
>>> some time for a drive to try and fix things at the expense of a hung
>>> array for a longer period of time.
>>>
>>> I also understand that with scterc the drive gives up (in effect timing
>>> itself out) when it hits the 7-second mark, or thereabouts, and
>>> subsequently mdadm kicks the drive out. In this specific instance the
>>> idea is to kill a drive quickly so that the RAID doesn't hang longer
>>> than a few seconds.
>> No.  The intent is to fail the read without failing the controller channel.
>
> Ahh, thanks for the clarification... I hadn't realised that instead of
> the drive returning an "Error, I can't get the data, I'm dead in the
> water" message, it instead returned a "Warning, I can't get the data, you
> deal with it and get back to me, I'm still working" kind of affair.

Let me emphasize one point here:  while a drive is performing error
recovery, it *stops talking to the controller*.  The drive isn't
replying with a warning as you suggest--it isn't replying *at all*.
Modern desktop drives try *very hard* to recover bad sectors, under the
assumption that they have the only copy of the data.  Typically, they'll
work at it for two *minutes* or more.

The Linux kernel driver will give up after 30 seconds and try to reset
the drive.  The drive firmware ignores the reset, possibly multiple
times, until it is done retrying the original read.  When it does
finally reset, it is too late--it's been bumped from the array.
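A toy model of the timing race described above may make it concrete. It is a simplified sketch, not how the kernel or md actually schedules anything: the 30 s kernel timeout, the roughly 7 s ERC limit, and the two-minute desktop recovery time are the figures quoted in this thread, while the function name `read_outcome` and its outcome strings are hypothetical.

```python
from typing import Optional

# Illustrative figures from this thread; real values vary by drive and kernel.
KERNEL_TIMEOUT_S = 30.0  # default command timeout of the Linux driver
ERC_LIMIT_S = 7.0        # typical SCT ERC limit ("7 seconds, or thereabouts")


def read_outcome(recovery_s: float, erc_limit_s: Optional[float],
                 kernel_timeout_s: float = KERNEL_TIMEOUT_S) -> str:
    """Classify (very roughly) what md sees when a drive hits a bad sector.

    recovery_s: how long the firmware would keep retrying if left alone.
    erc_limit_s: the drive's error-recovery cap, or None if ERC is off or
    unsupported (typical of desktop drives).
    """
    # With ERC enabled, the drive stops retrying at the cap and reports an
    # error; without it, the drive stays silent for the full recovery time.
    busy_s = recovery_s if erc_limit_s is None else min(recovery_s, erc_limit_s)
    if busy_s < kernel_timeout_s:
        # The drive answers in time: md gets a plain read error and can
        # reconstruct the data from redundancy and rewrite the sector.
        return "read error"
    # The drive is silent past the kernel timeout: the driver tries to reset
    # it, the request fails, and md drops the member from the array.
    return "dropped"


if __name__ == "__main__":
    print(read_outcome(120.0, None))         # desktop drive, no ERC -> dropped
    print(read_outcome(120.0, ERC_LIMIT_S))  # same drive, ERC at 7 s -> read error
```

The same model shows why raising the kernel timeout above the drive's worst-case recovery time (e.g. `kernel_timeout_s=180.0`) also turns the "dropped" case into a read error, at the cost of a longer stall.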

But the drive didn't really fail, leading to:
>> When you, the admin, get around to looking, the drive is idle but
>> apparently fine.  (It gains a "pending" sector, which stays until the
>> drive is told to write over that spot.)
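The pending sector shows up in SMART attribute 197, Current_Pending_Sector. As a minimal sketch, assuming the usual column layout of `smartctl -A` output (the sample table below is made up for illustration, and `pending_sectors` is a hypothetical helper), the raw count can be pulled out like this:

```python
from typing import Optional


def pending_sectors(smartctl_attrs: str) -> Optional[int]:
    """Return the raw Current_Pending_Sector count from `smartctl -A` text.

    Returns None if the attribute is absent (some drives don't report it).
    """
    for line in smartctl_attrs.splitlines():
        fields = line.split()
        # Attribute rows: ID, name, flag, value, worst, thresh, type,
        # updated, when_failed, raw_value (raw_value is the 10th field).
        if len(fields) >= 10 and fields[1] == "Current_Pending_Sector":
            return int(fields[9])
    return None


# Hypothetical sample output, trimmed to two attribute rows.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
"""

if __name__ == "__main__":
    print(pending_sectors(SAMPLE))  # 1
```

Once the spot is overwritten, the drive either finds it writable again or remaps it, so the count should fall back.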

>> HTH,
>
> It does, thanks for the information :-)

You are welcome.

Phil