Thread: 14 messages, 7 authors, 2010-06-04

Re: devices get kicked from RAID about once a month

From: Dan Christensen <hidden>
Date: 2010-06-04 15:56:55

Robin Hill [off-list ref] writes:

> On Fri Jun 04, 2010 at 09:30:09AM -0400, Dan Christensen wrote:
> > what would the raid layer do when it got a read error?
>
> It reconstructs the data and attempts a write.  A write failure will
> then fail the drive.

[...]

> It does exactly the same on the read timeout.  The problem is that when
> it sends the write, the drive is still busy attempting the read, so
> ignores the write request (until it's free).  This then times out as
> well, so the array assumes the drive has failed.
>
> > These questions are motivated from the following logic.  Since it is
> > generally recognized that quicker read errors (e.g. TLER) are good
> > for drives in a raid array, *increasing* the SATA timeouts seems like it
> > is going in the wrong direction.  Wouldn't it be better to have short
> > timeouts, but have the raid layer treat a timeout less seriously?
>
> As has been stated, the RAID layer doesn't have any timeouts.  It's the
> SCSI/ATA layer which is timing out the read/write and reporting a
> failure to the RAID layer.  If the timeout at this level is increased
> sufficiently then either the read will eventually succeed, or it'll
> still fail but the write will then succeed (as the drive is no longer
> busy) (or the write will fail and the disk is really failed).

Ok, I now understand the idea here.  Even if the SATA timeout were
reduced, there's nothing the raid layer can do until the drive is
ready to respond again.  So it makes sense to work around this by
increasing the SATA timeouts.

Thanks for the clarification!

Dan
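
For anyone following the thread, the sequence Robin describes can be sketched as a toy timeline model. This is only an illustration of the reasoning above, not how the kernel is implemented: the function name, the outcome names, and the numbers in the usage lines are all made up for the example, and it assumes the rewrite is issued the instant the read times out and that the drive accepts nothing until its internal recovery finishes.

```python
from enum import Enum


class Outcome(Enum):
    READ_OK = "read succeeded"
    REWRITTEN = "read failed, reconstructed data rewritten to the drive"
    KICKED = "rewrite timed out, drive failed out of the array"


def simulate(recovery_seconds, timeout_seconds, read_recovers=False):
    """Toy model of one bad-sector read on a RAID member (hypothetical helper).

    recovery_seconds: how long the drive stays busy retrying the read.
    timeout_seconds:  per-command timeout enforced by the SCSI/ATA layer.
    read_recovers:    whether the drive's retries eventually succeed.
    """
    if recovery_seconds <= timeout_seconds:
        # The drive answers before the timeout fires. Either the read
        # succeeded, or it reports an error while idle, so the RAID
        # layer's rewrite of the reconstructed data goes through.
        return Outcome.READ_OK if read_recovers else Outcome.REWRITTEN

    # The read timed out at t = timeout_seconds. The RAID layer rewrites
    # immediately, but the drive is busy until t = recovery_seconds.
    write_deadline = 2 * timeout_seconds
    if recovery_seconds <= write_deadline:
        return Outcome.REWRITTEN
    return Outcome.KICKED


# Illustrative numbers only: a drive that retries for 120 seconds.
print(simulate(120, 30).value)   # short timeout
print(simulate(120, 180).value)  # timeout longer than the drive's recovery
```

With the short timeout, both the read and the follow-up write expire while the drive is still busy, which is the "kicked about once a month" case. With the longer timeout the drive has finished by the time the write arrives, so it stays in the array.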