Thread (25 messages), 7 authors, 2007-11-02

Re: Implementing low level timeouts within MD

From: Doug Ledford <hidden>
Date: 2007-10-27 23:55:29

On Sat, 2007-10-27 at 16:46 -0500, Alberto Alonso wrote:
> On Fri, 2007-10-26 at 15:00 -0400, Doug Ledford wrote:
> > This isn't an md problem, this is a low level disk driver problem.  Yell
> > at the author of the disk driver in question.  If that driver doesn't
> > time things out and return errors up the stack in a reasonable time,
> > then it's broken.  Md should not, and realistically can not, take the
> > place of a properly written low level driver.
>
> I am not arguing whether or not MD is at fault, I know it isn't.
>
> Regardless of the fact that it is not MD's fault, it does make
> software raid an invalid choice when combined with those drivers. A
> single disk failure within a RAID5 array bringing a file server down
> is not a valid option under most situations.

Without knowing the exact controller you have and driver you use, I
certainly can't tell the situation.  However, I will note that there are
times when no matter how well the driver is written, the wrong type of
drive failure *will* take down the entire machine.  For example, on an
SPI SCSI bus, a single drive failure that involves a blown terminator
will cause the electrical signaling on the bus to go dead no matter what
the driver does to try and work around it.

> I wasn't even asking as to whether or not it should, I was asking if
> it could.

It could, but without careful control of timeouts for differing types of
devices, you could end up making the software raid less reliable instead
of more reliable overall.
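
To illustrate the point about differing device types, here is a minimal sketch in Python. It is not md code and not any driver's API: the device class labels, the timeout values, and both function names are hypothetical, chosen only to show how one fixed timeout can fail requests that a per-class timeout would still treat as healthy.

```python
# Hypothetical per-class timeouts in seconds. The labels and values are
# illustrative assumptions, not taken from md or from any particular driver.
CLASS_TIMEOUT_S = {"local_disk": 30.0, "usb_disk": 60.0, "network_block": 120.0}


def timed_out(device_class, elapsed_s, default_s=30.0):
    """Return True if a request has run longer than its class allows."""
    return elapsed_s > CLASS_TIMEOUT_S.get(device_class, default_s)


def spurious_failures(requests, single_timeout_s):
    """Return the requests that one global timeout would fail even though
    the per-class table still considers them within their limit."""
    return [
        (cls, elapsed)
        for cls, elapsed in requests
        if elapsed > single_timeout_s and not timed_out(cls, elapsed)
    ]


# Each request is (device class, seconds elapsed so far).
requests = [("local_disk", 5.0), ("network_block", 75.0), ("local_disk", 45.0)]
print(spurious_failures(requests, single_timeout_s=30.0))
```

With these assumed numbers, a single 30 second timeout would kick the slow but still healthy network device out of the array, while the 45 second local disk request is a real timeout under either policy.
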

> Should is a relative term, could is not. If the MD code
> can not cope with poorly written drivers then a list of valid drivers
> and cards would be nice to have (that's why I posted my ... when it
> works and when it doesn't, I was trying to come up with such a list).

Generally speaking, most modern drivers will work well.  It's easier to
maintain a list of known bad drivers than known good drivers.

> I only got 1 answer with brand specific information to figure out when
> it works and when it doesn't work. My recent experience is that too
> many drivers seem to have the problem so software raid is no longer
> an option for any new systems that I build, and as time and money
> permits I'll be switching all my legacy servers to hardware/firmware
> raid.

Be careful which hardware raid you choose, as in the past several brands
have been known to have the exact same problem you are having with
software raid, so you may not end up buying yourself anything.  (I'm not
naming names because it's been long enough since I paid attention to
hardware raid driver issues that the issues I knew of could have been
solved by now and I don't want to improperly accuse a currently well
working driver of being broken)

-- 
Doug Ledford [off-list ref]
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband
