Thread (5 messages), 3 authors, 2007-07-24

Re: raid5:md3: read error corrected , followed by , Machine Check

From: Bill Davidsen <hidden>
Date: 2007-07-24 18:59:01

Mr. James W. Laferriere wrote:

One more thought below...
On Mon, 23 Jul 2007, Bill Davidsen wrote:
quoted
Mr. James W. Laferriere wrote:
quoted
    Hello Andrew ,

On Tue, 17 Jul 2007, Andrew Burgess wrote:
quoted
quoted
    The 'MCE's have been ongoing for some time .  I have replaced every item
in the system except the chassis & scsi backplane & power supply(750Watts) .
    Everything .  MB,cpu,memory,scsi controllers, ...
    These MCE's only happen when I am trying to build or bonnie++ test the
md3 .  It consists of (now 7+1spare) 146GB drives in the SuperMicro
SYS-6035B-8B's backplane attached to an LSI22320 .
Probably every old timer has a story about chasing a hardware problem
where changing the power supply finally fixed it. I keep spares now.

If an MCE (which means bad cpu) doesn't go away after changing the cpu
it would either have to be temperature, power or a bug in the MCE code.
What else could it be?
    Thank you for the idea of 'changing out the PS' .  So I did it a
bit differently .  I removed the system PS from the raid backplane &
dropped in a known good PS of proper wattage & re-tested .  But left
the system's PS attached to only the MB & fans .
    It doesn't appear to be power load related .  I tried rebuilding 
my 7 disk raid6 array & I got the same thing ,  MCE .
    Now the raid backplane is still in the air stream in front of 
the cpu's and memory slots .  So it could be a marginal cpu or 
memory stick .

    But here's the clincher ,  when I don't use the two drives in
front of the PS & cpu & memory slots ,  the array completes its
resync .  So I'm back to testing memory (again) .  If that passes
then I'll try the new cpu(s) route .
It does sound like a cooling problem, which does not have to imply 
the overheated parts are bad, although that may be true.
    Fyi ,  memtest86+ @ 19 passes (~ 52hours) on 8GB of memory ,  no 
errors .
quoted
Could be the total number of i/o in flight, etc.
    Hmmm ,  I didn't think of this one .
Those are a PITA to find if that's it. It doesn't sound likely to be the
power supply, but as an unlikely but cheap test, have you reseated the p/s
to backplane connectors? Oh, and checked that the system board is grounded
to the case?
quoted
Have you tried dropping two other drives?
    Well ,  no .  I dropped those two in front of the CPU as a test in
working my way up the scsi backplane(BP) trying to find a point that
worked & the last two drives in the BP just happened to be in front of
the cpu/memory air path .  The minute I put those in the MD build tree ,
within the usual time frame I get an MCE .  What I haven't tried is
what you are probably suggesting :  make sure it is the drives in the air
path by putting them in the MD build and leaving another two out .
I'll try that as well .
quoted
Can you put in a bit more fan?
    Nope ,  It's maxed out .  sounds like a 747 on take off as it is .
    It's a supermicro SYS-6035B-8B if you have the time to go look at 
the specs & pics .
What I was thinking is that some of my cases actually have room to 
install fans in front of the drives, allowing push as well as pull. 
Haven't had to do it in several years, but looking at my tall tower 
cases, I believe I could.
quoted
Read the system board and CPU temps with the "sensors" package?
    Not yet ,  I am building the needed items into the kernel now .
    Will report back (hopefully) sometime this weekend .
Keep us posted. You have picked the low-hanging fruit; when you find out
what causes this, I'm sure it will be something interesting.
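
For logging temperatures while the resync runs, something along these lines
might help once the sensor drivers are in the kernel. It is only a rough
sketch and makes assumptions not taken from this thread: that the drivers
expose the usual hwmon sysfs files (temp*_input, in millidegrees Celsius)
under /sys/class/hwmon, which may sit elsewhere on older kernels. The names
read_hwmon_temps and over_limit and the 70 C limit are made up for
illustration.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"chip/sensor": degrees_celsius} for each temp*_input file.

    Assumes the hwmon sysfs layout, where each temp*_input file holds
    an integer number of millidegrees Celsius.
    """
    temps = {}
    for input_file in sorted(Path(root).glob("hwmon*/temp*_input")):
        chip_dir = input_file.parent
        name_file = chip_dir / "name"
        # Fall back to the directory name if the driver gives no chip name.
        if name_file.exists():
            chip = name_file.read_text().strip()
        else:
            chip = chip_dir.name
        sensor = input_file.name[: -len("_input")]
        try:
            millideg = int(input_file.read_text().strip())
        except (OSError, ValueError):
            # Some sensors report errors or junk; skip them.
            continue
        temps[f"{chip}/{sensor}"] = millideg / 1000.0
    return temps


def over_limit(temps, limit_c=70.0):
    """Return only the readings at or above limit_c (hypothetical limit)."""
    return {label: deg for label, deg in temps.items() if deg >= limit_c}
```

Calling over_limit(read_hwmon_temps()) once a minute during the rebuild and
noting the last values before the MCE would show whether anything is running
hot when the two drives in the cpu/memory air path are in the array.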

-- 
bill davidsen [off-list ref]
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979