Re: RAID 6 Failure follow up

From: Andrew Dunn <hidden>
Date: 2009-11-08 14:30:21

storrgie@ALEXANDRIA:~$ dmesg | grep sdi
[   31.019358] sd 11:0:0:0: [sdi] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
[   31.032233] sd 11:0:0:0: [sdi] Write Protect is off
[   31.032235] sd 11:0:0:0: [sdi] Mode Sense: 73 00 00 08
[   31.037483] sd 11:0:0:0: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   31.066991]  sdi:
[   31.075719]  sdi1
[   31.124713] sd 11:0:0:0: [sdi] Attached SCSI disk
[   31.147407] md: bind<sdi1>
[   31.712366] raid5: device sdi1 operational as raid disk 4
[   31.713153]  disk 4, o:1, dev:sdi1
[   33.112975]  disk 4, o:1, dev:sdi1
[  297.528544] sd 11:0:0:0: [sdi] Sense Key : Recovered Error [current] [descriptor]
[  297.528573] sd 11:0:0:0: [sdi] Add. Sense: ATA pass through information available
[  297.591382] sd 11:0:0:0: [sdi] Sense Key : Recovered Error [current] [descriptor]
[  297.591407] sd 11:0:0:0: [sdi] Add. Sense: ATA pass through information available

I don't see anything glaring.

You should be able to force an assembly anyway (using the --force flag)
but I'd make sure you know exactly what the issue is first, otherwise
this is likely to happen again.

Do you think that the controller is dropping out? I have 4 drives on
one controller (AOC-USAS-L8i) and 5 drives on the other controller
(same make/model), but I think they are connected sequentially, as in
sd[efghi] should be on one controller and sd[jklm] should be on the
other. Is there any easy way to verify?
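
One possible way to check is to resolve each disk's sysfs link and look at
the PCI path of the controller it sits behind. The sketch below is only
that: the helper names controller_of and list_controllers are made up
here, the sample path in the comment is illustrative rather than taken
from this machine, and the sed pattern assumes the resolved path contains
a hostN component (the usual layout for SCSI/SATA/SAS disks).

```shell
#!/bin/sh
# controller_of (hypothetical helper): given a resolved sysfs path for a
# disk, strip everything from the "/hostN/" component onwards so that only
# the PCI path of the controller remains.
# Illustrative input, not from this machine:
#   /sys/devices/pci0000:00/0000:00:1c.0/0000:03:00.0/host11/target11:0:0/11:0:0:0/block/sdi
controller_of() {
    printf '%s\n' "$1" | sed 's|/host[0-9][0-9]*/.*||'
}

# list_controllers (hypothetical helper): print "disk<TAB>controller path"
# for every sd* disk known to the kernel.
list_controllers() {
    for dev in /sys/block/sd*; do
        [ -e "$dev" ] || continue          # glob matched nothing
        printf '%s\t%s\n' "${dev##*/}" "$(controller_of "$(readlink -f "$dev")")"
    done
}
```

Calling list_controllers prints one line per disk; disks that share the
same PCI path should be on the same card. Where udev populates it,
`ls -l /dev/disk/by-path/` gives a similar view.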

Roger Heflin wrote:

Andrew Dunn wrote:

This is kind of interesting:

storrgie@ALEXANDRIA:~$ sudo mdadm --assemble --force /dev/md0
mdadm: no devices found for /dev/md0

All of the devices are there in /dev, so I wanted to examine them:

storrgie@ALEXANDRIA:~$ sudo mdadm --examine /dev/sde1
/dev/sde1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 397e0b3f:34cbe4cc:613e2239:070da8c8 (local to host ALEXANDRIA)
  Creation Time : Fri Nov  6 07:06:34 2009
     Raid Level : raid6
  Used Dev Size : 976759808 (931.51 GiB 1000.20 GB)
     Array Size : 6837318656 (6520.58 GiB 7001.41 GB)
   Raid Devices : 9
  Total Devices : 9
Preferred Minor : 0

    Update Time : Sun Nov  8 08:57:04 2009
          State : clean
 Active Devices : 5
Working Devices : 5
 Failed Devices : 4
  Spare Devices : 0
       Checksum : 4ff41c5f - correct
         Events : 43

     Chunk Size : 1024K

      Number   Major   Minor   RaidDevice State
this     0       8       65        0      active sync   /dev/sde1

   0     0       8       65        0      active sync   /dev/sde1
   1     1       8       81        1      active sync   /dev/sdf1
   2     2       8       97        2      active sync   /dev/sdg1
   3     3       8      113        3      active sync   /dev/sdh1
   4     4       0        0        4      faulty removed
   5     5       0        0        5      faulty removed
   6     6       0        0        6      faulty removed
   7     7       0        0        7      faulty removed
   8     8       8      193        8      active sync   /dev/sdm1

First raid device shows the failures....

One of the 'removed' devices:

storrgie@ALEXANDRIA:~$ sudo mdadm --examine /dev/sdi1
/dev/sdi1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 397e0b3f:34cbe4cc:613e2239:070da8c8 (local to host ALEXANDRIA)
  Creation Time : Fri Nov  6 07:06:34 2009
     Raid Level : raid6
  Used Dev Size : 976759808 (931.51 GiB 1000.20 GB)
     Array Size : 6837318656 (6520.58 GiB 7001.41 GB)
   Raid Devices : 9
  Total Devices : 9
Preferred Minor : 0

    Update Time : Sun Nov  8 08:53:30 2009
          State : active
 Active Devices : 9
Working Devices : 9
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 4ff41b2f - correct
         Events : 21

     Chunk Size : 1024K

      Number   Major   Minor   RaidDevice State
this     4       8      129        4      active sync   /dev/sdi1

   0     0       8       65        0      active sync   /dev/sde1
   1     1       8       81        1      active sync   /dev/sdf1
   2     2       8       97        2      active sync   /dev/sdg1
   3     3       8      113        3      active sync   /dev/sdh1
   4     4       8      129        4      active sync   /dev/sdi1
   5     5       8      145        5      active sync   /dev/sdj1
   6     6       8      161        6      active sync   /dev/sdk1
   7     7       8      177        7      active sync   /dev/sdl1
   8     8       8      193        8      active sync   /dev/sdm1
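
The two superblocks above disagree: sde1 is at Events 43 and lists four
members as faulty, while sdi1 stopped at Events 21 and still believes all
nine are active. To see that comparison for every member at once, a small
filter over the --examine output may help. This is a rough sketch;
summarize_examine is a made-up helper name, and it assumes the
"/dev/...:" header and "Events : N" line layout shown in the output above.

```shell
#!/bin/sh
# summarize_examine (hypothetical helper): read `mdadm --examine` output
# on stdin and print one "device events" pair per member.
summarize_examine() {
    awk '
        /^\/dev\// { dev = $1; sub(/:$/, "", dev) }   # header line, e.g. "/dev/sde1:"
        $1 == "Events" { print dev, $3 }              # e.g. "Events : 43"
    '
}
```

It could be used as
`sudo mdadm --examine /dev/sd[e-m]1 | summarize_examine`; members whose
count lags behind the others are the ones that dropped out first.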


Did you check dmesg and see if there were errors on those disks?

-- 
Andrew Dunn
http://agdunn.net
