AW: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock

From: Großkreutz, Julian<Julian.Grosskreutz@med.uni-jena.de>
Date: 2014-01-14 14:00:27

Hi Phil,

great help, a lot of lessons learned on my part, thanks again.

I will not try to rescue the raid; time constraints forbid it. From now on, though, I will implement a strict minimum hardware requirements policy :-)

Regards

Julian

-----Original Message-----
From: Phil Turmel [mailto:philip@turmel.org]
Sent: Tuesday, 14 January 2014 14:15
To: Großkreutz, Julian; linux-raid@vger.kernel.org
Cc: neilb@suse.de
Subject: Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock

On 01/14/2014 05:31 AM, Großkreutz, Julian wrote:

Hi Phil,

thanks again for bearing with me.

No problem.

Model: ATA ST3000DM001-9YN1 (scsi)

Aside: This model looks familiar.  I'm pretty sure these drives are
desktop models that lack scterc support.  Meaning they are *not*
generally suitable for raid duty.  Search the archives for
combinations of "timeout mismatch", "scterc", "URE", and "scrub" for
a full explanation.  If I've guessed correctly, you *must* use the
driver timeout work-around before proceeding.

Yes I did, and smartctl showed no significant problems.

?.  What did "smartctl -l scterc" say?  If it says unsupported, you have a problem.  The workaround is to set the driver timeouts to ~180 seconds for each such drive.

If scterc is supported, but disabled, you can set 7-second timeouts with "smartctl -l scterc,70,70", but you must do so on every power cycle.
Either way, you need boot-time scripting or distro support.
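
The choice between the two workarounds described above can be sketched as follows. This is only a minimal sketch under stated assumptions, not a tested boot script: the function name `timeout_commands` and the three state labels are hypothetical, and `/sys/block/<dev>/device/timeout` is assumed to be where the kernel exposes the driver timeout for these disks. The code only builds command strings; it does not run anything.

```python
def timeout_commands(dev, scterc_state):
    """Return the shell command(s) to run at boot for one drive.

    dev          -- kernel name such as "sda" (no /dev/ prefix)
    scterc_state -- "unsupported", "disabled" or "enabled"; hypothetical
                    labels for what "smartctl -l scterc" reported
    """
    if scterc_state == "unsupported":
        # No error recovery control: raise the driver timeout to ~180 s.
        # Assumption: the timeout lives at this sysfs path.
        return [f"echo 180 > /sys/block/{dev}/device/timeout"]
    if scterc_state == "disabled":
        # ERC is available but off: set 7.0 s read and write limits.
        return [f"smartctl -l scterc,70,70 /dev/{dev}"]
    if scterc_state == "enabled":
        # Drive already powers up with a sane setting; nothing to do.
        return []
    raise ValueError(f"unknown scterc state: {scterc_state!r}")


if __name__ == "__main__":
    # Example: treat every drive as lacking scterc support.
    for dev in ["sda", "sdb", "sdc"]:
        for cmd in timeout_commands(dev, "unsupported"):
            print(cmd)
```

Because neither setting survives a power cycle, the generated commands would have to be run from whatever boot-time hook the distribution provides.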

Raid-rated drives power up with a reasonable setting here.

The 10 year old
server (supermicro enterprise grade dual Xeon with 8 GB ECC RAM) had
started to create problems early January which is why I wanted to move
the drives to a new server in the first place, to then transfer the
data to a new set of enterprise grade disks. I had checked the memory
and the disks in a burn in for several days including time out and
power saving before I set up the raid 2012/2013, and did not have any issues then.

Ok.  This makes sense.

One of the reasons I tend to use mdadm is that I am able to utilize
existing hardware to create bridging solutions until money comes in
for better hardware, and moving an mdadm raid has so far never created
a serious problem.

Many people discover the timeout problem the first time they have an otherwise correctable read error in their array, and the array falls apart instead.  This list's archives are well-populated with such cases.

So attached you will find hexdumps of 64k of /dev/sd[a-h]2 at sector
0 and 262144 which shows the superblock 1.2 on sd[fgh]2, not on
sd[a-e]2, but may help to identify data_offset; I suspect it is 2048
on sd[a-e]2 and 262144 on sd[fgh]2.

Jackpot!  LVM2 embedded backup data at the correct location for mdadm
data offset == 262144.  And on /dev/sda2, which is the only device
that should have it (first device in the raid).

From /dev/sda2 @ 262144:

00001200  76 67 5f 6e 65 64 69 67  73 30 32 20 5d 0a 69 64  |vg_nedigs02 ].id|
00001210  20 3d 20 22 32 4c 62 48  71 64 2d 72 67 42 9f 6e  | = "2LbHqd-rgB.n|
00001220  45 4a 75 31 2d 32 52 36  31 2d 41 35 f5 75 2d 6e  |EJu1-2R61-A5.u-n|
00001230  49 58 53 2d 66 79 4f 36  33 73 22 0a 73 65 3a 01  |IXS-fyO63s".se:.|
00001240  6f 20 3d 20 33 36 0a 66  6f 72 6d 61 ca 24 3d 20  |o = 36.forma.$= |
00001250  22 6c 76 6d 32 22 20 23  20 69 6e 66 6f 72 6b ac  |"lvm2" # infork.|

...

00001a70  20 31 33 37 35 32 38 37  39 37 39 09 23 20 d2 32  | 1375287979.# .2|
00001a80  64 20 4a 75 6c 20 33 31  20 31 38 3a af 37 3a 31  |d Jul 31 18:.7:1|
00001a90  39 20 32 30 31 33 0a 0a  00 00 00 00 00 00 ee 12  |9 2013..........|

Note the creation date/time at the end (with a corrupted byte):

Jul 31 18:?7:19 2013

There are other corrupted bytes scattered around.  I'd be worried
about the RAM in this machine.  Since you are using non-enterprise
drives, I'm going to go out on a limb here and guess that the server
doesn't have ECC ram...

see above

Understood.  With really old memory, double-faults in the ECC could have panic'd the server, leaving scattered data unwritten.

Consider performing an extended memcheck run to see what's going on.
Maybe move the entire stack of disks to another server.

That's what I did initially, moved it back because it failed, now will
move again into the new server before proceeding.

Ok.

Based on the signature discovered above, we should be able to
--create --assume-clean with the modern default data offset.  We know
the following device roles:

/dev/sda2 == 0
/dev/sdf2 == 5
/dev/sdg2 == 6
/dev/sdh2 == spare

So /dev/sdh2 should be left out until the array is working.

Please re-execute the "mdadm -E" reports for /dev/sd[fgh]2 and show
them uncut.  (Use the lasted mdadm.)  That should fill in the likely
device order of the remaining drives.

Hmmm.  Typo on my part: s/lasted/latest/  Newer mdadm will give more information.  In particular, I wanted the tail of each report where each device lists what it last knew about all of the other devices' roles.

[root@livecd mnt]# mdadm -E /dev/sd[fgh]2

/dev/sdf2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
           Name : 1
  Creation Time : Wed Jul 31 18:24:38 2013
     Raid Level : raid6
   Raid Devices : 7

 Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
     Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
  Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : d5a16cb2:ff41b9a5:cbbf12b7:3750026d

    Update Time : Mon Dec 16 01:16:26 2013
       Checksum : ee921c43 - correct
         Events : 327

         Layout : left-symmetric
     Chunk Size : 256K

   Device Role : Active device 5
   Array State : A.AAAAA ('A' == active, '.' == missing)

I was expecting more info after this.

/dev/sdg2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
           Name : 1
  Creation Time : Wed Jul 31 18:24:38 2013
     Raid Level : raid6
   Raid Devices : 7

 Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
     Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
  Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : a1e1e51b:d8912985:e51207a9:1d718292

    Update Time : Mon Dec 16 01:16:26 2013
       Checksum : 4ef01fe9 - correct
         Events : 327

         Layout : left-symmetric
     Chunk Size : 256K

   Device Role : Active device 6
   Array State : A.AAAAA ('A' == active, '.' == missing)

And here.

/dev/sdh2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
           Name : 1
  Creation Time : Wed Jul 31 18:24:38 2013
     Raid Level : raid6
   Raid Devices : 7

 Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
     Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
  Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 030cb9a7:76a48b3c:b3448369:fcf013e1

    Update Time : Mon Dec 16 01:16:26 2013
       Checksum : a1330e97 - correct
         Events : 327

         Layout : left-symmetric
     Chunk Size : 256K

   Device Role : spare
   Array State : A.AAAAA ('A' == active, '.' == missing)

And here.
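
As a sanity check on the three reports above, the sizes can be cross-checked with a little arithmetic. This is a sketch that assumes the size fields in this particular report are counted in 512-byte sectors, which is consistent with the GiB/GB figures printed next to them; the variable names are illustrative only.

```python
SECTOR_BYTES = 512  # assumption: sizes in this report are 512-byte sectors

raid_devices = 7            # "Raid Devices"
parity_devices = 2          # raid6 stores two parity blocks per stripe
used_dev_size = 5857158656  # "Used Dev Size", sectors
array_size = 29285793280    # "Array Size", sectors
data_offset = 262144        # "Data Offset", sectors

# Usable capacity is the per-device size times the number of data devices.
data_devices = raid_devices - parity_devices
expected_array_size = data_devices * used_dev_size

# Where the array data starts on each member, in MiB.
data_offset_mib = data_offset * SECTOR_BYTES // (1024 * 1024)

print(expected_array_size == array_size)  # True
print(data_offset_mib)                    # 128
```

The reported array size is exactly five times the used device size, as expected for a seven-device raid6, and the data offset corresponds to 128 MiB, the position at which the LVM2 text was found on /dev/sda2.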

Also, it is important that you document which drive serial numbers
are currently occupying the different device names.  An excerpt from
"ls -l /dev/disk/by-id/" would do.

scsi-SATA_ST3000DM001-9YN_S1F026VJ -> ../../sda
scsi-SATA_ST3000DM001-9YN_W1F0TB3C -> ../../sdb
scsi-SATA_ST3000DM001-9YN_S1F04KAK -> ../../sdc
scsi-SATA_ST3000DM001-9YN_W1F0RWJY -> ../../sdd
scsi-SATA_ST3000DM001-9YN_S1F08N7Q -> ../../sde
scsi-SATA_ST3000DM001-9YN_Z1F1F3TC -> ../../sdf
scsi-SATA_ST3000DM001-9YN_W1F1ZZ9T -> ../../sdg
scsi-SATA_ST3000DM001-9YN_Z1F1X0AC -> ../../sdh

Ok.  Be sure to recheck this list any time you boot, since the device order matters.
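
One way to make that recheck less error-prone is to compare a saved excerpt against a fresh one. The sketch below is illustrative only: `parse_by_id` and `changed_devices` are hypothetical helper names, and the input is assumed to be lines in the `name -> ../../sdX` form shown above.

```python
def parse_by_id(lines):
    """Map kernel device name -> by-id name from 'name -> ../../sdX' lines."""
    mapping = {}
    for line in lines:
        if "->" not in line:
            continue
        left, target = (part.strip() for part in line.split("->", 1))
        if not left or not target:
            continue
        # In full "ls -l" output the link name is the last field before the arrow.
        by_id = left.split()[-1]
        mapping[target.rsplit("/", 1)[-1]] = by_id
    return mapping


def changed_devices(before, after):
    """Return kernel names whose by-id name differs between two listings."""
    names = before.keys() | after.keys()
    return sorted(dev for dev in names if before.get(dev) != after.get(dev))


if __name__ == "__main__":
    saved = parse_by_id([
        "scsi-SATA_ST3000DM001-9YN_S1F026VJ -> ../../sda",
        "scsi-SATA_ST3000DM001-9YN_W1F0TB3C -> ../../sdb",
    ])
    current = parse_by_id([
        "scsi-SATA_ST3000DM001-9YN_W1F0TB3C -> ../../sda",
        "scsi-SATA_ST3000DM001-9YN_S1F026VJ -> ../../sdb",
    ])
    print(changed_devices(saved, current))
```

Any name the comparison reports means the serial-to-device mapping has shifted since the last boot, so device lists written down earlier no longer apply.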

I am a bit more relaxed now because I found that a scheduled transfer
of the data to the university tape robot had completed before
Christmas. So this local archive mirror is (luckily) not critical. I
still want to understand whether all this is just a result of shaky
hardware, or an mdadm (misuse) issue. Losing (all superblocks on) five
drives in a large software raid 6 instead of bytes is not something I
would like to repeat any time soon by, e.g., mishandling mdadm.

I think you skated over the edge due to a flaky motherboard.  mdadm can't fix that.  In fact, since you have a backup, I personally wouldn't bother with further reconstruction efforts.  If you have a recent vgcfgbackup, it's doable, but I have little confidence in the device order: [a????fg], probably [abcdefg].  There are 4! == 24 permutations there, each of which will require a vgcfgrestore before you can check the reconstruction with "fsck -n".

We have then

Wed Jul 31 18:24:38 2013 on sd[f-h]2 for creation of the raid6 and Wed
Jul 31 18:?7:19 2013 for creation of the lvm group

could well be.

I don't see any way to get such a timestamp except "certainly was".

So I will move the disks to the new server, make 1:1 copies to new
drives and then attempt an assembly using --assume-clean in which
order ?

All permutations of [a????fg] with b, c, d, and e.

Try likely combinations gleaned from "mdadm -E" reports first to shortcut the process.
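
The 24 candidate orders can be enumerated mechanically. The following is a minimal sketch, assuming the fixed slots given above (sda2 first, sdf2 and sdg2 in roles 5 and 6); the function name `candidate_orders` is hypothetical, and the code only lists device orders. It does not build or run any mdadm command.

```python
from itertools import permutations


def candidate_orders(first="sda2",
                     unknown=("sdb2", "sdc2", "sdd2", "sde2"),
                     last=("sdf2", "sdg2")):
    """Return every device order matching the pattern [a????fg]."""
    return [[first, *middle, *last] for middle in permutations(unknown)]


if __name__ == "__main__":
    orders = candidate_orders()
    print(len(orders))  # 4! == 24
    for order in orders:
        print(" ".join("/dev/" + dev for dev in order))
```

The first order produced is the plain alphabetical one, [abcdefg], which is also the guess named above as most probable. Each candidate should only ever be tried against the 1:1 copies, never the original drives.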

Thanks so much, I have learned a lot already.

You are welcome, and good luck.

Regards,

Phil


Universitätsklinikum Jena - Bachstrasse 18 - D-07743 Jena
The legally required disclosures can be found at http://www.uniklinikum-jena.de/Pflichtangaben.html
