Thread: 9 messages, 3 authors, 2014-01-15

Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock

From: Phil Turmel <hidden>
Date: 2014-01-14 13:14:54

On 01/14/2014 05:31 AM, Großkreutz, Julian wrote:
Hi Phil,

thanks again for bearing with me.
No problem.
Model: ATA ST3000DM001-9YN1 (scsi)
Aside: This model looks familiar.  I'm pretty sure these drives are
desktop models that lack scterc support.  Meaning they are *not*
generally suitable for raid duty.  Search the archives for combinations
of "timeout mismatch", "scterc", "URE", and "scrub" for a full
explanation.  If I've guessed correctly, you *must* use the driver
timeout work-around before proceeding.
Yes I did, and smartctl showed no significant problems.
?.  What did "smartctl -l scterc" say?  If it says unsupported, you have
a problem.  The workaround is to set the driver timeouts to ~180 seconds
for each such drive.

If scterc is supported, but disabled, you can set 7-second timeouts with
"smartctl -l scterc,70,70", but you must do so on every power cycle.
Either way, you need boot-time scripting or distro support.

Raid-rated drives power up with a reasonable setting here.
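
A minimal Python sketch of the boot-time scripting described above, under stated assumptions: the substrings matched in the smartctl output and the helper names (plan_for, apply_to) are illustrative and not part of smartmontools, so check them against what your smartctl version actually prints. The sysfs file /sys/block/<dev>/device/timeout holds the driver command timeout in seconds.

```python
import subprocess
from pathlib import Path

DRIVER_TIMEOUT_S = 180  # fallback for drives without usable scterc


def plan_for(smartctl_output: str) -> str:
    """Pick an action from the text printed by 'smartctl -l scterc'.

    Assumption: a disabled setting is reported with the word 'Disabled'
    and an active one with a value in 'seconds'. Anything else
    (unsupported, empty, unreadable) falls back to the long driver timeout.
    """
    text = smartctl_output.lower()
    if "disabled" in text:
        return "enable-scterc"
    if "seconds" in text:
        return "ok"
    return "driver-timeout"


def apply_to(dev: str) -> str:
    """Query one drive (e.g. 'sda') and apply the chosen action. Needs root."""
    out = subprocess.run(
        ["smartctl", "-l", "scterc", f"/dev/{dev}"],
        capture_output=True, text=True,
    ).stdout
    action = plan_for(out)
    if action == "enable-scterc":
        # 70 deciseconds == 7.0 seconds for both reads and writes
        subprocess.run(
            ["smartctl", "-l", "scterc,70,70", f"/dev/{dev}"], check=True
        )
    elif action == "driver-timeout":
        Path(f"/sys/block/{dev}/device/timeout").write_text(
            f"{DRIVER_TIMEOUT_S}\n"
        )
    return action
```

Calling apply_to("sda") through apply_to("sdh") from a boot script would cover the eight drives in this thread. Neither setting survives a power cycle, so it has to run on every boot.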
The 10 year old
server (supermicro enterprise grade dual Xeon with 8 GB ECC RAM) had
started to create problems early January which is why I wanted to move
the drives to a new server in the first place, to then transfer the data
to a new set of enterprise grade disks. I had checked the memory and the
disks in a burn in for several days including time out and power saving
before I set up the raid 2012/2013, and did not have any issues then.
Ok.  This makes sense.
One of the reasons I tend to use mdadm is that I am able to utilize
existing hardware to create bridging solutions until money comes in for
better hardware, and moving an mdadm raid has so far never created a
serious problem.
Many people discover the timeout problem the first time they have an
otherwise correctable read error in their array, and the array falls
apart instead.  This list's archives are well-populated with such cases.
So attached you will find hexdumps of 64k of /dev/sd[a-h]2 at sector 0
and 262144, which show the superblock 1.2 on sd[fgh]2, not on sd[a-e]2,
but may help to identify data_offset; I suspect it is 2048 on sd[a-e]2
and 262144 on sd[fgh]2.
Jackpot!  LVM2 embedded backup data at the correct location for mdadm
data offset == 262144.  And on /dev/sda2, which is the only device that
should have it (first device in the raid).

From /dev/sda2 @ 262144:
00001200  76 67 5f 6e 65 64 69 67  73 30 32 20 5d 0a 69 64  |vg_nedigs02 ].id|
00001210  20 3d 20 22 32 4c 62 48  71 64 2d 72 67 42 9f 6e  | = "2LbHqd-rgB.n|
00001220  45 4a 75 31 2d 32 52 36  31 2d 41 35 f5 75 2d 6e  |EJu1-2R61-A5.u-n|
00001230  49 58 53 2d 66 79 4f 36  33 73 22 0a 73 65 3a 01  |IXS-fyO63s".se:.|
00001240  6f 20 3d 20 33 36 0a 66  6f 72 6d 61 ca 24 3d 20  |o = 36.forma.$= |
00001250  22 6c 76 6d 32 22 20 23  20 69 6e 66 6f 72 6b ac  |"lvm2" # infork.|
...
00001a70  20 31 33 37 35 32 38 37  39 37 39 09 23 20 d2 32  | 1375287979.# .2|
00001a80  64 20 4a 75 6c 20 33 31  20 31 38 3a af 37 3a 31  |d Jul 31 18:.7:1|
00001a90  39 20 32 30 31 33 0a 0a  00 00 00 00 00 00 ee 12  |9 2013..........|
Note the creation date/time at the end (with a corrupted byte):

Jul 31 18:?7:19 2013
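
As a cross-check, the decimal number printed just before that comment in the dump (1375287979) can be decoded as a Unix timestamp. A minimal sketch, assuming that number survived intact and that the server clock ran two hours ahead of UTC. Both are assumptions, and decode is an illustrative helper.

```python
from datetime import datetime, timedelta, timezone

CREATION_EPOCH = 1375287979  # decimal field from the dump at 0x1a70


def decode(epoch, utc_offset_hours=0):
    """Convert a Unix timestamp to a datetime at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(epoch, tz=tz)


print(decode(CREATION_EPOCH).strftime("%Y-%m-%d %H:%M:%S UTC"))
print(decode(CREATION_EPOCH, 2).strftime("%Y-%m-%d %H:%M:%S (UTC+2, assumed)"))
```

Under those assumptions the number decodes to 18:26:19 local time on Jul 31 2013. That agrees with the hour and seconds in the damaged text but not the minute digit, so either that digit or the number itself may have taken a hit as well.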

There are other corrupted bytes scattered around.  I'd be worried about
the RAM in this machine.  Since you are using non-enterprise drives, I'm
going to go out on a limb here and guess that the server doesn't have
ECC ram...
see above
Understood.  With really old memory, double-faults in the ECC could have
panic'd the server, leaving scattered data unwritten.
Consider performing an extended memcheck run to see what's going on.
Maybe move the entire stack of disks to another server.
That's what I did initially; I moved it back because it failed, and will
now move it again into the new server before proceeding.
Ok.
Based on the signature discovered above, we should be able to --create
--assume-clean with the modern default data offset.  We know the
following device roles:

/dev/sda2 == 0
/dev/sdf2 == 5
/dev/sdg2 == 6
/dev/sdh2 == spare

So /dev/sdh2 should be left out until the array is working.

Please re-execute the "mdadm -E" reports for /dev/sd[fgh]2 and show them
uncut.  (Use the lasted mdadm.)  That should fill in the likely device
order of the remaining drives.
Hmmm.  Typo on my part: s/lasted/latest/  Newer mdadm will give more
information.  In particular, I wanted the tail of each report where each
device lists what it last knew about all of the other devices' roles.
[root@livecd mnt]# mdadm -E /dev/sd[fgh]2

/dev/sdf2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
           Name : 1
  Creation Time : Wed Jul 31 18:24:38 2013
     Raid Level : raid6
   Raid Devices : 7

 Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
     Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
  Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : d5a16cb2:ff41b9a5:cbbf12b7:3750026d

    Update Time : Mon Dec 16 01:16:26 2013
       Checksum : ee921c43 - correct
         Events : 327

         Layout : left-symmetric
     Chunk Size : 256K

   Device Role : Active device 5
   Array State : A.AAAAA ('A' == active, '.' == missing)
I was expecting more info after this.
/dev/sdg2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
           Name : 1
  Creation Time : Wed Jul 31 18:24:38 2013
     Raid Level : raid6
   Raid Devices : 7

 Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
     Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
  Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : a1e1e51b:d8912985:e51207a9:1d718292

    Update Time : Mon Dec 16 01:16:26 2013
       Checksum : 4ef01fe9 - correct
         Events : 327

         Layout : left-symmetric
     Chunk Size : 256K

   Device Role : Active device 6
   Array State : A.AAAAA ('A' == active, '.' == missing)
And here.
/dev/sdh2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 32d82f84:fe30ac2e:f589aaef:bdd3e4c7
           Name : 1
  Creation Time : Wed Jul 31 18:24:38 2013
     Raid Level : raid6
   Raid Devices : 7

 Avail Dev Size : 5858314240 (2793.46 GiB 2999.46 GB)
     Array Size : 29285793280 (13964.55 GiB 14994.33 GB)
  Used Dev Size : 5857158656 (2792.91 GiB 2998.87 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 030cb9a7:76a48b3c:b3448369:fcf013e1

    Update Time : Mon Dec 16 01:16:26 2013
       Checksum : a1330e97 - correct
         Events : 327

         Layout : left-symmetric
     Chunk Size : 256K

   Device Role : spare
   Array State : A.AAAAA ('A' == active, '.' == missing)
And here.
Also, it is important that you document which drive serial numbers are
currently occupying the different device names.  An excerpt from "ls -l
/dev/disk/by-id/" would do.
scsi-SATA_ST3000DM001-9YN_S1F026VJ -> ../../sda
scsi-SATA_ST3000DM001-9YN_W1F0TB3C -> ../../sdb
scsi-SATA_ST3000DM001-9YN_S1F04KAK -> ../../sdc
scsi-SATA_ST3000DM001-9YN_W1F0RWJY -> ../../sdd
scsi-SATA_ST3000DM001-9YN_S1F08N7Q -> ../../sde
scsi-SATA_ST3000DM001-9YN_Z1F1F3TC -> ../../sdf
scsi-SATA_ST3000DM001-9YN_W1F1ZZ9T -> ../../sdg
scsi-SATA_ST3000DM001-9YN_Z1F1X0AC -> ../../sdh
Ok.  Be sure to recheck this list any time you boot, since the device
order matters.
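
One way to make that recheck routine is to read the by-id symlinks and compare them with the list recorded earlier. A rough Python sketch follows; the function names are made up for illustration, and it only assumes that the entries under /dev/disk/by-id are symlinks to the kernel device names, as in the listing above.

```python
import os


def serials_by_device(by_id_dir="/dev/disk/by-id"):
    """Map kernel names (e.g. 'sda') to their whole-disk by-id link names."""
    mapping = {}
    for entry in sorted(os.listdir(by_id_dir)):
        path = os.path.join(by_id_dir, entry)
        # Skip partition links such as '...-part2' and anything not a symlink.
        if not os.path.islink(path) or "-part" in entry:
            continue
        dev = os.path.basename(os.path.realpath(path))
        mapping.setdefault(dev, []).append(entry)
    return mapping


def moved_devices(recorded, current):
    """Return kernel names whose by-id links differ from the recorded ones."""
    names = set(recorded) | set(current)
    return sorted(n for n in names if recorded.get(n) != current.get(n))
```

Save the output of serials_by_device() once, then run moved_devices(saved, serials_by_device()) after each boot. An empty list means the serial-to-name assignment is unchanged.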
I am a bit more relaxed now because I found that a scheduled transfer of
the data to the university tape robot had completed before Christmas. So
this local archive mirror is (luckily) not critical. I still want to
understand whether all this is just a result of shaky hardware, or an
mdadm (misuse) issue. Losing (all superblocks on) five drives in a large
software raid 6 instead of bytes is not something I would like to repeat
any time soon by, e.g., mishandling mdadm.
I think you skated over the edge due to a flaky motherboard.  mdadm
can't fix that.  In fact, since you have a backup, I personally wouldn't
bother further reconstruction efforts.  If you have a recent
vgcfgbackup, it's doable, but I have little confidence in the device
order: [a????fg], probably [abcdefg].  There are 4! == 24 permutations
there, each of which will require a vgcfgrestore before you can check
the reconstruction with "fsck -n".
We have then

Wed Jul 31 18:24:38 2013 on sdf-h2 for creation of the raid6 and
Wed Jul 31 18:?7:19 2013 for creation of the lvm group

could well be.
I don't see any way to get such a timestamp except "certainly was".
So I will move the disks to the new server, make 1:1 copies to new
drives and then attempt an assembly using --assume-clean, in which
order?
All permutations of [a????fg] with b, c, d, and e.

Try likely combinations gleaned from "mdadm -E" reports first to
shortcut the process.
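
To keep track of the 24 candidates, something like the following Python sketch can list them, with sda2, sdf2 and sdg2 pinned to roles 0, 5 and 6. It only builds command strings and runs nothing. The array name /dev/md1 is a placeholder of mine, and the level, device count, chunk size, layout and metadata version are copied from the "mdadm -E" reports above. No data offset is passed, on the assumption that the installed mdadm defaults to 262144 sectors; verify that with "mdadm -E" after the first attempt.

```python
from itertools import permutations

# Roles known from the LVM2 signature and the surviving superblocks.
FIXED = {0: "/dev/sda2", 5: "/dev/sdf2", 6: "/dev/sdg2"}
# Members whose roles are unknown.
UNKNOWN = ["/dev/sdb2", "/dev/sdc2", "/dev/sdd2", "/dev/sde2"]
RAID_DEVICES = 7


def candidate_orders():
    """Yield every device order consistent with the known roles."""
    open_slots = [i for i in range(RAID_DEVICES) if i not in FIXED]
    for perm in permutations(UNKNOWN):
        order = [None] * RAID_DEVICES
        for slot, dev in FIXED.items():
            order[slot] = dev
        for slot, dev in zip(open_slots, perm):
            order[slot] = dev
        yield order


def create_command(order, md="/dev/md1"):
    """Format (not run) an mdadm --create --assume-clean line for one order."""
    return (
        f"mdadm --create {md} --assume-clean --metadata=1.2 --level=6 "
        f"--raid-devices={RAID_DEVICES} --chunk=256 --layout=left-symmetric "
        + " ".join(order)
    )


for n, order in enumerate(candidate_orders(), 1):
    print(f"{n:2d}: {create_command(order)}")
```

The first line printed is the [abcdefg] order. Each attempt should be made against the 1:1 copies and never the originals, and followed by the vgcfgrestore and "fsck -n" check mentioned earlier.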
Thanks so much, I have learned a lot already.
You are welcome, and good luck.

Regards,

Phil

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html