Thread (9 messages), 2 authors, 2013-10-21

Re: Advice recovering from interrupted grow on RAID5 array

From: John Yates <hidden>
Date: 2013-10-21 16:29:15

On Sun, Oct 20, 2013 at 9:09 PM, NeilBrown [off-list ref] wrote:
On Thu, 17 Oct 2013 01:36:28 -0400 John Yates [off-list ref] wrote:
On Wed, Oct 16, 2013 at 8:07 PM, NeilBrown [off-list ref] wrote:
On Wed, 16 Oct 2013 09:02:52 -0400 John Yates [off-list ref] wrote:
On Wed, Oct 16, 2013 at 1:26 AM, NeilBrown [off-list ref] wrote:
On Mon, 14 Oct 2013 21:59:45 -0400 John Yates [off-list ref] wrote:
Midway through a RAID5 grow operation from 5 to 6 USB connected
drives, system logs show that the kernel lost communication with some
of the drive ports which has left my array in a state that I have not
been able to reassemble. After reseating the cable connections and
rebooting, all of the drives appear to be functioning normally, so
hopefully the data is still intact. I need advice on recovery steps
for the array.

It appears that each drive failed in quick succession with /dev/sdc1
being the last standing and having the others marked as missing in its
superblock. The superblocks of the other drives show all drives as
available. (--examine output below)
mdadm --assemble /dev/md127 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1
mdadm: too-old timestamp on backup-metadata on device-5
mdadm: If you think it is should be safe, try 'export MDADM_GROW_ALLOW_OLD=1'
mdadm: /dev/md127 assembled from 1 drives - not enough to start the array.
Did you try following the suggestion and run

 export MDADM_GROW_ALLOW_OLD=1

and then try the --assemble again?

NeilBrown
Yes I did, thanks. Not much change though. It accepts the timestamp,
but then appears not to use it.

mdadm --assemble /dev/md127 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 --verbose
mdadm: looking for devices for /dev/md127
mdadm: /dev/sdb1 is identified as a member of /dev/md127, slot 4.
mdadm: /dev/sdc1 is identified as a member of /dev/md127, slot 3.
mdadm: /dev/sdd1 is identified as a member of /dev/md127, slot 2.
mdadm: /dev/sde1 is identified as a member of /dev/md127, slot 0.
mdadm: /dev/sdf1 is identified as a member of /dev/md127, slot 1.
mdadm: /dev/sdg1 is identified as a member of /dev/md127, slot 5.
mdadm: :/dev/md127 has an active reshape - checking if critical section needs to be restored
mdadm: accepting backup with timestamp 1381360844 for array with timestamp 1381729948
mdadm: backup-metadata found on device-5 but is not needed
mdadm: added /dev/sdf1 to /dev/md127 as 1
mdadm: added /dev/sdd1 to /dev/md127 as 2
mdadm: added /dev/sdc1 to /dev/md127 as 3
mdadm: added /dev/sdb1 to /dev/md127 as 4 (possibly out of date)
mdadm: added /dev/sdg1 to /dev/md127 as 5 (possibly out of date)
mdadm: added /dev/sde1 to /dev/md127 as 0
mdadm: /dev/md127 assembled from 4 drives - not enough to start the array.
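
The two timestamps in the "accepting backup" line above are seconds since the Unix epoch. As a rough aid to reading them, here is a minimal Python sketch that converts both to UTC and shows how far the backup metadata lags the array metadata. The variable names are illustrative only and are not anything mdadm uses.

```python
from datetime import datetime, timezone

# Values copied from the "accepting backup with timestamp ..." line.
backup_ts = 1381360844
array_ts = 1381729948

# Interpret both as seconds since the Unix epoch, in UTC.
backup_time = datetime.fromtimestamp(backup_ts, tz=timezone.utc)
array_time = datetime.fromtimestamp(array_ts, tz=timezone.utc)
lag = array_time - backup_time

print("backup:", backup_time.isoformat())  # 2013-10-09T23:20:44+00:00
print("array: ", array_time.isoformat())   # 2013-10-14T05:52:28+00:00
print("lag:   ", lag)                       # 4 days, 6:31:44
```

So the backup metadata is a little over four days older than the array metadata, which is presumably why mdadm wants MDADM_GROW_ALLOW_OLD=1 before it will accept it.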

What about with MDADM_GROW_ALLOW_OLD=1 *and* --force ??

If that doesn't work, please add --verbose as well, and report the output.

NeilBrown
Thanks Neil. I had tried that as well (output below). I'm wondering if
there is a way to fix the metadata for /dev/sdc1 since that seems to
be the odd one where the --examine data indicates that the other disks
are all bad when I don't believe they really are (just the result of a
partial kernel or driver crash). I have read about some people zeroing
the superblock on a device so that it can be recreated, but I am not
sure exactly how that works and am hesitant to try it since a reshape
was in progress. I have also read about people having had success by
re-running the original mdadm --create while leaving the data intact,
but again I am hesitant to try that, especially because of the reshape
state.

Or... maybe this all has more to do with the Update Time, since the
output seems to indicate 4 drives are usable. All of the drives have
the same Update Time except for /dev/sdc1 which is about 5 minutes
later than the rest. Since it is the fourth device, perhaps the
assemble is satisfied with devices 0, 1, 2, 3, but then seeing an
Update Time on devices 4 and 5 that is earlier than device 3, it
marks them as "possibly out of date" and stops trying to assemble the
array. Hard to tell, but I still would not have any idea how to
overcome that scenario. I appreciate your help!

# export MDADM_GROW_ALLOW_OLD=1
# mdadm --assemble /dev/md127 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 --force --verbose
mdadm: looking for devices for /dev/md127
mdadm: /dev/sdb1 is identified as a member of /dev/md127, slot 4.
mdadm: /dev/sdc1 is identified as a member of /dev/md127, slot 3.
mdadm: /dev/sdd1 is identified as a member of /dev/md127, slot 2.
mdadm: /dev/sde1 is identified as a member of /dev/md127, slot 0.
mdadm: /dev/sdf1 is identified as a member of /dev/md127, slot 1.
mdadm: /dev/sdg1 is identified as a member of /dev/md127, slot 5.
mdadm: :/dev/md127 has an active reshape - checking if critical section needs to be restored
mdadm: accepting backup with timestamp 1381360844 for array with timestamp 1381729948
mdadm: backup-metadata found on device-5 but is not needed
mdadm: added /dev/sdf1 to /dev/md127 as 1
mdadm: added /dev/sdd1 to /dev/md127 as 2
mdadm: added /dev/sdc1 to /dev/md127 as 3
mdadm: added /dev/sdb1 to /dev/md127 as 4 (possibly out of date)
mdadm: added /dev/sdg1 to /dev/md127 as 5 (possibly out of date)
mdadm: added /dev/sde1 to /dev/md127 as 0
mdadm: /dev/md127 assembled from 4 drives - not enough to start the array.
That shouldn't happen.  With '-f' it should force the event count of either b1
or g1 (or maybe both) to match the others.

What version of mdadm are you using? (mdadm -V)
mdadm - v3.3 - 3rd September 2013
(Arch Linux)
Maybe try the latest
  git clone git://git.neil.brown.name/mdadm
  cd mdadm
  make mdadm
  ./mdadm .....

NeilBrown
OK, trying the latest...

# ./mdadm -V
mdadm - v3.3-27-ga4921f3 - 16th October 2013

# uname -rv
3.11.4-1-ARCH #1 SMP PREEMPT Sat Oct 5 21:22:51 CEST 2013

No change in the result and I don't see errors anywhere indicating a
problem writing to /dev/sdb1 or /dev/sdg1. Are there any more debug
options that I am overlooking?

# ./mdadm --assemble /dev/md127 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 -f -v
mdadm: looking for devices for /dev/md127
mdadm: /dev/sdb1 is identified as a member of /dev/md127, slot 4.
mdadm: /dev/sdc1 is identified as a member of /dev/md127, slot 3.
mdadm: /dev/sdd1 is identified as a member of /dev/md127, slot 2.
mdadm: /dev/sde1 is identified as a member of /dev/md127, slot 0.
mdadm: /dev/sdf1 is identified as a member of /dev/md127, slot 1.
mdadm: /dev/sdg1 is identified as a member of /dev/md127, slot 5.
mdadm: :/dev/md127 has an active reshape - checking if critical section needs to be restored
mdadm: accepting backup with timestamp 1381360844 for array with timestamp 1381729948
mdadm: backup-metadata found on device-5 but is not needed
mdadm: added /dev/sdf1 to /dev/md127 as 1
mdadm: added /dev/sdd1 to /dev/md127 as 2
mdadm: added /dev/sdc1 to /dev/md127 as 3
mdadm: added /dev/sdb1 to /dev/md127 as 4 (possibly out of date)
mdadm: added /dev/sdg1 to /dev/md127 as 5 (possibly out of date)
mdadm: added /dev/sde1 to /dev/md127 as 0
mdadm: /dev/md127 assembled from 4 drives - not enough to start the array.

# ./mdadm --examine /dev/sd[bcdefg]1 | egrep '/dev/sd|Events|Update|Role|State'
/dev/sdb1:
          State : clean
    Update Time : Mon Oct 14 01:52:28 2013
         Events : 155279
   Device Role : Active device 4
   Array State : AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdc1:
          State : clean
    Update Time : Mon Oct 14 01:57:26 2013
         Events : 155281
   Device Role : Active device 3
   Array State : ...A.. ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdd1:
          State : clean
    Update Time : Mon Oct 14 01:52:28 2013
         Events : 155281
   Device Role : Active device 2
   Array State : AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sde1:
          State : clean
    Update Time : Mon Oct 14 01:52:28 2013
         Events : 155281
   Device Role : Active device 0
   Array State : AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdf1:
          State : clean
    Update Time : Mon Oct 14 01:52:28 2013
         Events : 155281
   Device Role : Active device 1
   Array State : AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/sdg1:
          State : clean
    Update Time : Mon Oct 14 01:52:28 2013
         Events : 155279
   Device Role : Active device 5
   Array State : AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
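
To make the pattern in the --examine summary above easier to see, here is a minimal Python sketch that parses text in that filtered format and reports which members are behind on the event count, and which superblocks list other members as missing. The function names `parse_examine` and `lagging_members` are hypothetical helpers written for this illustration. The sample text is abbreviated from the output above. This only reads text; it does not touch the array.

```python
import re

# Abbreviated copy of the filtered --examine output from this message.
SAMPLE = """\
/dev/sdb1:
         Events : 155279
   Array State : AAAAAA
/dev/sdc1:
         Events : 155281
   Array State : ...A..
/dev/sdd1:
         Events : 155281
   Array State : AAAAAA
/dev/sde1:
         Events : 155281
   Array State : AAAAAA
/dev/sdf1:
         Events : 155281
   Array State : AAAAAA
/dev/sdg1:
         Events : 155279
   Array State : AAAAAA
"""

def parse_examine(text):
    """Return {device: {field: value}} from filtered --examine text."""
    devices = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"^(/dev/\S+):$", line)
        if header:
            current = header.group(1)
            devices[current] = {}
        elif current and ":" in line:
            key, _, value = line.partition(":")
            devices[current][key.strip()] = value.strip()
    return devices

def lagging_members(devices):
    """Return {device: events_behind} for members below the highest count."""
    events = {dev: int(fields["Events"]) for dev, fields in devices.items()}
    newest = max(events.values())
    return {dev: newest - n for dev, n in events.items() if n < newest}

devices = parse_examine(SAMPLE)
behind = lagging_members(devices)
# Members whose own superblock records at least one other slot as missing.
sees_missing = [d for d, f in devices.items() if "." in f["Array State"].split()[0]]

print("behind on events:", behind)
print("superblock shows missing members:", sees_missing)
```

On the data above this flags /dev/sdb1 and /dev/sdg1 as two events behind, and /dev/sdc1 as the only member whose superblock marks the others as missing.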



Not sure if this is significant, but at boot time they are all shown as
spares, though the indexing seems odd in that index 2 is skipped:

# cat /proc/mdstat
Personalities :
md127 : inactive sdf1[1](S) sde1[0](S) sdg1[6](S) sdd1[3](S) sdb1[5](S) sdc1[4](S)
      11717972214 blocks super 1.2

unused devices: <none>


Then I do an `mdadm --stop /dev/md127` before trying the assemble.