Thread (9 messages), 2 authors, 2012-05-09

Re: Failed drive while converting raid5 to raid6, then a hard reboot

From: Hákon Gíslason <hidden>
Date: 2012-05-08 23:03:29

Forgot this: http://pastebin.ubuntu.com/976915/
--
Hákon G.


On 8 May 2012 22:19, Hákon Gíslason [off-list ref] wrote:
Thank you for the reply, Neil.
I was using mdadm from the package manager in Debian stable first
(v3.1.4), but after the constant drive failures I upgraded to the
latest one (3.2.3).
I've come to the conclusion that the drives are failing either because
they are "green" drives, whose power-saving features might be causing
them to be "disconnected", or because the cables that came with the
motherboard aren't good enough. I'm not 100% sure about either, but at
the moment these seem the likely causes. It could also be incompatible
hardware or the kernel that I'm using (Proxmox Debian kernel:
2.6.32-11-pve).

I got the array assembled (thank you), but what about the raid5 to
raid6 conversion? Do I have to complete it for this to work, or will
mdadm know what to do? Can I cancel (revert) the conversion and get
the array back to raid5?

/proc/mdstat contains:

root@axiom:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active (read-only) raid6 sdc[6] sdb[5] sda[4] sdd[7]
     5860540224 blocks super 1.2 level 6, 32k chunk, algorithm 18 [5/3] [_UUU_]

unused devices: <none>
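
A note on reading the status line above: in `[5/3] [_UUU_]`, the first bracket gives the number of devices the array expects and the number currently active, and the second marks each slot as up (`U`) or missing (`_`). The following is a minimal Python sketch of that reading, not something from the original message; the function name `parse_md_status` is hypothetical and the input is the line quoted above.

```python
import re

def parse_md_status(line):
    """Return (expected, active, missing_slots) from an mdstat status line.

    Hypothetical helper for illustration; it only understands the
    "[n/m] [UU_]" suffix that md prints for redundant arrays.
    """
    m = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
    if m is None:
        raise ValueError("no [n/m] [U_] status found")
    expected, active = int(m.group(1)), int(m.group(2))
    flags = m.group(3)
    # Slot indices (0-based) that are marked missing with "_"
    missing = [i for i, c in enumerate(flags) if c == "_"]
    return expected, active, missing

# Status line copied from the /proc/mdstat output above
line = "5860540224 blocks super 1.2 level 6, 32k chunk, algorithm 18 [5/3] [_UUU_]"
status = parse_md_status(line)
print(status)
```

For this array the sketch reports 5 expected devices, 3 active, and slots 0 and 4 missing, which is the limit a raid6 array can tolerate.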

If I try to mount the volume group on the array the kernel panics, and
the system hangs. Is that related to the incomplete conversion?

Thanks,
--
Hákon G.



On 8 May 2012 20:48, NeilBrown [off-list ref] wrote:
On Mon, 30 Apr 2012 13:59:56 +0000 Hákon Gíslason [off-list ref] wrote:
Hello,
I've been having frequent drive "failures", as in, they are reported
failed/bad and mdadm sends me an email telling me things went wrong,
etc., but after a reboot or two, they are perfectly fine again. I'm
not sure what it is, but this server is quite new and I think there
might be more behind it, such as bad memory or the motherboard (I've
been having other issues as well). I've had 4 drive "failures" this
month, all different drives except for one, which "failed" twice, and
all have been fixed with a reboot or rebuild (all drives reported bad
by mdadm passed an extensive SMART test).
Due to this, I decided to convert my raid5 array to a raid6 array
while I find the root cause of the problem.

I started the conversion right after a drive failure & rebuild, but
when it had converted/reshaped approx. 4% (if I remember correctly; it
was going really slowly, ~7500 minutes to completion), it reported
another drive bad, and the conversion to raid6 stopped (it said
"rebuilding", but the speed was 0K/sec and the time left was a few
million minutes).
After that happened, I tried to stop the array and reboot the server,
as I had done previously to get the reportedly "bad" drive working
again, but it wouldn't stop the array or reboot, nor could I
unmount it; it just hung whenever I tried to do something with
/dev/md0. After trying to reboot a few times, I just killed the power
and re-started it. Admittedly this was probably not the best thing I
could have done at that point.

I have a backup of ca. 80% of the data on there; it's been a month since
the last complete backup (because I ran out of backup disk space).

So, the big question, can the array be activated, and can it complete
the conversion to raid6? And will I get my data back?
I hope the data can be rescued, and any help I can get would be much
appreciated!

I'm fairly new to raid in general, and have been using mdadm for about
a month now.
Here's some data:

root@axiom:~# mdadm --examine --scan
ARRAY /dev/md/0 metadata=1.2 UUID=cfedbfc1:feaee982:4e92ccf4:45e08ed1 name=axiom.is:0


root@axiom:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : inactive sdc[6] sde[7] sdb[5] sda[4]
      7814054240 blocks super 1.2

root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
mdadm: /dev/md0 is already in use.

root@axiom:~# mdadm --stop /dev/md0
mdadm: stopped /dev/md0

root@axiom:~# mdadm --assemble --scan --force --run /dev/md0
mdadm: Failed to restore critical section for reshape, sorry.
      Possibly you needed to specify the --backup-file

root@axiom:~# mdadm --assemble --scan --force --run /dev/md0 --backup-file=/root/mdadm-backup-file
mdadm: Failed to restore critical section for reshape, sorry.

What version of mdadm are you using?

I suggest getting a newer one (I'm about to release 3.2.4, but 3.2.3 should
be fine) and if just that doesn't help, add the "--invalid-backup" option.

However I very strongly suggest you try to resolve the problem which is
causing your drives to fail.  Until you resolve that it will keep happening
and having it happen repeatedly during the (slow) reshape process would not
be good.

Maybe plug the drives into another computer, or another controller, while
the reshape runs?

NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html