Re: Help needed: Recovering a failed RAID-6 array

From: Phil Turmel <hidden>
Date: 2017-12-14 19:10:11

Hi Ryszard,

On 12/14/2017 01:30 PM, Ryszard Harasimowicz wrote:

> A friend of mine has got a serious problem after replacing a failed disk
> in a RAID-6 array.

[trim /]

> When the Raid Device number 9 has failed the system was shutdown and the
> drive was replaced.

The event counts are surprising, considering the short time between the
first failure and the other two devices dropping out.  Those two *think*
they are OK, and they both show the other still running.  So a common
cause took them out.

The event counts might mean the OMV kit is trying to assemble this over
and over again.  You'll have to disable that.

> Then the system was started - but the array did not rebuild (as was
> expected). It showed up as FAILED with 3 drives marked as "removed".
>
> The current state is:

[trim /]

> What would be the safest strategy to try to recover data from this
> array? Is it still possible?

First, stop the array:

mdadm --stop /dev/md127

Then, assemble the array with --force to get past the bad event counts:

mdadm -Afv /dev/mdX /dev/sd[abcdghijklmnop]

If that succeeds, run fsck on the filesystem(s) and then back up any
irreplaceable files.  If it fails, paste the output here.
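
Before forcing the assembly, it can help to record each member's event
count so that the size of the mismatch is known.  The following is only a
sketch: the function name event_counts is made up for illustration, and
it assumes that "mdadm --examine" prints an "Events : <number>" line for
each member superblock.

```shell
# Hypothetical helper: print "<device> <event count>" for each member.
# Assumes `mdadm --examine` emits a line of the form "Events : <number>".
# On a real system this needs root.
event_counts() {
    for dev in "$@"; do
        # Pull the value after " : " from the Events line, if there is one.
        events=$(mdadm --examine "$dev" 2>/dev/null |
                 awk -F' : ' '/^ *Events/ {print $2}')
        # Fall back to "unknown" when no superblock or Events line is found.
        printf '%s %s\n' "$dev" "${events:-unknown}"
    done
}

# Example, using the device names from the assemble command above:
# event_counts /dev/sd[abcdghijklmnop]
```

Members whose counts lag far behind the rest are the ones most likely to
hold stale data after a forced assembly.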

> I send attached the status report for all the drives in the array
> (except for the replaced one).

It would be good to know *why* this happened.  Consider supplying
"smartctl -iA -l scterc" reports.  I suspect your distro's boot time
limits are too short, or some device didn't get recognized in the initramfs.
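
One way to gather those reports for every member in a single pass is
sketched below.  The function name smart_reports is made up for
illustration, and the device list is assumed to match the one used for
assembly.

```shell
# Hypothetical helper: run the smartctl report for each device given,
# with a header line so the combined output can be pasted into a reply.
smart_reports() {
    for dev in "$@"; do
        printf '=== %s ===\n' "$dev"
        # -i: identity info, -A: vendor attributes, -l scterc: error recovery control
        smartctl -iA -l scterc "$dev"
    done
}

# Example (needs root):
# smart_reports /dev/sd[abcdghijklmnop] > smart-reports.txt
```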

The output of lsdrv[1] would help identify odd circumstances.

Phil

[1] https://github.com/pturmel/lsdrv
