RE: can i recover an all spare raid10 array ?

From: Roland RoLaNd <hidden>
Date: 2014-10-28 20:17:44


----------------------------------------

Date: Tue, 28 Oct 2014 20:02:39 +0000
From: robin@robinhill.me.uk
To: r_o_l_a_n_d@hotmail.com
CC: linux-raid@vger.kernel.org
Subject: Re: can i recover an all spare raid10 array ?

On Tue Oct 28, 2014 at 09:11:21PM +0200, Roland RoLaNd wrote:

quoted


----------------------------------------

quoted

Date: Tue, 28 Oct 2014 18:34:22 +0000
From: robin@robinhill.me.uk
To: r_o_l_a_n_d@hotmail.com
CC: robin@robinhill.me.uk; linux-raid@vger.kernel.org
Subject: Re: can i recover an all spare raid10 array ?

Please don't top post, it makes conversations very difficult to follow.
Responses should go at the bottom, or interleaved with the previous post
if responding to particular points. I've moved your previous responses
to keep the conversation flow straight.

On Tue Oct 28, 2014 at 07:30:50PM +0200, Roland RoLaNd wrote:

quoted

From: r_o_l_a_n_d@hotmail.com
To: robin@robinhill.me.uk
CC: linux-raid@vger.kernel.org
Subject: Re: can i recover an all spare raid10 array ?
Date: Tue, 28 Oct 2014 19:29:25 +0200

quoted

Date: Tue, 28 Oct 2014 17:01:11 +0000
From: robin@robinhill.me.uk
To: r_o_l_a_n_d@hotmail.com
CC: linux-raid@vger.kernel.org
Subject: Re: can i recover an all spare raid10 array ?

On Tue Oct 28, 2014 at 06:22:11PM +0200, Roland RoLaNd wrote:

quoted

I have two raid arrays on my system:
raid1: /dev/sdd1 /dev/sdh1
raid10: /dev/sde1 /dev/sda1 /dev/sdf1 /dev/sdb1 /dev/sdc1 /dev/sdg1


two disks had bad sectors: sdd and sdf <<-- they both got hot swapped.
i added sdf back to raid10 and recovery took place, but adding sdd1 to
raid1 proved to be troublesome.
as i didn't have anything important on '/', i formatted and installed
ubuntu 14 on raid1.

now system is up on raid 1, but raid10 (md127) is inactive

cat /proc/mdstat

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : inactive sde1[2](S) sdg1[8](S) sdc1[6](S) sdb1[5](S) sdf1[4](S) sda1[3](S)
17580804096 blocks super 1.2

md2 : active raid1 sdh4[0] sdd4[1]
2921839424 blocks super 1.2 [2/2] [UU]
[==>..................] resync = 10.4% (304322368/2921839424) finish=672.5min speed=64861K/sec

md1 : active raid1 sdh3[0] sdd3[1]
7996352 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sdh2[0] sdd2[1]
292544 blocks super 1.2 [2/2] [UU]

unused devices: <none>

if i try to assemble md127:


mdadm --assemble /dev/md127 /dev/sde1 /dev/sda1 /dev/sdf1 /dev/sdb1 /dev/sdc1 /dev/sdg1
mdadm: /dev/sde1 is busy - skipping
mdadm: /dev/sda1 is busy - skipping
mdadm: /dev/sdf1 is busy - skipping
mdadm: /dev/sdb1 is busy - skipping
mdadm: /dev/sdc1 is busy - skipping
mdadm: /dev/sdg1 is busy - skipping


if i try to add one of the disks: mdadm --add /dev/md127 /dev/sdj1
mdadm: cannot get array info for /dev/md127

if i try:

mdadm --stop /dev/md127
mdadm: stopped /dev/md127

then running: mdadm --assemble /dev/md127 /dev/sde1 /dev/sda1 /dev/sdf1 /dev/sdb1 /dev/sdc1 /dev/sdg1

returns:

assembled from 5 drives and 1 rebuilding - not enough to start the array

what does it mean ? is my data lost ?

if i examine one of the md127 raid 10 array disks it shows this:

mdadm --examine /dev/sde1
/dev/sde1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : ab90d4c8:41a55e14:635025cc:28f0ee76
Name : ubuntu:data (local to host ubuntu)
Creation Time : Sat May 10 21:54:56 2014
Raid Level : raid10
Raid Devices : 8

Avail Dev Size : 5860268032 (2794.39 GiB 3000.46 GB)
Array Size : 11720534016 (11177.57 GiB 12001.83 GB)
Used Dev Size : 5860267008 (2794.39 GiB 3000.46 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
State : clean
Device UUID : a2a5db61:bd79f0ae:99d97f17:21c4a619

Update Time : Tue Oct 28 10:07:18 2014
Checksum : 409deeb4 - correct
Events : 8655

Layout : near=2
Chunk Size : 512K

Device Role : Active device 2
Array State : AAAAAAAA ('A' == active, '.' == missing)

Used Dev Size : 5860267008 (2794.39 GiB 3000.46 GB) <<--- does this mean i still have my data ?


the remaining two disks:

mdadm --examine /dev/sdj1
mdadm: No md superblock detected on /dev/sdj1.
mdadm --examine /dev/sdi1
mdadm: No md superblock detected on /dev/sdi1.

The --examine output indicates the RAID10 array was 8 members, not 6.
As it stands, you are missing two array members (presumably a mirrored
pair as mdadm won't start the array). Without these you're missing 512K
of every 2M in the array, so your data is toast (well, with a lot of
effort you may recover some files under 1.5M in size).
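
The 512K-in-2M figure follows from the --examine values above: 8 devices
with near=2 give 4 mirrored pairs, and one 512K chunk per pair makes a 2M
stripe. A small sketch of that arithmetic, assuming the usual near=2 layout
on an even number of devices (the variable names are only for illustration):

```shell
#!/bin/sh
# Values copied from the --examine output of /dev/sde1.
raid_devices=8
near_copies=2                  # "Layout : near=2"
chunk_kib=512                  # "Chunk Size : 512K"
used_dev_sectors=5860267008    # "Used Dev Size", in 512-byte sectors

pairs=$((raid_devices / near_copies))        # mirrored pairs in the array
stripe_kib=$((pairs * chunk_kib))            # data in one full stripe
intact_kib=$((stripe_kib - chunk_kib))       # longest run left if one pair is gone
array_kib=$((pairs * used_dev_sectors / 2))  # sectors -> KiB

echo "mirror pairs:     $pairs"
echo "stripe width:     ${stripe_kib}K"
echo "lost per stripe:  ${chunk_kib}K"
echo "intact run:       ${intact_kib}K"
echo "array size (KiB): $array_kib"
```

The computed array size should agree with the "Array Size" line reported by
--examine, which is a reasonable sanity check that the array really was
built from four pairs.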

Were you expecting sdi1 and sdj1 to have been part of the original
RAID10 array? Have you removed the superblocks from them at any point?
For completeness, what mdadm and kernel versions are you running?

Cheers,
Robin

Thanks for pitching in. Here are the responses to your questions:

- yes i expected both of them to be part of the array, though one of
them was just added to the array and didn't finish recovering when
raid1 "/" crashed

According to your --examine earlier, the RAID10 rebuild had completed
(it shows the array clean and having all disks active). Are you certain
that the new RAID1 array isn't using disks that used to be part of the
RAID10 array? Regardless, I'd expect the disks to have a superblock if
they were part of either array (unless they've been repartitioned?).

the earlier --examine was of one of the 6 disks that belong to the currently inactive array; they're all clean.
as for the raid1/raid10 overlap, that's what i thought too, as it has happened to me before, but lsblk shows the following:

NAME      MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda       8:0      0   2.7T  0 disk
└─sda1    8:1      0   2.7T  0 part
sdb       8:16     0   2.7T  0 disk
└─sdb1    8:17     0   2.7T  0 part
sdc       8:32     0   2.7T  0 disk
└─sdc1    8:33     0   2.7T  0 part
sdd       8:48     0   2.7T  0 disk
├─sdd1    8:49     0     1M  0 part
├─sdd2    8:50     0   286M  0 part
│ └─md0   9:0      0 285.7M  0 raid1 /boot
├─sdd3    8:51     0   7.6G  0 part
│ └─md1   9:1      0   7.6G  0 raid1 [SWAP]
└─sdd4    8:52     0   2.7T  0 part
  └─md2   9:2      0   2.7T  0 raid1 /
sde       8:64     0   2.7T  0 disk
└─sde1    8:65     0   2.7T  0 part
sdf       8:80     0   2.7T  0 disk
└─sdf1    8:81     0   2.7T  0 part
sdg       8:96     0   2.7T  0 disk
└─sdg1    8:97     0   2.7T  0 part
sdh       8:112    0   2.7T  0 disk
├─sdh1    8:113    0     1M  0 part
├─sdh2    8:114    0   286M  0 part
│ └─md0   9:0      0 285.7M  0 raid1 /boot
├─sdh3    8:115    0   7.6G  0 part
│ └─md1   9:1      0   7.6G  0 raid1 [SWAP]
└─sdh4    8:116    0   2.7T  0 part
  └─md2   9:2      0   2.7T  0 raid1 /
sdi       8:128    0   2.7T  0 disk
└─sdi1    8:129    0   2.7T  0 part
sdj       8:144    0   2.7T  0 disk
└─sdj1    8:145    0   2.7T  0 part

quoted

- i have not removed their superblocks, or at least not in a way that i
am aware of

- mdadm: 3.2.5-5ubuntu4.1
- uname -a: 3.13.0-24-generic

That's a pretty old mdadm version, but I don't see anything in the
change logs that looks relevant. Others may be more familiar with issues
though.

that's the latest in my current ubuntu repository

quoted

PS:
I just followed this recovery page:
https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID
I managed to reach the last step, but whenever i tried to mount the array
it kept asking me for the right file system

That's good documentation anyway. As long as you stick to the overlay
devices your original data is untouched. It's amazing how many people
run --create on their original disks and lose any chance of getting the
data back.
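
For anyone following along, the overlay approach on that wiki page boils
down to one device-mapper snapshot per member, with writes landing in a
sparse file instead of on the disk. A rough dry-run sketch that only prints
the commands; the overlay file name and its 4G size are assumptions, not
values from this thread:

```shell
#!/bin/sh
# Dry run: print the commands that would put a copy-on-write overlay on top
# of one partition. Nothing here touches a disk.
overlay_cmds() {
    dev=$1                     # member partition, e.g. /dev/sde1
    name=$(basename "$dev")    # device-mapper name, e.g. sde1
    cow="overlay-$name"        # hypothetical sparse file that absorbs writes
    echo "truncate -s 4G $cow"
    echo "loop=\$(losetup -f --show $cow)"
    echo "size=\$(blockdev --getsz $dev)"
    echo "dmsetup create $name --table \"0 \$size snapshot $dev \$loop P 8\""
}

for d in /dev/sde1 /dev/sda1 /dev/sdf1 /dev/sdb1 /dev/sdc1 /dev/sdg1; do
    overlay_cmds "$d"
done
```

All later experiments (--assemble --force, --create --assume-clean, fsck)
would then be pointed at the /dev/mapper devices rather than at the
original partitions.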

unfortunately i used to be/am one of those people.
had bad experiences with this before, so i took it slow and went with
the overlay documentation.
all the ebooks i could find about raid talk about the differences between
the raid levels, but none are thorough when it comes to setting
up/troubleshooting raid.
and once i do fix my issue, i move on to the next firefighting
situation so i lose interest due to lack of time.

quoted

Correction: i couldn't force assemble the read devices, so i issued this instead:
mdadm --create /dev/md089 --assume-clean --level=10 --verbose --raid-devices=8 missing /dev/dm-1 /dev/dm-0 /dev/dm-5 /dev/dm-3 /dev/dm-2 missing /dev/dm-4
which got it into degraded state

What error did you get when you tried to force assemble (both from mdadm
and anything reported via dmesg)? The device order you're using would
suggest that the missing disks wouldn't be mirrors of each other, so the
data should be okay.

mdadm --assemble --force /dev/md100 $OVERLAYS
mdadm: /dev/md100 assembled from 5 drives and 1 rebuilding - not enough to start the array.

That's very odd - all the --examine results for the original disks show
the array as clean. That would suggest an issue with the installed
version of mdadm but it doesn't really matter in this case - see below.

when i got ubuntu 14 installed, i issued apt-get update && apt-get upgrade -y
would that have affected anything ?

quoted

dmesg:
[ 6025.573964] md: md100 stopped.
[ 6025.595810] md: bind<dm-0>
[ 6025.596086] md: bind<dm-5>
[ 6025.596364] md: bind<dm-2>
[ 6025.596612] md: bind<dm-1>
[ 6025.596840] md: bind<dm-4>
[ 6025.597026] md: bind<dm-3>

quoted

Can you post the --examine results for all the RAID members? Both for
the original partitions and for the overlay devices after you recreated
the array. There may be differences in data offset, etc. which will
break the filesystem.

Original partitions:
http://pastebin.com/nHCxidvE

overlay:
http://pastebin.com/eva4cnu6

Right - these show you have the wrong order. The original partition
array device roles are:
sde1: 2
sda1: 3
sdf1: 4
sdb1: 5
sdc1: 6
sdg1: 7

and your overlays are:
dm-1: 1
dm-0: 2
dm-5: 3
dm-3: 4
dm-2: 5
dm-4: 7

So the bad news is that you're missing roles 0 & 1, which will be
mirrors of each other. That means your array is broken unless the other
member disks can be found.
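
To make the mirror reasoning concrete: with near=2 on an even number of
devices, roles 2k and 2k+1 hold the same chunks, so the array only loses
data when both roles of a pair are absent. A small sketch of that check
(lost_pairs is a made-up helper, not an mdadm feature):

```shell
#!/bin/sh
# lost_pairs N ROLE...: N is the number of raid devices, followed by the
# roles that are present. Prints every pair (2k, 2k+1) with both halves
# missing. Assumes a near=2 layout with an even device count.
lost_pairs() {
    n=$1; shift
    present=" $* "
    k=0
    while [ "$k" -lt "$n" ]; do
        a=$k; b=$((k + 1))
        case "$present" in *" $a "*) have_a=1 ;; *) have_a=0 ;; esac
        case "$present" in *" $b "*) have_b=1 ;; *) have_b=0 ;; esac
        if [ "$have_a" -eq 0 ] && [ "$have_b" -eq 0 ]; then
            echo "pair $a,$b lost"
        fi
        k=$((k + 2))
    done
}

# Slots filled by the earlier --create (0 and 6 were given as "missing"):
lost_pairs 8 1 2 3 4 5 7
# Roles recorded in the original superblocks (2 through 7):
lost_pairs 8 2 3 4 5 6 7
```

The first call reports nothing, which is why the recreated array started
degraded; the second reports the 0,1 pair, which is why the original
members alone cannot start.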

am i mistaken to think that the order of disks in an array can be known from the "Device Role : Active device Z" line in mdadm --examine /dev/sdXN ?
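
The role lists earlier in this message come from exactly that line. A
minimal sketch for pulling it out of --examine output; role_of is a
hypothetical helper name, and the sample line is the one from the sde1
output earlier in the thread:

```shell
#!/bin/sh
# role_of: read `mdadm --examine` output on stdin and print the last word
# of the "Device Role" line (the slot number for an active device).
role_of() {
    awk -F: '/Device Role/ { n = split($2, w, " "); print w[n] }'
}

# Sample line from the sde1 output:
printf '%s\n' 'Device Role : Active device 2' | role_of

# Against real devices (needs root), something like:
#   for d in /dev/sd[abcefg]1; do
#       printf '%s: %s\n' "$d" "$(mdadm --examine "$d" | role_of)"
#   done
```

Sorting the members by that number gives the device order to pass to
--create, with "missing" in any slot that no device claims.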

If you're certain that sdi1 and sdj1 should be in the array then you can
try recreating the array (in the correct order) and using sdi1/sdj1 in
the missing slots and see if one option works. I'll assume the overlay
mapping is as follows (if not, remap as required):
sda1 -> dm-0
sdb1 -> dm-1
sdc1 -> dm-2
sde1 -> dm-3
sdf1 -> dm-4
sdg1 -> dm-5
sdi1 -> dm-6
sdj1 -> dm-7

For each of the following orders, you're going to need to:
- stop the existing array (mdadm -S /dev/md089)
- create a new array using --assume-clean
- check for a valid filesystem (fsck -n /dev/md089)

If the fsck returns without errors then try mounting the filesystem and
see if all looks okay, otherwise move on to the next order.

The orders to try are:
- /dev/dm-6 missing /dev/dm-3 /dev/dm-0 /dev/dm-4 /dev/dm-1 /dev/dm-2 /dev/dm-5
- missing /dev/dm-6 /dev/dm-3 /dev/dm-0 /dev/dm-4 /dev/dm-1 /dev/dm-2 /dev/dm-5
- /dev/dm-7 missing /dev/dm-3 /dev/dm-0 /dev/dm-4 /dev/dm-1 /dev/dm-2 /dev/dm-5
- missing /dev/dm-7 /dev/dm-3 /dev/dm-0 /dev/dm-4 /dev/dm-1 /dev/dm-2 /dev/dm-5
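
The four attempts can be written out as a dry run that only prints the
commands, so each one can be reviewed before being pasted. This is a sketch
under the overlay mapping assumed above; the --create flags are the ones
from the earlier command in this thread, and print_attempt is just an
illustrative helper:

```shell
#!/bin/sh
# Dry run: print, but do not execute, the stop/create/fsck sequence for each
# candidate order. Only overlay (dm-*) devices appear here.
MD=/dev/md089
REST="/dev/dm-3 /dev/dm-0 /dev/dm-4 /dev/dm-1 /dev/dm-2 /dev/dm-5"   # roles 2..7

print_attempt() {
    # $1 and $2 fill slots 0 and 1.
    echo "mdadm -S $MD"
    echo "mdadm --create $MD --assume-clean --level=10 --verbose --raid-devices=8 $1 $2 $REST"
    echo "fsck -n $MD"
    echo
}

for cand in /dev/dm-6 /dev/dm-7; do
    print_attempt "$cand" missing
    print_attempt missing "$cand"
done
```

After each --create it is worth comparing the Data Offset and Chunk Size in
--examine of the new array against the originals (262144 sectors and 512K)
before reading anything into the fsck result.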

Thank you for all the help. appreciate it

Good luck,
Robin
--
___
( ' } | Robin Hill [off-list ref] |
/ / ) | Little Jim says .... |
// !! | "He fallen in de water !!" |

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
