Re: Help recovering a RAID5, what seems to be a strange state

From: Roy Sigurd Karlsbakk <hidden>
Date: 2022-07-04 17:55:22

Have you tried a resync or repair of the raid? I've written a bit about that here:

https://wiki.karlsbakk.net/index.php/Roy's_notes#Resync

I'd suggest 'repair', since unlike 'check' it rewrites the inconsistencies it finds instead of only counting them.
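
For reference, on Linux md a scrub is normally started by writing to the array's sync_action file in sysfs. The following is only a minimal sketch of that, not necessarily the procedure described on the wiki page. It assumes the array is md51 (the name used further down in this thread), and the function names md_scrub and md_scrub_status are made up for illustration. It needs root to do anything on a real array.

```shell
# Assumption: the array is /dev/md51, the name used elsewhere in this thread.
# MD_SYSFS can be overridden to point at another array's md directory.
MD_SYSFS="${MD_SYSFS:-/sys/block/md51/md}"

# Ask md to start a scrub. "check" only counts mismatches, "repair" also
# rewrites them, and "idle" stops a running scrub.
md_scrub() {
    case "$1" in
        check|repair|idle) printf '%s\n' "$1" > "$MD_SYSFS/sync_action" ;;
        *) echo "usage: md_scrub check|repair|idle" >&2; return 2 ;;
    esac
}

# Report what md is doing right now and the mismatch count of the last scrub.
md_scrub_status() {
    printf 'action=%s mismatches=%s\n' \
        "$(cat "$MD_SYSFS/sync_action")" "$(cat "$MD_SYSFS/mismatch_cnt")"
}
```

Typical use would be "md_scrub repair" followed by "md_scrub_status" (or "cat /proc/mdstat") to watch progress.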

PS: If you don't have a backup, make one first. NEVER believe a raid is a backup, please ;)

Kind regards

roy
-- 
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
The good you shall carve in stone, the bad write in snow.

----- Original Message -----

From: "Von Fugal" <redacted>
To: "Linux Raid" <redacted>
Sent: Monday, 4 July, 2022 19:41:27
Subject: Re: Help recovering a RAID5, what seems to be a strange state

I did get the array to reassemble. It still seems strange to me that
all devices show as removed and are then listed again. Incremental adds
always resulted in the bad state; what finally assembled the array was
"mdadm -A --force /dev/md51", run with the array stopped and
without any incremental adds.
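
Spelled out as a sequence, that looks roughly like the sketch below. The function name force_assemble and the MDADM override are made up for illustration. As in the command above, no member devices are listed, so this assumes /dev/md51 is described in mdadm.conf.

```shell
# MDADM can be overridden (e.g. MDADM=echo) to preview the commands
# instead of running them.
MDADM="${MDADM:-mdadm}"

# Stop the array, then force-assemble it in one step, with no
# "mdadm --incremental" adds in between. If the stop fails (for example
# because the array is already stopped), nothing else is attempted;
# in that case run the --assemble line on its own.
force_assemble() {
    md="$1"
    "$MDADM" --stop "$md" &&
    "$MDADM" --assemble --force "$md"
}
```

Running "MDADM=echo" and then "force_assemble /dev/md51" prints the two commands without touching the array. After a real run, "cat /proc/mdstat" shows the recovery progress.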

It's still doing recovery but it looks good. I may follow up on this
thread again if it goes south.

Cheers!

On Sun, Jul 3, 2022 at 3:57 PM Von Fugal [off-list ref] wrote:


TL;DR version:
I restored partition tables with different end sectors initially.
Started raids to ill effect. Restored correct partition tables and
things seemed OK but degraded until they weren't.

Current state is 3 devices with the same event numbers, but the raid
is "dirty" and cannot start degraded *and* dirty. I know the array
initially ran with sd[abd]4 and I added the "missing" sdc4, whereupon it
did something strange while attempting to resync.

sdc4 is now a "spare" but cannot be added after an attempted
incremental run with the other 3. Either way, after trying to run the
array, the table from 'mdadm -D' looks similar to this:

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       -       0        0        1      removed
       -       0        0        2      removed
       -       0        0        3      removed

       -       8       52        2      sync   /dev/sdd4
       -       8       36        -      spare   /dev/sdc4
       -       8       20        0      sync   /dev/sdb4
       -       8        4        1      sync   /dev/sda4
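
A table like this can be reduced to just the real member lines with a little awk, which makes the odd state easier to compare between attempts. This is only a sketch, and the function name md_members is made up for illustration. It keys on rows with at least six columns whose last column is a /dev path, so the header and the "removed" placeholder rows are dropped.

```shell
# Print "device state" for every member row of an `mdadm -D` device table
# read from stdin. Rows without a /dev path in the last column (the header
# and the "removed" placeholders) are skipped, and NF >= 6 keeps the
# one-column "/dev/mdNN:" heading line out.
md_members() {
    awk 'NF >= 6 && $NF ~ /^\/dev\// {
        state = $5
        # States can be more than one word (e.g. "active sync").
        for (i = 6; i < NF; i++) state = state " " $i
        print $NF, state
    }'
}
```

Usage would be "mdadm -D /dev/md51 | md_members".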

Long story version follows

I have 4 drives partitioned into different raid types. Partition 4 is
a raid5 across all 4 drives. For some reason my gpt partition tables
were all wiped, and I suspect benchmarking with fio is to blame (though
I only ever gave it an lvm volume to operate on). I boot systemrescuecd and
testdisk finds the original partitions so I tell it to restore those.
So far seems good. I start assembling some arrays, others don't work
yet. lvm is starting to show contents it finds in the so far assembled
arrays (this is still within systemrescuecd).

Investigating the unassembled arrays, dmesg is complaining that the
array size changed. I find a suggestion to use "-U devicesize". I
believe this was my first mistake. The arrays assemble but lvm hangs
indefinitely at this point.

I desperately search for any info I have on the partitions and arrays
and I find a spreadsheet on my laptop that contains meticulous
partition detail. I find that some of the partition ends leave a gap
before the next partition begins. Whatever. I fix the partition
tables. This time, all the arrays assemble and lvm is happy!! YES.

However each array has one missing partition member and it's not the
same disk on each. That's strange. However my server is running. I'm
able to boot it normally and homeassistant is back up. I then re-add
each missing partition to each array (I believe this was my second
mistake). I go to bed while it reconstructs.

In the morning, the array it was reconstructing is back into pending,
the raid5 array in question is inactive, and it's reconstructing
something else. I remove each partition that I previously added to
each array (although the array in question doesn't even let me do
this). I stop the array in question and zero the superblock of the
partition I wanted to remove. I zero the superblocks on each other
partition removed. I then re-add each partition to each array and let
them resync. I now have 3 out of 5 fully operational, one more resync
in progress.
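
For the arrays that did allow it, the remove / zero / re-add cycle corresponds roughly to the commands sketched below. The function name readd_clean and the MDADM override are made up for illustration, and the --fail step is an assumption (it is needed when the member is still active in the array). This wipes the md metadata on the named partition, so it is only appropriate when that member's contents are to be rebuilt from the others.

```shell
# MDADM can be overridden (e.g. MDADM=echo) to preview the commands.
MDADM="${MDADM:-mdadm}"

# Drop a member from an array, wipe its md superblock, and add it back
# so md treats it as a new device and rebuilds onto it.
# DESTRUCTIVE for the md metadata on the named partition.
readd_clean() {
    md="$1"
    part="$2"
    "$MDADM" "$md" --fail "$part" --remove "$part" &&
    "$MDADM" --zero-superblock "$part" &&
    "$MDADM" "$md" --add "$part"
}
```

With "MDADM=echo", a call such as "readd_clean /dev/md51 /dev/sdc4" only prints the three commands.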

But my array in question is still kinda hosed. Here's where it's
strange. Rather than explain everything, here's the status from the
devices (mdadm -E) and the array (mdadm -D).
https://pastebin.com/Gyj8d7Z7

Note the table at the end of mdadm -D (end of the paste). It shows
four devices "removed", then a gap, then 3 devices as 'sync'. If I
incrementally add the drives it shows a "normal" table, until I try
--run; then it shows the odd table. If I incrementally add 3 drives
(not the 'spare') and then run, it shows the pasted table. If I try to add
the fourth (spare) it says "ADD_NEW_DISK not supported" in dmesg. If I
add 3 drives including the 'spare' the behavior is otherwise the same,
but adding the fourth drive complains that it can only be added as a
spare, and that I must use force-spare to add it (I suspect this would be
my 3rd mistake if I did it).

I think I can force run this array with sd[abd]4 but the normal
commands give errors when trying to do so. What's also strange is that
devices sd[abd]4 all have the same event count, yet trying to start
the array results in "cannot start dirty degraded array".
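
The event-count comparison can be done mechanically from the mdadm -E output. The sketch below assumes the usual layout, where each device section starts with a "/dev/...:" line and contains an "Events : N" line. The function names md_events and md_events_agree are made up for illustration.

```shell
# Print "device events" pairs from `mdadm -E` output read on stdin.
md_events() {
    awk '/^\/dev\// { dev = $1; sub(/:$/, "", dev) }
         $1 == "Events" { print dev, $NF }'
}

# Succeed only if every device in the `mdadm -E` output on stdin reports
# the same event count.
md_events_agree() {
    md_events | awk '{ seen[$2] = 1 }
        END { n = 0; for (k in seen) n++; exit (n == 1 ? 0 : 1) }'
}
```

Usage would be "mdadm -E /dev/sd[abd]4 | md_events". Matching counts only show that the members agree with each other; they say nothing about whether the array was shut down cleanly, which is what the "dirty" part of the error refers to.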



--
You keep up the good fight just as long as you feel you need to.
-- Ken Danagger
