Re: Help recovering a RAID5, what seems to be a strange state
From: Roy Sigurd Karlsbakk <hidden>
Date: 2022-07-04 17:55:22
Have you tried a resync or repair of the RAID? I've written a bit about that here:
https://wiki.karlsbakk.net/index.php/Roy's_notes#Resync

I'd suggest 'repair', since that tends to fix things.

PS: If you don't have a backup, make one first. NEVER believe a RAID is a backup, please ;)

Kind regards,

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Hið góða skaltu í stein höggva, hið illa í snjó rita.

----- Original Message -----
From: "Von Fugal" <redacted>
To: "Linux Raid" <redacted>
Sent: Monday, 4 July, 2022 19:41:27
Subject: Re: Help recovering a RAID5, what seems to be a strange state

I did get the array to reassemble. It still seems strange to me that all devices were shown as removed and then listed again. Incremental adds always resulted in the bad state; what finally assembled the array was "mdadm -A --force /dev/md51", run with the array stopped and without any incremental adds. It's still doing recovery, but it looks good. I may follow up on this thread again if it goes south.

Cheers!

On Sun, Jul 3, 2022 at 3:57 PM Von Fugal [off-list ref] wrote:
TL;DR version: I initially restored partition tables with different end sectors, then started the raids, to ill effect. I then restored the correct partition tables, and things seemed OK but degraded, until they weren't. The current state is 3 devices with the same event count, but the raid is "dirty" and cannot start degraded *and* dirty. I know the array initially ran with sd[abd]4; I then added the "missing" sdc4, whereupon it did something strange while attempting to resync. sdc4 is now a "spare" but cannot be added after an attempted incremental run with the other 3. Either way, after trying to run the array, the table from 'mdadm -D' looks similar to this:

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       -       0        0        1      removed
       -       0        0        2      removed
       -       0        0        3      removed
       -       8       52        2      sync   /dev/sdd4
       -       8       36        -      spare   /dev/sdc4
       -       8       20        0      sync   /dev/sdb4
       -       8        4        1      sync   /dev/sda4

Long story version follows.

I have 4 drives partitioned into different raid types. Partition 4 is a raid5 across all 4 drives. For some reason my GPT partition tables were all wiped, and I suspect benchmarking with fio (though I only ever gave it an lvm volume to operate on). I boot systemrescuecd, and testdisk finds the original partitions, so I tell it to restore those. So far, so good. I start assembling some arrays; others don't work yet. lvm is starting to show contents it finds in the arrays assembled so far (this is still within systemrescuecd). Investigating the unassembled arrays, I see dmesg complaining that the array size changed. I find a suggestion to use "-U devicesize". I believe this was my first mistake. The arrays assemble, but lvm hangs indefinitely at this point.

I desperately search for any info I have on the partitions and arrays, and I find a spreadsheet on my laptop that contains meticulous partition detail. I find that some of the partition ends leave a gap before the next partition begins. Whatever. I fix the partition tables. This time, all the arrays assemble and lvm is happy!! YES.
However, each array has one missing partition member, and it's not the same disk on each. That's strange. Still, my server is running: I'm able to boot it normally and homeassistant is back up. I then re-add each missing partition to each array (I believe this was my second mistake). I go to bed while it reconstructs.

In the morning, the array it was reconstructing is back to pending, the raid5 array in question is inactive, and it's reconstructing something else. I remove each partition that I previously added to each array (although the array in question doesn't even let me do this). I stop the array in question and zero the superblock of the partition I wanted to remove. I zero the superblocks on each of the other removed partitions. I then re-add each partition to each array and let them resync. I now have 3 out of 5 fully operational, with one more resync in progress. But my array in question is still kinda hosed.

Here's where it's strange. Rather than explain everything, here's the status from the devices (mdadm -E) and the array (mdadm -D):
https://pastebin.com/Gyj8d7Z7

Note the table at the end of mdadm -D (end of the paste). It shows four devices "removed", then a gap, then 3 devices as 'sync'. If I incrementally add the drives, it shows a "normal" table, until I try --run; then it shows the odd table. If I incrementally add 3 drives (not the 'spare') and then run, it shows the pasted table. If I then try to add the fourth (spare), it says "ADD_NEW_DISK not supported" in dmesg. If I add 3 drives including the 'spare', the behavior is otherwise the same, but adding the fourth drive complains that it can only be added as a spare, and that I must use force-spare to add it (I suspect this would be my 3rd mistake if I did it). I think I can force run this array with sd[abd]4, but the normal commands give errors when trying to do so.
What's also strange is that devices sd[abd]4 all have the same event count, yet trying to start the array results in "cannot start dirty degraded array".

--
You keep up the good fight just as long as you feel you need to.
-- Ken Danagger
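
For anyone wanting to repeat the event-count comparison described in the original message, here is a minimal sketch. It assumes that 'mdadm -E' prints a line of the form "Events : N" for each member, and the helper names md_events and all_same are made up for illustration; they are not part of mdadm.

```shell
#!/bin/sh
# Hypothetical helpers (not part of mdadm) for comparing member event counts.

# Print the event counter found in `mdadm -E` output supplied on stdin.
# Assumption: the output contains a line of the form "Events : <number>".
md_events() {
    awk '$1 == "Events" && $2 == ":" { print $3; exit }'
}

# Exit 0 only if stdin is non-empty and every line on it is identical.
all_same() {
    awk 'NR == 1 { first = $0 }
         $0 != first { bad = 1 }
         END { exit (NR == 0 || bad) }'
}
```

Run as root, something like `for d in /dev/sd[abd]4; do mdadm -E "$d" | md_events; done | all_same && echo "event counts match"` would apply it to the three members named in the thread. Matching counts alone do not make the array startable: as reported above, it was still refused as dirty and degraded until the forced assembly.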