Re: Update to mdadm V3.2.5 => RAID starts to recover (reproducible)

From: Andreas Baer <hidden>
Date: 2013-08-29 09:55:09

On 8/26/13, NeilBrown [off-list ref] wrote:

On Thu, 22 Aug 2013 15:20:06 +0200 Andreas Baer [off-list ref] wrote:

Short description:
I've discovered a problem during re-assembly of a clean RAID. mdadm
throws one disk out because this disk apparently shows another disk as
failed. After assembly, RAID starts to recover on existing spare disk.

In detail:
1. RAID-6 (Superblock V0.90.00) created with mdadm V2.6.4 and with 7
active disks and 1 spare disk (disk size: 1 TB), fully synced and
clean.
2. RAID-6 stopped and re-assembled with mdadm V3.2.5, but during that
one disk is thrown out.

Manual assembly command for /dev/md0, relevant partitions are
/dev/sd[b-i]1:
# mdadm --assemble --scan -vvv
mdadm: looking for devices for /dev/md0
mdadm: no RAID superblock on /dev/sdi
mdadm: no RAID superblock on /dev/sdh
mdadm: no RAID superblock on /dev/sdg
mdadm: no RAID superblock on /dev/sdf
mdadm: no RAID superblock on /dev/sde
mdadm: no RAID superblock on /dev/sdd
mdadm: no RAID superblock on /dev/sdc
mdadm: no RAID superblock on /dev/sdb
mdadm: no RAID superblock on /dev/sda1
mdadm: no RAID superblock on /dev/sda
mdadm: /dev/sdi1 is identified as a member of /dev/md0, slot 7.
mdadm: /dev/sdh1 is identified as a member of /dev/md0, slot 6.
mdadm: /dev/sdg1 is identified as a member of /dev/md0, slot 5.
mdadm: /dev/sdf1 is identified as a member of /dev/md0, slot 4.
mdadm: /dev/sde1 is identified as a member of /dev/md0, slot 3.
mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdb1 is identified as a member of /dev/md0, slot 0.
mdadm: ignoring /dev/sdb1 as it reports /dev/sdi1 as failed
mdadm: no uptodate device for slot 0 of /dev/md0
mdadm: added /dev/sdd1 to /dev/md0 as 2
mdadm: added /dev/sde1 to /dev/md0 as 3
mdadm: added /dev/sdf1 to /dev/md0 as 4
mdadm: added /dev/sdg1 to /dev/md0 as 5
mdadm: added /dev/sdh1 to /dev/md0 as 6
mdadm: added /dev/sdi1 to /dev/md0 as 7
mdadm: added /dev/sdc1 to /dev/md0 as 1
mdadm: /dev/md0 has been started with 6 drives (out of 7) and 1 spare.

I finally made a test by modifying mdadm V3.2.5 sources to not write
any data to any superblock and to simply exit() somewhere in the
middle of assembly process to be able to reproduce this behavior
without any RAID re-creation/synchronization.
So using mdadm V2.6.4 /dev/md0 assembles without problems and if I
switch to mdadm V3.2.5 it shows the same messages as above.

The real problem:
Several machines are receiving a similar software update, so I need to
find a solution or a workaround for this problem. By the way, in
another test without a spare disk, no disk was thrown out when
switching from V2.6.4 to V3.2.5.

It would also be a great help if someone could explain the reason
behind the relevant code fragment for rejecting a device, e.g. why is
only the 'most_recent' device important?

/* If this device thinks that 'most_recent' has failed, then
  * we must reject this device.
  */
if (j != most_recent &&
    content->array.raid_disks > 0 &&
    devices[most_recent].i.disk.raid_disk >= 0 &&
    devmap[j * content->array.raid_disks +
           devices[most_recent].i.disk.raid_disk] == 0) {
    if (verbose > -1)
        fprintf(stderr, Name ": ignoring %s as it reports %s as failed\n",
            devices[j].devname, devices[most_recent].devname);
    best[i] = -1;
    continue;
}

I also attached some files showing some details about related
superblocks before and after assembly as well as about RAID status
itself.


Thanks for the thorough report.  I think this issue has been fixed in
3.3-rc1.
You can fix it for 3.2.5 by applying the following patch:

diff --git a/Assemble.c b/Assemble.c
index 227d66f..bc65c29 100644
--- a/Assemble.c
+++ b/Assemble.c

@@ -849,7 +849,8 @@ int Assemble(struct supertype *st, char *mddev,
 		devices[devcnt].i.disk.minor = minor(stb.st_rdev);
 		if (most_recent < devcnt) {
 			if (devices[devcnt].i.events
-			    > devices[most_recent].i.events)
+			    > devices[most_recent].i.events &&
+			    devices[devcnt].i.disk.state == 6)
 				most_recent = devcnt;
 		}
 		if (content->array.level == LEVEL_MULTIPATH)

The "most recent" device is important as we need to choose one to compare
all others against.  The problem is that the code in 3.2.5 can sometimes
choose a spare, which isn't such a good idea.
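
The constant 6 in the patch is easier to read as a pair of flag bits. The sketch below uses the bit positions from the kernel header linux/raid/md_p.h; the helper is_active_sync() is a hypothetical name for illustration and is not part of mdadm. It assumes the state field holds only these flags, which is the case for 0.90 superblocks.

```c
#include <stdbool.h>

/* Bit positions of the per-device state flags, as defined in the
 * Linux kernel header linux/raid/md_p.h. */
#define MD_DISK_FAULTY  0  /* device has failed */
#define MD_DISK_ACTIVE  1  /* device is an active member of the array */
#define MD_DISK_SYNC    2  /* device is in sync with the array */
#define MD_DISK_REMOVED 3  /* device has been removed from the array */

/* Hypothetical helper (not in mdadm): true only when exactly the
 * ACTIVE and SYNC bits are set, i.e. when state == 6. */
bool is_active_sync(int state)
{
    return state == ((1 << MD_DISK_ACTIVE) | (1 << MD_DISK_SYNC));
}
```

A device that is active but not yet in sync (state 2), or one with no flags set (state 0, which is how a spare typically appears), fails this test, so the patched condition skips it.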

The "most recent" is also important because when a collection of devices is
given to the kernel it will give priority to some information which is on
the last device passed in.  So we make sure that the last device given to
the kernel is the "most recent".

Please let me know if the patch fixes your problem.

NeilBrown

First of all, thanks for your very helpful 'most recent disk' explanation.

Sadly, the patch didn't fix my problem. The event counters really are
equal on all disks (including the spare), and the first disk checked
is the spare, so the code never has a reason to select another disk as
the 'most recent disk'.

I extended your patch a little to produce more output, and I also
wrote a solution of my own. It needs review, because I'm not sure
whether it can be done like that.
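
Why the patched condition never fires when all event counters are equal can be shown with a simplified model of the selection loop. This is only a sketch: struct dev_info and pick_most_recent() are hypothetical names, the real logic lives inside Assemble() in Assemble.c, and it is assumed here that most_recent starts at index 0, the first device scanned.

```c
#include <stddef.h>

/* Hypothetical, reduced view of the per-device data the loop looks at. */
struct dev_info {
    const char *name;           /* device name, for illustration only */
    unsigned long long events;  /* event counter from the superblock */
    int state;                  /* state flags; 6 means active and in sync */
};

/* Simplified model of the patched selection: a later device replaces
 * the current choice only if its event count is strictly greater AND
 * it is active and in sync. */
size_t pick_most_recent(const struct dev_info *devs, size_t n)
{
    size_t most_recent = 0;  /* assumption: first scanned device is the default */

    for (size_t devcnt = 0; devcnt < n; devcnt++) {
        if (most_recent < devcnt &&
            devs[devcnt].events > devs[most_recent].events &&
            devs[devcnt].state == 6)
            most_recent = devcnt;
    }
    return most_recent;
}
```

With a spare at index 0 and every device reporting the same event count, the strict comparison is never true, so the function returns 0 and the spare remains the 'most recent' device regardless of the added state check.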

Patch 1: Your solution with more output
Diff: mdadm-3.2.5-noassemble-patch1.diff
Assembly: mdadm-3.2.5-noassemble-patch1.txt

Patch 2: My proposed solution
Diff: mdadm-3.2.5-noassemble-patch2.diff
Assembly: mdadm-3.2.5-noassemble-patch2.txt

Attachments

mdadm-3.2.5-noassemble-patch1.txt [text/plain] 2049 bytes
mdadm-3.2.5-noassemble-patch1.diff [application/octet-stream] 2649 bytes
mdadm-3.2.5-noassemble-patch2.diff [application/octet-stream] 3044 bytes
mdadm-3.2.5-noassemble-patch2.txt [text/plain] 2023 bytes
