Re: [GIT PATCH 0/2] external-metadata recovery checkpointing for 2.6.33

From: Dan Williams <hidden>
Date: 2009-12-15 00:37:58
Subsystem: software raid (multiple disks) support, the rest · Maintainers: Song Liu, Yu Kuai, Linus Torvalds

On Sun, 2009-12-13 at 21:07 -0700, Neil Brown wrote:

+static ssize_t recovery_start_store(mdk_rdev_t *rdev, const char *buf, size_t len)
+{
+	unsigned long long recovery_start;
+
+	if (cmd_match(buf, "none"))
+		recovery_start = MaxSector;
+	else if (strict_strtoull(buf, 10, &recovery_start))
+		return -EINVAL;
+
+	if (rdev->mddev->pers &&
+	    rdev->raid_disk >= 0)
+		return -EBUSY;

Ok, I had a chance to test this out and have a question about how you
envisioned mdmon handling this restriction, which is a bit tighter than
what I had before.  The prior version allowed updates as long as the
array was read-only.  This version forces recovery_start to be written
at sysfs_add_disk() time (before 'slot' is written).  The conceptual
problem I ran into was a race between ->activate_spare() determining the
last valid checkpoint and the monitor thread starting up the array:

->activate_spare(): read recovery checkpoint
( array becomes read/write )
( array becomes dirty, checkpoint invalidated )
sysfs_add_disk(): write invalid recovery checkpoint
( recovery starts from the wrong location )

The scheme I came up with was to not touch recovery_start in the manager
thread and let the monitor thread have the last word on the recovery
checkpoint.  It would only write to md/rdX/recovery_start at the initial
readonly->active transition; otherwise recovery starts from the default
of 0.  Is the patch below off base?
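
To make the proposed policy concrete, here is a minimal userspace sketch
of the decision the monitor thread would make.  It is not mdmon code:
the enum, the already_activated flag, and both function names are
hypothetical, and the only thing taken from the text above is the rule
"write the checkpoint once, at the initial readonly->active transition".

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical array states as a monitor thread might track them. */
enum array_state { ARRAY_READONLY, ARRAY_ACTIVE };

/*
 * Return true only for the first readonly->active transition.  In every
 * other case the caller leaves md/rdX/recovery_start alone, so recovery
 * starts from the default of 0.
 */
static bool should_write_checkpoint(enum array_state prev,
				    enum array_state next,
				    bool already_activated)
{
	return prev == ARRAY_READONLY &&
	       next == ARRAY_ACTIVE &&
	       !already_activated;
}

/*
 * Format a sector count the way a decimal sysfs attribute is usually
 * written.  Returns the number of characters produced (assumption: the
 * buffer is large enough).
 */
static int format_recovery_start(char *buf, size_t len,
				 unsigned long long sectors)
{
	return snprintf(buf, len, "%llu\n", sectors);
}
```

Keeping the decision in a single predicate means the manager thread has
no reason to touch recovery_start at all, which is what closes the race
shown above.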

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1cc5f2d..bd24e20 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c

@@ -2467,7 +2467,8 @@ static ssize_t recovery_start_store(mdk_rdev_t *rdev, const char *buf, size_t le
 	else if (strict_strtoull(buf, 10, &recovery_start))
 		return -EINVAL;
 
-	if (rdev->mddev->pers &&
+	if (rdev->mddev->ro != 1 &&
+	    rdev->mddev->pers &&
 	    rdev->raid_disk >= 0)
 		return -EBUSY;
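
For anyone checking the logic, the gate this diff is aiming for can be
modelled in userspace as follows.  This is only a sketch: struct
model_rdev and recovery_start_gate() are hypothetical stand-ins for the
kernel structures, and it assumes ro == 1 is the read-only state, as the
description above implies.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flattened view of the fields the check looks at. */
struct model_rdev {
	int ro;        /* assumed: 1 means the array is read-only */
	bool has_pers; /* stands in for rdev->mddev->pers != NULL */
	int raid_disk; /* slot number, negative when not in a slot */
};

/*
 * Mirror of the proposed condition: refuse the write only when the
 * array is running, is not read-only, and the device already has a slot.
 */
static int recovery_start_gate(const struct model_rdev *r)
{
	if (r->ro != 1 && r->has_pers && r->raid_disk >= 0)
		return -EBUSY;
	return 0;
}
```

With this condition a read-only array accepts the write even after
'slot' has been assigned, while a read/write array with the device in a
slot still gets -EBUSY.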
