Re: [PATCH v5 4/8] md/r5cache: write part of r5cache

From: Song Liu <hidden>
Date: 2016-10-14 07:33:03

On Oct 13, 2016, at 11:53 PM, NeilBrown [off-list ref] wrote:

On Thu, Oct 13 2016, Song Liu wrote:

quoted

For RMW, the code allocates an extra page for each data block
being updated.  This is stored in r5dev->page and the old data
is read into it.  Then the prexor calculation subtracts ->page
from the parity block, and the reconstruct calculation adds the
->orig_page data back into the parity block.

What happens if the alloc_page() fails?

That will be tough, but solvable.. We can
    read old data to page
    do prexor 
    read new data from journal device to page
    do xor 
    do the rest of the work. 

Or we can force the code to rcw, which does not need extra page. 
But rcw, does not always work in degraded mode. So, this is a good 
reason not to do write-back in degraded mode...

quoted

This patch includes a fix proposed by ZhengYuan Liu
[off-list ref]

Signed-off-by: Song Liu <redacted>

You introduce a new flag:

quoted

+	R5_InCache,	/* Data in cache */

but don't document it at all (apart from those three words).
I must confess that I found it confuses when you talk about the
"journal" as the "cache".  I understand way, but in my mind, the "cache"
is the in-memory cache of stripe_heads.
We now have something that we sometimes call a "journal", sometimes call
a "log" and sometimes call a "cache".  That is a recipe for confusion.

A sentence or two explaining how this flag will be used would help in
reading the rest of the code.

I will try to unify the names here, and add more comments.

quoted

+
+void r5c_handle_cached_data_endio(struct r5conf *conf,
+	  struct stripe_head *sh, int disks, struct bio_list *return_bi)
+{
+	int i;
+
+	for (i = sh->disks; i--; ) {
+		if (test_bit(R5_InCache, &sh->dev[i].flags) &&
+		    sh->dev[i].written) {

Is it possible for R5_InCache to be set, but 'written' to be NULL ???

Yes, it is possible. A stripe may go through "write data to journal, return IO" multiple
times before parity calculation. When it comes here the second time, dev written in the 
first time will have R5_InCache set, but its written will be NULL.

quoted

static void r5c_finish_cache_stripe(struct stripe_head *sh)

@@ -242,8 +323,10 @@ static void r5c_finish_cache_stripe(struct stripe_head *sh)

	if (log->r5c_state == R5C_STATE_WRITE_THROUGH) {
		BUG_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
		set_bit(STRIPE_R5C_WRITTEN, &sh->state);
-	} else
-		BUG(); /* write back logic in next patch */
+	} else if (test_bit(STRIPE_R5C_FROZEN, &sh->state))
+		r5c_handle_parity_cached(sh);
+	else
+		r5c_handle_data_cached(sh);
}

static void r5l_io_run_stripes(struct r5l_io_unit *io)

@@ -483,7 +566,8 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,

	io = log->current_io;

	for (i = 0; i < sh->disks; i++) {
-		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags))
+		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags) &&
+		    !test_bit(R5_Wantcache, &sh->dev[i].flags))
			continue;

If changed R5_Wantcache to R5_Wantjournal, and always set it on blocks
that were destined for the journal, then this would just be

		if (!test_bit(R5_Wantjournal, &sh->dev[i].flags))

which would make lots of sense...  Just a thought.

We set R5_Wantwrite in multiple places. If we simplify the code here, we will need to make
those places aware of journal. I guess that is not ideal either?

quoted

}

-static void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
+void r5l_wake_reclaim(struct r5l_log *log, sector_t space)

Why are you removing the 'static' here?  You don't call it from any
other file.

In next patch, it is called in raid.c.

quoted


static int r5l_load_log(struct r5l_log *log)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 2e3e61a..0539f34 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c

@@ -245,8 +245,25 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,

			    < IO_THRESHOLD)
				md_wakeup_thread(conf->mddev->thread);
		atomic_dec(&conf->active_stripes);
-		if (!test_bit(STRIPE_EXPANDING, &sh->state))
-			list_add_tail(&sh->lru, temp_inactive_list);
+		if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
+			if (atomic_read(&sh->dev_in_cache) == 0) {
+				list_add_tail(&sh->lru, temp_inactive_list);
+			} else if (atomic_read(&sh->dev_in_cache) ==
+				   conf->raid_disks - conf->max_degraded) {

Is this really the test that you want?
Presumably "full stripes" are those that can be written out without
reading anything, while "partial stripes" will need something read in
first.
It is possible that the stripe head contains data for some devices which
is not in the cache, possibly because a READ request filled it in.
So we should really be counting the number of devices with R5_UPTODATE.
??

Yes, counting R5_UPTODATE will be better. Let me check whether there are
corner cases we need to cover. 

Thanks,
Song

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help