Re: [PATCH v5 5/7] raid5-ppl: load and recover the log

From: Artur Paszkiewicz <hidden>
Date: 2017-03-10 15:23:59

On 03/10/2017 12:30 AM, Shaohua Li wrote:

On Thu, Mar 09, 2017 at 10:00:01AM +0100, Artur Paszkiewicz wrote:

Load the log from each disk when starting the array and recover if the
array is dirty.

The initial empty PPL is written by mdadm. When loading the log we
verify the header checksum and signature. For external metadata arrays
the signature is verified in userspace, so here we read it from the
header, verifying only if it matches on all disks, and use it later when
writing PPL.

In addition to the header checksum, each header entry also contains a
checksum of its partial parity data. If the header is valid, recovery is
performed for each entry until an invalid entry is found. If the array
is not degraded and recovery using PPL fully succeeds, there is no need
to resync the array because data and parity will be consistent, so in
this case resync will be disabled.
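
The entry walk described above can be modeled in userspace roughly as follows. This is a minimal sketch, not the kernel code: `struct model_entry`, `model_checksum()` and `model_recover()` are hypothetical names introduced for illustration, and the bitwise CRC32C is only a stand-in for the checksum computed over the partial parity data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one log entry: payload plus its stored checksum. */
struct model_entry {
	const uint8_t *pp_data;	/* partial parity bytes */
	size_t pp_size;		/* length of pp_data in bytes */
	uint32_t checksum;	/* checksum recorded in the header entry */
};

/* Bitwise CRC32C (Castagnoli), used here only as a stand-in checksum. */
static uint32_t model_checksum(const uint8_t *buf, size_t len)
{
	uint32_t crc = ~0u;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= buf[i];
		for (bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/*
 * Walk the entries in order and stop at the first one whose checksum does
 * not match. Returns the number of entries that would be recovered;
 * *complete is set to 1 only if every entry was valid.
 */
static int model_recover(const struct model_entry *e, int count, int *complete)
{
	int recovered = 0;
	int i;

	assert(e != NULL && complete != NULL);
	for (i = 0; i < count; i++) {
		if (model_checksum(e[i].pp_data, e[i].pp_size) != e[i].checksum)
			break;		/* invalid entry: nothing after it is trusted */
		recovered++;	/* the real code would replay the entry here */
	}
	*complete = (recovered == count);
	return recovered;
}
```

In this model, skipping the resync would only be justified when `complete` is set and the array is not degraded, matching the condition stated above.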

Due to compatibility with IMSM implementations on other systems, we
can't assume that the recovery data block size is always 4K. Writes
generated by MD raid5 don't have this issue, but when recovering PPL
written in other environments it is possible to have entries with
512-byte sector granularity. The recovery code takes this into account
and also the logical sector size of the underlying drives.

Signed-off-by: Artur Paszkiewicz <redacted>
---
 drivers/md/raid5-ppl.c | 497 +++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.c     |   5 +-
 2 files changed, 501 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index 92783586743d..548d1028a3ce 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c

@@ -103,6 +103,10 @@ struct ppl_conf {
 	mempool_t *io_pool;
 	struct bio_set *bs;
 	mempool_t *meta_pool;
+
+	/* used only for recovery */
+	int recovered_entries;
+	int mismatch_count;
 };
 
 struct ppl_log {

@@ -514,6 +518,482 @@ void ppl_stripe_write_finished(struct stripe_head *sh)
 		ppl_io_unit_finished(io);
 }
 
+static void ppl_xor(int size, struct page *page1, struct page *page2,
+		    struct page *page_result)
+{

I'd remove the page_result parameter, it should always be page1. And this will
make it clear why we need ASYNC_TX_XOR_DROP_DST.
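
To illustrate the point about the destination: if the result always lands in the first page, the destination is itself one of the XOR sources, which is the case ASYNC_TX_XOR_DROP_DST is meant for. A minimal userspace model of that in-place XOR, with plain byte buffers standing in for pages (`model_xor()` is a hypothetical name, not a kernel function):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace model only: XOR src into dst in place, so dst is both the
 * first source and the destination (dst = dst ^ src).
 */
static void model_xor(size_t size, uint8_t *dst, const uint8_t *src)
{
	size_t i;

	assert(dst != NULL && src != NULL);
	for (i = 0; i < size; i++)
		dst[i] ^= src[i];
}
```

With only two buffer arguments there is no way to pass a separate result buffer, so the aliasing between source and destination is visible in the signature.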

+	struct async_submit_ctl submit;
+	struct dma_async_tx_descriptor *tx;
+	struct page *xor_srcs[] = { page1, page2 };
+
+	init_async_submit(&submit, ASYNC_TX_ACK|ASYNC_TX_XOR_DROP_DST,
+			  NULL, NULL, NULL, NULL);
+	tx = async_xor(page_result, xor_srcs, 0, 2, size, &submit);
+
+	async_tx_quiesce(&tx);
+}

...

+			ret = ppl_recover_entry(log, e, ppl_sector);
+			if (ret)
+				goto out;
+			ppl_conf->recovered_entries++;
+		}
+
+		ppl_sector += ppl_entry_sectors;
+	}
+
+	/* flush the disk cache after recovery if necessary */
+	if (test_bit(QUEUE_FLAG_WC, &bdev_get_queue(rdev->bdev)->queue_flags)) {

The block layer will handle this, so you don't need this check.

+		struct bio *bio = bio_alloc_bioset(GFP_KERNEL, 0, ppl_conf->bs);
+
+		bio->bi_bdev = rdev->bdev;
+		bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
+		ret = submit_bio_wait(bio);
+		bio_put(bio);

Please use blkdev_issue_flush() instead.

+	}
+out:
+	__free_page(page);
+	return ret;
+}
+

...

+static int ppl_load(struct ppl_conf *ppl_conf)
+{
+	int ret = 0;
+	u32 signature = 0;
+	bool signature_set = false;
+	int i;
+
+	for (i = 0; i < ppl_conf->count; i++) {
+		struct ppl_log *log = &ppl_conf->child_logs[i];
+
+		/* skip missing drive */
+		if (!log->rdev)
+			continue;
+
+		ret = ppl_load_distributed(log);
+		if (ret)
+			break;

Not sure about the strategy here. But if one disk fails, why don't we continue
the recovery from the other disks? This way we can at least recover more data.

I thought it would be safer to abort early. Then we can, for example,
remove the failed drive and try again, or disable PPL. And if the array is
already degraded and another disk fails, the recovery won't be
meaningful anyway.

Thanks,
Artur
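
The two strategies discussed above (abort on the first failing disk versus continue with the remaining disks) can be contrasted in a small userspace model. This is a sketch under stated assumptions, not the patch's code: `model_load_fn`, `model_load_abort_early()`, `model_load_continue()` and the sample loader `model_fail_disk1()` are hypothetical names, and a plain integer index stands in for a child log.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-disk loader: returns 0 on success, nonzero on failure. */
typedef int (*model_load_fn)(int disk);

/* Sample loader for illustration: disk 1 fails with -5, all others succeed. */
static int model_fail_disk1(int disk)
{
	return disk == 1 ? -5 : 0;
}

/* Abort early: stop at the first failing disk and report its error. */
static int model_load_abort_early(int count, model_load_fn load, int *loaded)
{
	int ret = 0;
	int i;

	assert(load != NULL && loaded != NULL);
	*loaded = 0;
	for (i = 0; i < count; i++) {
		ret = load(i);
		if (ret)
			break;
		(*loaded)++;
	}
	return ret;
}

/* Continue: try every disk, but remember and return the first error seen. */
static int model_load_continue(int count, model_load_fn load, int *loaded)
{
	int first_err = 0;
	int i;

	assert(load != NULL && loaded != NULL);
	*loaded = 0;
	for (i = 0; i < count; i++) {
		int ret = load(i);

		if (ret) {
			if (!first_err)
				first_err = ret;
			continue;
		}
		(*loaded)++;
	}
	return first_err;
}
```

Both variants report the failure to the caller; they differ only in how many of the remaining logs are processed before returning, which is the trade-off raised in the review.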
