Re: raid1 repair does not repair errors?

From: Michael Tokarev <hidden>
Date: 2014-02-03 17:46:48

03.02.2014 11:30, Michael Tokarev пишет:

quoted hunk ↗ jump to hunk

03.02.2014 08:36, NeilBrown wrote:
[]

quoted

Actually I've changed my mind.  That patch won't fix anything.
fix_sync_read_error() is focussed on fixing a read error on ->read_disk.
So we only set uptodate if ->read_disk succeeded.

So this patch should do it.

NeilBrown

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index fd3a2a14b587..0fe5fd469e74 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c

@@ -1733,7 +1733,8 @@ static void end_sync_read(struct bio *bio, int error)
 	 * or re-read if the read failed.
 	 * We don't do much here, just schedule handling by raid1d
 	 */
-	if (test_bit(BIO_UPTODATE, &bio->bi_flags))
+	if (bio == r1_bio->bios[r1_bio->read_disk] &&
+	    test_bit(BIO_UPTODATE, &bio->bi_flags))
 		set_bit(R1BIO_Uptodate, &r1_bio->state);
 
 	if (atomic_dec_and_test(&r1_bio->remaining))

I changed it like this for now:

--- ../linux-3.10/drivers/md/raid1.c	2014-02-02 16:01:55.003119836 +0400
+++ drivers/md/raid1.c	2014-02-03 11:26:59.062845829 +0400

@@ -1634,8 +1634,12 @@ static void end_sync_read(struct bio *bi
 	 * or re-read if the read failed.
 	 * We don't do much here, just schedule handling by raid1d
 	 */
-	if (test_bit(BIO_UPTODATE, &bio->bi_flags))
-		set_bit(R1BIO_Uptodate, &r1_bio->state);
+	if (bio == r1_bio->bios[r1_bio->read_disk]) {
+	    if (test_bit(BIO_UPTODATE, &bio->bi_flags))
+ 		set_bit(R1BIO_Uptodate, &r1_bio->state);
+	    else
+		printk("end_sync_read: our bio, but !BIO_UPTODATE\n");
+	}

 	if (atomic_dec_and_test(&r1_bio->remaining))
 		reschedule_retry(r1_bio);

and will test it later today (in about 10 hours from now) -- as I
mentioned, this is a prod box and testing isn't possible anytime.

Hm. With this change, there's nothing interesting in dmesg at all anymore:

[  100.571933] md: requested-resync of RAID array md1
[  100.572430] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[  100.573041] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for requested-resync.
[  100.574135] md: using 128k window, over a total of 2096064k.
[  202.616277] ata6.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
[  202.617035] ata6.00: irq_stat 0x40000008
[  202.617439] ata6.00: failed command: READ FPDMA QUEUED
[  202.617980] ata6.00: cmd 60/80:c0:00:3e:3e/00:00:00:00:00/40 tag 24 ncq 65536 in
[  202.617980]          res 41/40:00:23:3e:3e/00:00:00:00:00/40 Emask 0x409 (media error) <F>
[  202.619631] ata6.00: status: { DRDY ERR }
[  202.620077] ata6.00: error: { UNC }
[  202.622692] ata6.00: configured for UDMA/133
[  202.623182] sd 5:0:0:0: [sdd] Unhandled sense code
[  202.623693] sd 5:0:0:0: [sdd]
[  202.624011] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[  202.624600] sd 5:0:0:0: [sdd]
[  202.624929] Sense Key : Medium Error [current] [descriptor]
[  202.625576] Descriptor sense data with sense descriptors (in hex):
[  202.626249]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[  202.627276]         00 3e 3e 23
[  202.627722] sd 5:0:0:0: [sdd]
[  202.628049] Add. Sense: Unrecovered read error - auto reallocate failed
[  202.628773] sd 5:0:0:0: [sdd] CDB:
[  202.629135] Read(10): 28 00 00 3e 3e 00 00 00 80 00
[  202.629907] end_request: I/O error, dev sdd, sector 4079139
[  202.630513] ata6: EH complete
[  205.219181] md: md1: requested-resync done.

So my printk is not called, and it still does not try to recover
the error.

So the first condition in that new if -- bio == r1_bio->bios[r1_bio->read_disk] --
is not true.

So it is - apparently - a wrong guess... ;)

BTW, just in case, -- this is a raid1 with 4 drives (yes, 4 copies of
the same data), not a "usual" 2 disk.

I can test something in next 2..3 hours, next test will be possible
only tomorrow.

Thanks,

/mjt
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help