Re: Persistent failures with simple md setup
From: NeilBrown <hidden>
Date: 2013-02-05 03:44:48
On Mon, 04 Feb 2013 21:43:29 +0100 Hans-Peter Jansen [off-list ref] wrote:
Am Mittwoch, 30. Januar 2013, 18:12:46 schrieb Hans-Peter Jansen:quoted
Hmm, according to mdadm from openSUSE:12.1:Update, the relevant fixes should be in place. It might be an unfortunate combination of this issue and the asynchronously applied updates, interfered by the *switching* behavior. I started with regenerating the initrds now, and a first reboot succeeded so far. Good. Will ask my friend to reboot the system a dozen times tonight.After a few reboots, the issue reappeared. I really believe now, that by driving the md in degraded mode for some time and with the switching behavior, just re-adding the devices resulted in unsynced raid1 devices. Next, my friend managed to create a nearby data disaster: I've explained him, how he would be able to re-add a device himself. He did so on sunday with his home partition, and since there appeared no progress bar in /proc/mdstat, he immediately repeated the command. Neil, is it conceivable (due to a race or the like), that repeating to add (re-add) a device potentially creates data salad, since that home-fs (xfs) gone mad a few minutes later (firefox crashed, and couldn't be started, kmail crashed, and so on (all those processes, that write to ~). He decided to reboot, and that jailed him in the emergency recovery console, because /home couldn't be mounted anymore.
There was a bug prior to 2.6.37 (fixed by commit 1a855a0606653d2) which sounds vaguely related, but you seem to be running a 3.0.x kernel(?) so shouldn't be affected by that. Without logs of precisely what happened, it is very hard to guess.
Today, I hammered the raid1 partitions with "check". During one run, this
appeared in syslog:
Feb 4 11:18:26 zaphkiel kernel: [11165.652478] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb 4 11:18:26 zaphkiel kernel: [11165.652486] ata2.00: irq_stat 0x40000008
Feb 4 11:18:26 zaphkiel kernel: [11165.652495] ata2.00: failed command: READ FPDMA QUEUED
Feb 4 11:18:26 zaphkiel kernel: [11165.652510] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in
Feb 4 11:18:26 zaphkiel kernel: [11165.652513] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb 4 11:18:26 zaphkiel kernel: [11165.652520] ata2.00: status: { DRDY ERR }
Feb 4 11:18:26 zaphkiel kernel: [11165.652524] ata2.00: error: { UNC }
Feb 4 11:18:26 zaphkiel kernel: [11165.652876] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x100)
Feb 4 11:18:26 zaphkiel kernel: [11165.652882] ata2.00: revalidation failed (errno=-5)
Feb 4 11:18:26 zaphkiel kernel: [11165.652890] ata2: hard resetting link
Feb 4 11:18:26 zaphkiel kernel: [11165.957043] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 4 11:18:26 zaphkiel kernel: [11165.969910] ata2.00: configured for UDMA/133
Feb 4 11:18:26 zaphkiel kernel: [11165.970048] ata2: EH complete
Feb 4 11:18:28 zaphkiel kernel: [11167.949241] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb 4 11:18:28 zaphkiel kernel: [11167.949249] ata2.00: irq_stat 0x40000008
Feb 4 11:18:28 zaphkiel kernel: [11167.949257] ata2.00: failed command: READ FPDMA QUEUED
Feb 4 11:18:28 zaphkiel kernel: [11167.949272] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in
Feb 4 11:18:28 zaphkiel kernel: [11167.949275] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb 4 11:18:28 zaphkiel kernel: [11167.949282] ata2.00: status: { DRDY ERR }
Feb 4 11:18:28 zaphkiel kernel: [11167.949287] ata2.00: error: { UNC }
Feb 4 11:18:28 zaphkiel kernel: [11167.962146] ata2.00: configured for UDMA/133
Feb 4 11:18:28 zaphkiel kernel: [11167.962206] ata2: EH complete
Feb 4 11:18:30 zaphkiel kernel: [11169.898187] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb 4 11:18:30 zaphkiel kernel: [11169.898195] ata2.00: irq_stat 0x40000008
Feb 4 11:18:30 zaphkiel kernel: [11169.898204] ata2.00: failed command: READ FPDMA QUEUED
Feb 4 11:18:30 zaphkiel kernel: [11169.898219] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in
Feb 4 11:18:30 zaphkiel kernel: [11169.898222] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb 4 11:18:30 zaphkiel kernel: [11169.898229] ata2.00: status: { DRDY ERR }
Feb 4 11:18:30 zaphkiel kernel: [11169.898234] ata2.00: error: { UNC }
Feb 4 11:18:30 zaphkiel kernel: [11169.912066] ata2.00: configured for UDMA/133
Feb 4 11:18:30 zaphkiel kernel: [11169.912117] ata2: EH complete
Feb 4 11:18:32 zaphkiel kernel: [11171.905192] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb 4 11:18:32 zaphkiel kernel: [11171.905200] ata2.00: irq_stat 0x40000008
Feb 4 11:18:32 zaphkiel kernel: [11171.905208] ata2.00: failed command: READ FPDMA QUEUED
Feb 4 11:18:32 zaphkiel kernel: [11171.905223] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in
Feb 4 11:18:32 zaphkiel kernel: [11171.905226] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb 4 11:18:32 zaphkiel kernel: [11171.905233] ata2.00: status: { DRDY ERR }
Feb 4 11:18:32 zaphkiel kernel: [11171.905238] ata2.00: error: { UNC }
Feb 4 11:18:32 zaphkiel kernel: [11171.919099] ata2.00: configured for UDMA/133
Feb 4 11:18:32 zaphkiel kernel: [11171.919152] ata2: EH complete
Feb 4 11:18:34 zaphkiel kernel: [11173.912191] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb 4 11:18:34 zaphkiel kernel: [11173.912199] ata2.00: irq_stat 0x40000008
Feb 4 11:18:34 zaphkiel kernel: [11173.912208] ata2.00: failed command: READ FPDMA QUEUED
Feb 4 11:18:34 zaphkiel kernel: [11173.912223] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in
Feb 4 11:18:34 zaphkiel kernel: [11173.912226] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb 4 11:18:34 zaphkiel kernel: [11173.912233] ata2.00: status: { DRDY ERR }
Feb 4 11:18:34 zaphkiel kernel: [11173.912238] ata2.00: error: { UNC }
Feb 4 11:18:34 zaphkiel kernel: [11173.925101] ata2.00: configured for UDMA/133
Feb 4 11:18:34 zaphkiel kernel: [11173.925159] ata2: EH complete
Feb 4 11:18:36 zaphkiel kernel: [11175.861152] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb 4 11:18:36 zaphkiel kernel: [11175.861160] ata2.00: irq_stat 0x40000008
Feb 4 11:18:36 zaphkiel kernel: [11175.861168] ata2.00: failed command: READ FPDMA QUEUED
Feb 4 11:18:36 zaphkiel kernel: [11175.861183] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in
Feb 4 11:18:36 zaphkiel kernel: [11175.861186] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb 4 11:18:36 zaphkiel kernel: [11175.861193] ata2.00: status: { DRDY ERR }
Feb 4 11:18:36 zaphkiel kernel: [11175.861198] ata2.00: error: { UNC }
Feb 4 11:18:36 zaphkiel kernel: [11175.874052] ata2.00: configured for UDMA/133
Feb 4 11:18:36 zaphkiel kernel: [11175.874103] sd 1:0:0:0: [sdb] Unhandled sense code
Feb 4 11:18:36 zaphkiel kernel: [11175.874109] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 4 11:18:36 zaphkiel kernel: [11175.874117] sd 1:0:0:0: [sdb] Sense Key : Medium Error [current] [descriptor]
Feb 4 11:18:36 zaphkiel kernel: [11175.874125] Descriptor sense data with sense descriptors (in hex):
Feb 4 11:18:36 zaphkiel kernel: [11175.874130] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Feb 4 11:18:36 zaphkiel kernel: [11175.874145] 0c c2 ef 3f
Feb 4 11:18:36 zaphkiel kernel: [11175.874153] sd 1:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed
Feb 4 11:18:36 zaphkiel kernel: [11175.874163] sd 1:0:0:0: [sdb] CDB: Read(10): 28 00 0c c2 ef 12 00 00 80 00
Feb 4 11:18:36 zaphkiel kernel: [11175.874180] end_request: I/O error, dev sdb, sector 214101823
Feb 4 11:18:36 zaphkiel kernel: [11175.874234] ata2: EH complete
Feb 4 11:18:38 zaphkiel kernel: [11177.954091] md: md124: data-check done.
This is a classical URE, isn't it? Interestingly, nonetheless, the raid1 check
run succeeded! (Not so good, is it?)What? Success not good? :-) md didn't report any errors. Maybe it didn't see any. Where is sector 214101823?
Last question: since I had to massage the system anyway, I've updated mdadm from 3.2.2 to 3.2.6. I red, that it can be dangerous to do so, what do I risk here?
Where did you read that? If you find you need to re-create an array with "--create", a different mdadm might give a different result, which wouldn't be what you want. Otherwise it should be safe. NeilBrown
Attachments
- signature.asc [application/pgp-signature] 828 bytes