Thread (11 messages) 11 messages, 4 authors, 2013-02-28

Re: Persistent failures with simple md setup

From: NeilBrown <hidden>
Date: 2013-02-05 03:44:48

On Mon, 04 Feb 2013 21:43:29 +0100 Hans-Peter Jansen [off-list ref] wrote:
Am Mittwoch, 30. Januar 2013, 18:12:46 schrieb Hans-Peter Jansen:
quoted
Hmm, according to mdadm from openSUSE:12.1:Update, the relevant fixes should
be in place. It might be an unfortunate combination of this issue and the
asynchronously applied updates, interfered by the *switching* behavior.

I started with regenerating the initrds now, and a first reboot succeeded so
far. Good.

Will ask my friend to reboot the system a dozen times tonight.
After a few reboots, the issue reappeared. I really believe now, that by
driving the md in degraded mode for some time and with the switching behavior, 
just re-adding the devices resulted in unsynced raid1 devices.

Next, my friend managed to create a nearby data disaster: I've explained him,
how he would be able to re-add a device himself. He did so on sunday with his
home partition, and since there appeared no progress bar in /proc/mdstat, he 
immediately repeated the command. 

Neil, is it conceivable (due to a race or the like), that repeating to add 
(re-add) a device potentially creates data salad, since that home-fs (xfs) 
gone mad a few minutes later (firefox crashed, and couldn't be started, kmail 
crashed, and so on (all those processes, that write to ~). He decided to 
reboot, and that jailed him in the emergency recovery console, because /home 
couldn't be mounted anymore.
There was a bug prior to 2.6.37 (fixed by commit 1a855a0606653d2) which
sounds vaguely related, but you seem to be running a 3.0.x kernel(?) so
shouldn't be affected by that.

Without logs of precisely what happened, it is very hard to guess.


Today, I hammered the raid1 partitions with "check". During one run, this 
appeared in syslog:

Feb  4 11:18:26 zaphkiel kernel: [11165.652478] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb  4 11:18:26 zaphkiel kernel: [11165.652486] ata2.00: irq_stat 0x40000008
Feb  4 11:18:26 zaphkiel kernel: [11165.652495] ata2.00: failed command: READ FPDMA QUEUED
Feb  4 11:18:26 zaphkiel kernel: [11165.652510] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in
Feb  4 11:18:26 zaphkiel kernel: [11165.652513]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb  4 11:18:26 zaphkiel kernel: [11165.652520] ata2.00: status: { DRDY ERR }
Feb  4 11:18:26 zaphkiel kernel: [11165.652524] ata2.00: error: { UNC }
Feb  4 11:18:26 zaphkiel kernel: [11165.652876] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x100)
Feb  4 11:18:26 zaphkiel kernel: [11165.652882] ata2.00: revalidation failed (errno=-5)
Feb  4 11:18:26 zaphkiel kernel: [11165.652890] ata2: hard resetting link
Feb  4 11:18:26 zaphkiel kernel: [11165.957043] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb  4 11:18:26 zaphkiel kernel: [11165.969910] ata2.00: configured for UDMA/133
Feb  4 11:18:26 zaphkiel kernel: [11165.970048] ata2: EH complete
Feb  4 11:18:28 zaphkiel kernel: [11167.949241] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb  4 11:18:28 zaphkiel kernel: [11167.949249] ata2.00: irq_stat 0x40000008
Feb  4 11:18:28 zaphkiel kernel: [11167.949257] ata2.00: failed command: READ FPDMA QUEUED
Feb  4 11:18:28 zaphkiel kernel: [11167.949272] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in
Feb  4 11:18:28 zaphkiel kernel: [11167.949275]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb  4 11:18:28 zaphkiel kernel: [11167.949282] ata2.00: status: { DRDY ERR }
Feb  4 11:18:28 zaphkiel kernel: [11167.949287] ata2.00: error: { UNC }
Feb  4 11:18:28 zaphkiel kernel: [11167.962146] ata2.00: configured for UDMA/133
Feb  4 11:18:28 zaphkiel kernel: [11167.962206] ata2: EH complete
Feb  4 11:18:30 zaphkiel kernel: [11169.898187] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb  4 11:18:30 zaphkiel kernel: [11169.898195] ata2.00: irq_stat 0x40000008
Feb  4 11:18:30 zaphkiel kernel: [11169.898204] ata2.00: failed command: READ FPDMA QUEUED
Feb  4 11:18:30 zaphkiel kernel: [11169.898219] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in
Feb  4 11:18:30 zaphkiel kernel: [11169.898222]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb  4 11:18:30 zaphkiel kernel: [11169.898229] ata2.00: status: { DRDY ERR }
Feb  4 11:18:30 zaphkiel kernel: [11169.898234] ata2.00: error: { UNC }
Feb  4 11:18:30 zaphkiel kernel: [11169.912066] ata2.00: configured for UDMA/133
Feb  4 11:18:30 zaphkiel kernel: [11169.912117] ata2: EH complete
Feb  4 11:18:32 zaphkiel kernel: [11171.905192] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb  4 11:18:32 zaphkiel kernel: [11171.905200] ata2.00: irq_stat 0x40000008
Feb  4 11:18:32 zaphkiel kernel: [11171.905208] ata2.00: failed command: READ FPDMA QUEUED
Feb  4 11:18:32 zaphkiel kernel: [11171.905223] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in
Feb  4 11:18:32 zaphkiel kernel: [11171.905226]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb  4 11:18:32 zaphkiel kernel: [11171.905233] ata2.00: status: { DRDY ERR }
Feb  4 11:18:32 zaphkiel kernel: [11171.905238] ata2.00: error: { UNC }
Feb  4 11:18:32 zaphkiel kernel: [11171.919099] ata2.00: configured for UDMA/133
Feb  4 11:18:32 zaphkiel kernel: [11171.919152] ata2: EH complete
Feb  4 11:18:34 zaphkiel kernel: [11173.912191] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb  4 11:18:34 zaphkiel kernel: [11173.912199] ata2.00: irq_stat 0x40000008
Feb  4 11:18:34 zaphkiel kernel: [11173.912208] ata2.00: failed command: READ FPDMA QUEUED
Feb  4 11:18:34 zaphkiel kernel: [11173.912223] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in
Feb  4 11:18:34 zaphkiel kernel: [11173.912226]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb  4 11:18:34 zaphkiel kernel: [11173.912233] ata2.00: status: { DRDY ERR }
Feb  4 11:18:34 zaphkiel kernel: [11173.912238] ata2.00: error: { UNC }
Feb  4 11:18:34 zaphkiel kernel: [11173.925101] ata2.00: configured for UDMA/133
Feb  4 11:18:34 zaphkiel kernel: [11173.925159] ata2: EH complete
Feb  4 11:18:36 zaphkiel kernel: [11175.861152] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb  4 11:18:36 zaphkiel kernel: [11175.861160] ata2.00: irq_stat 0x40000008
Feb  4 11:18:36 zaphkiel kernel: [11175.861168] ata2.00: failed command: READ FPDMA QUEUED
Feb  4 11:18:36 zaphkiel kernel: [11175.861183] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in
Feb  4 11:18:36 zaphkiel kernel: [11175.861186]          res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb  4 11:18:36 zaphkiel kernel: [11175.861193] ata2.00: status: { DRDY ERR }
Feb  4 11:18:36 zaphkiel kernel: [11175.861198] ata2.00: error: { UNC }
Feb  4 11:18:36 zaphkiel kernel: [11175.874052] ata2.00: configured for UDMA/133
Feb  4 11:18:36 zaphkiel kernel: [11175.874103] sd 1:0:0:0: [sdb] Unhandled sense code
Feb  4 11:18:36 zaphkiel kernel: [11175.874109] sd 1:0:0:0: [sdb]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb  4 11:18:36 zaphkiel kernel: [11175.874117] sd 1:0:0:0: [sdb]  Sense Key : Medium Error [current] [descriptor]
Feb  4 11:18:36 zaphkiel kernel: [11175.874125] Descriptor sense data with sense descriptors (in hex):
Feb  4 11:18:36 zaphkiel kernel: [11175.874130]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Feb  4 11:18:36 zaphkiel kernel: [11175.874145]         0c c2 ef 3f 
Feb  4 11:18:36 zaphkiel kernel: [11175.874153] sd 1:0:0:0: [sdb]  Add. Sense: Unrecovered read error - auto reallocate failed
Feb  4 11:18:36 zaphkiel kernel: [11175.874163] sd 1:0:0:0: [sdb] CDB: Read(10): 28 00 0c c2 ef 12 00 00 80 00
Feb  4 11:18:36 zaphkiel kernel: [11175.874180] end_request: I/O error, dev sdb, sector 214101823
Feb  4 11:18:36 zaphkiel kernel: [11175.874234] ata2: EH complete
Feb  4 11:18:38 zaphkiel kernel: [11177.954091] md: md124: data-check done.

This is a classical URE, isn't it? Interestingly, nonetheless, the raid1 check 
run succeeded! (Not so good, is it?)
What?  Success not good?  :-)

md didn't report any errors.  Maybe it didn't see any.  Where is sector
214101823?


Last question: since I had to massage the system anyway, I've updated mdadm 
from 3.2.2 to 3.2.6. I red, that it can be dangerous to do so, what do I risk
here?
Where did you read that?
If you find you need to re-create an array with "--create", a different mdadm
might give a different result, which wouldn't be what you want.  Otherwise it
should be safe.

NeilBrown

Attachments

Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help