Re: Persistent failures with simple md setup
From: Hans-Peter Jansen <hidden>
Date: 2013-02-04 20:43:29
Am Mittwoch, 30. Januar 2013, 18:12:46 schrieb Hans-Peter Jansen:
Hmm, according to mdadm from openSUSE:12.1:Update, the relevant fixes should be in place. It might be an unfortunate combination of this issue and the asynchronously applied updates, interfered by the *switching* behavior. I started with regenerating the initrds now, and a first reboot succeeded so far. Good. Will ask my friend to reboot the system a dozen times tonight.
After a few reboots, the issue reappeared. I really believe now, that by
driving the md in degraded mode for some time and with the switching behavior,
just re-adding the devices resulted in unsynced raid1 devices.
Next, my friend managed to create a nearby data disaster: I've explained him,
how he would be able to re-add a device himself. He did so on sunday with his
home partition, and since there appeared no progress bar in /proc/mdstat, he
immediately repeated the command.
Neil, is it conceivable (due to a race or the like), that repeating to add
(re-add) a device potentially creates data salad, since that home-fs (xfs)
gone mad a few minutes later (firefox crashed, and couldn't be started, kmail
crashed, and so on (all those processes, that write to ~). He decided to
reboot, and that jailed him in the emergency recovery console, because /home
couldn't be mounted anymore.
Both parts of the mirror were affected, the "old" part was ~200kb
undestructive xfs_repair log, the other ~900kb, hence I decided to use the
smaller one. First I failed and removed the other part, and then attempted to
repair it. Unfortunately, the real repair run bailed out with:
disconnected inode 2161430687, moving to lost+found
corrupt dinode 2161430687, extent total = 1, nblocks = 0. This is a bug.
Please capture the filesystem metadata with xfs_metadump and
report it to xfs@oss.sgi.com.
cache_node_purge: refcount was 1, not zero (node=0xf867208)
fatal error -- 117 - couldn't iget disconnected inode
although I already used the (current) xfsprogs-3.1.6 version. :-(
After fixing that issue manually with xfs_db (== great fun), I was able to
recover the filesystem. It lost(+found) just a few new items, nobody cares
about... So far, so good..
Now, the unsynced state disturbed me. Just re-adding the bad device might
result in an invalid mirror again. A "repair" run cannot be controlled. Hence
I zeroed the superblock of that partition, and added it. Et voila, it
completely synced that mirror. Good.
Today, I hammered the raid1 partitions with "check". During one run, this
appeared in syslog:
Feb 4 11:18:26 zaphkiel kernel: [11165.652478] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb 4 11:18:26 zaphkiel kernel: [11165.652486] ata2.00: irq_stat 0x40000008
Feb 4 11:18:26 zaphkiel kernel: [11165.652495] ata2.00: failed command: READ FPDMA QUEUED
Feb 4 11:18:26 zaphkiel kernel: [11165.652510] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in
Feb 4 11:18:26 zaphkiel kernel: [11165.652513] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb 4 11:18:26 zaphkiel kernel: [11165.652520] ata2.00: status: { DRDY ERR }
Feb 4 11:18:26 zaphkiel kernel: [11165.652524] ata2.00: error: { UNC }
Feb 4 11:18:26 zaphkiel kernel: [11165.652876] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x100)
Feb 4 11:18:26 zaphkiel kernel: [11165.652882] ata2.00: revalidation failed (errno=-5)
Feb 4 11:18:26 zaphkiel kernel: [11165.652890] ata2: hard resetting link
Feb 4 11:18:26 zaphkiel kernel: [11165.957043] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 4 11:18:26 zaphkiel kernel: [11165.969910] ata2.00: configured for UDMA/133
Feb 4 11:18:26 zaphkiel kernel: [11165.970048] ata2: EH complete
Feb 4 11:18:28 zaphkiel kernel: [11167.949241] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb 4 11:18:28 zaphkiel kernel: [11167.949249] ata2.00: irq_stat 0x40000008
Feb 4 11:18:28 zaphkiel kernel: [11167.949257] ata2.00: failed command: READ FPDMA QUEUED
Feb 4 11:18:28 zaphkiel kernel: [11167.949272] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in
Feb 4 11:18:28 zaphkiel kernel: [11167.949275] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb 4 11:18:28 zaphkiel kernel: [11167.949282] ata2.00: status: { DRDY ERR }
Feb 4 11:18:28 zaphkiel kernel: [11167.949287] ata2.00: error: { UNC }
Feb 4 11:18:28 zaphkiel kernel: [11167.962146] ata2.00: configured for UDMA/133
Feb 4 11:18:28 zaphkiel kernel: [11167.962206] ata2: EH complete
Feb 4 11:18:30 zaphkiel kernel: [11169.898187] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb 4 11:18:30 zaphkiel kernel: [11169.898195] ata2.00: irq_stat 0x40000008
Feb 4 11:18:30 zaphkiel kernel: [11169.898204] ata2.00: failed command: READ FPDMA QUEUED
Feb 4 11:18:30 zaphkiel kernel: [11169.898219] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in
Feb 4 11:18:30 zaphkiel kernel: [11169.898222] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb 4 11:18:30 zaphkiel kernel: [11169.898229] ata2.00: status: { DRDY ERR }
Feb 4 11:18:30 zaphkiel kernel: [11169.898234] ata2.00: error: { UNC }
Feb 4 11:18:30 zaphkiel kernel: [11169.912066] ata2.00: configured for UDMA/133
Feb 4 11:18:30 zaphkiel kernel: [11169.912117] ata2: EH complete
Feb 4 11:18:32 zaphkiel kernel: [11171.905192] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb 4 11:18:32 zaphkiel kernel: [11171.905200] ata2.00: irq_stat 0x40000008
Feb 4 11:18:32 zaphkiel kernel: [11171.905208] ata2.00: failed command: READ FPDMA QUEUED
Feb 4 11:18:32 zaphkiel kernel: [11171.905223] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in
Feb 4 11:18:32 zaphkiel kernel: [11171.905226] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb 4 11:18:32 zaphkiel kernel: [11171.905233] ata2.00: status: { DRDY ERR }
Feb 4 11:18:32 zaphkiel kernel: [11171.905238] ata2.00: error: { UNC }
Feb 4 11:18:32 zaphkiel kernel: [11171.919099] ata2.00: configured for UDMA/133
Feb 4 11:18:32 zaphkiel kernel: [11171.919152] ata2: EH complete
Feb 4 11:18:34 zaphkiel kernel: [11173.912191] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb 4 11:18:34 zaphkiel kernel: [11173.912199] ata2.00: irq_stat 0x40000008
Feb 4 11:18:34 zaphkiel kernel: [11173.912208] ata2.00: failed command: READ FPDMA QUEUED
Feb 4 11:18:34 zaphkiel kernel: [11173.912223] ata2.00: cmd 60/80:e0:12:ef:c2/00:00:0c:00:00/40 tag 28 ncq 65536 in
Feb 4 11:18:34 zaphkiel kernel: [11173.912226] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb 4 11:18:34 zaphkiel kernel: [11173.912233] ata2.00: status: { DRDY ERR }
Feb 4 11:18:34 zaphkiel kernel: [11173.912238] ata2.00: error: { UNC }
Feb 4 11:18:34 zaphkiel kernel: [11173.925101] ata2.00: configured for UDMA/133
Feb 4 11:18:34 zaphkiel kernel: [11173.925159] ata2: EH complete
Feb 4 11:18:36 zaphkiel kernel: [11175.861152] ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb 4 11:18:36 zaphkiel kernel: [11175.861160] ata2.00: irq_stat 0x40000008
Feb 4 11:18:36 zaphkiel kernel: [11175.861168] ata2.00: failed command: READ FPDMA QUEUED
Feb 4 11:18:36 zaphkiel kernel: [11175.861183] ata2.00: cmd 60/80:10:12:ef:c2/00:00:0c:00:00/40 tag 2 ncq 65536 in
Feb 4 11:18:36 zaphkiel kernel: [11175.861186] res 41/40:53:3f:ef:c2/00:00:0c:00:00/40 Emask 0x409 (media error) <F>
Feb 4 11:18:36 zaphkiel kernel: [11175.861193] ata2.00: status: { DRDY ERR }
Feb 4 11:18:36 zaphkiel kernel: [11175.861198] ata2.00: error: { UNC }
Feb 4 11:18:36 zaphkiel kernel: [11175.874052] ata2.00: configured for UDMA/133
Feb 4 11:18:36 zaphkiel kernel: [11175.874103] sd 1:0:0:0: [sdb] Unhandled sense code
Feb 4 11:18:36 zaphkiel kernel: [11175.874109] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 4 11:18:36 zaphkiel kernel: [11175.874117] sd 1:0:0:0: [sdb] Sense Key : Medium Error [current] [descriptor]
Feb 4 11:18:36 zaphkiel kernel: [11175.874125] Descriptor sense data with sense descriptors (in hex):
Feb 4 11:18:36 zaphkiel kernel: [11175.874130] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Feb 4 11:18:36 zaphkiel kernel: [11175.874145] 0c c2 ef 3f
Feb 4 11:18:36 zaphkiel kernel: [11175.874153] sd 1:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed
Feb 4 11:18:36 zaphkiel kernel: [11175.874163] sd 1:0:0:0: [sdb] CDB: Read(10): 28 00 0c c2 ef 12 00 00 80 00
Feb 4 11:18:36 zaphkiel kernel: [11175.874180] end_request: I/O error, dev sdb, sector 214101823
Feb 4 11:18:36 zaphkiel kernel: [11175.874234] ata2: EH complete
Feb 4 11:18:38 zaphkiel kernel: [11177.954091] md: md124: data-check done.
This is a classical URE, isn't it? Interestingly, nonetheless, the raid1 check
run succeeded! (Not so good, is it?)
Before you ask, both drives have a sane timeout already:
smartctl -l scterc /dev/sda
smartctl 6.0 2012-10-10 r3643 [i686-linux-3.1.10-1.9-desktop] (SUSE RPM)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
SCT Error Recovery Control:
Read: 70 (7.0 seconds)
Write: 70 (7.0 seconds)
smartctl -l scterc /dev/sdb
smartctl 6.0 2012-10-10 r3643 [i686-linux-3.1.10-1.9-desktop] (SUSE RPM)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
SCT Error Recovery Control:
Read: 70 (7.0 seconds)
Write: 70 (7.0 seconds)
Attached is the result of smartctl -x of sda (good), and sdb (bad).
Could somebody from the audience have a look into it, and give me an
assessment, how dangerous the state of this drive really is.
Last question: since I had to massage the system anyway, I've updated mdadm
from 3.2.2 to 3.2.6. I red, that it can be dangerous to do so, what do I risk
here?
Thanks in advance,
Pete Attachments
- smart-sdb-20130204-140400.log [text/x-log] 17301 bytes · preview
- smart-sda-20130204-140400.log [text/x-log] 10814 bytes · preview