Re: How do I tell which disk failed?

From: Ross Boylan <hidden>
Date: 2013-01-08 07:49:46

On Tue, 2013-01-08 at 00:17 -0700, Chris Murphy wrote:

> On Jan 7, 2013, at 11:59 PM, Ross Boylan [off-list ref] wrote:
>
>> Isn't it possible there's a hardware problem, e.g., leading to a
>> failure/retry cycle?
>
> smartctl -a /dev/sda
> smartctl -a /dev/sdb
> smartctl -a /dev/sdc
>
> Compare them. If there was a write failure reported by the drive, md would have marked the device faulty.

SMART seems to think they are all OK, though my understanding of it is
limited (e.g., the logs showed SMART reporting Temperature_Celsius of
110, but I think that's a normalized value for a raw value of 42, meaning
the temperature is 42 degrees Celsius). Do I need to manually run a test
before the report reflects current conditions? At any rate, I did (just a
short one), and the drives passed.

The raw value (last column) for one of the parameters seems to be
changing extremely rapidly, and perhaps is overflowing:
# date; smartctl -a /dev/sda | grep 195
Mon Jan 7 23:11:03 PST 2013
195 Hardware_ECC_Recovered 0x001a 059 024 000 Old_age Always - 241377818
# date; smartctl -a /dev/sda | grep 195
Mon Jan 7 23:12:26 PST 2013
195 Hardware_ECC_Recovered 0x001a 056 024 000 Old_age Always - 3600778
Perhaps someone on this list can interpret that better than I.
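
For reading rows like the ones above, a minimal sketch that splits a
smartctl attribute row into its columns and checks the normalized value
against the threshold. It assumes the usual `smartctl -A` column order
(ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED,
WHEN_FAILED, RAW_VALUE). The helper name smart_row_summary is made up for
this example and is not part of smartmontools. The raw column is
vendor-specific, so the sketch only prints it and does not try to decode it.

```shell
#!/bin/sh
# smart_row_summary: hypothetical helper, reads smartctl attribute rows on stdin.
# Assumed columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
smart_row_summary() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 {
        # An attribute counts as failing when the normalized VALUE
        # has dropped to or below THRESH.
        status = ($4 + 0 <= $6 + 0) ? "AT_OR_BELOW_THRESH" : "ok"
        printf "%s %s value=%d worst=%d thresh=%d raw=%s %s\n",
               $1, $2, $4, $5, $6, $10, status
    }'
}

# The first row reported above for /dev/sda:
echo "195 Hardware_ECC_Recovered 0x001a 059 024 000 Old_age Always - 241377818" \
    | smart_row_summary
```

On a live system the same helper could be fed with
`smartctl -A /dev/sda | smart_row_summary`. With a threshold of 000, as in
the row above, the normalized value can never reach the threshold, so this
attribute alone would not cause SMART to report a failure.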

My thought was disk failure (not necessarily complete failure) -> system
lockup. Continued disk flakiness leads to continued slowness after
restart as, e.g., the disk keeps retrying operations that fail.

I infer you have a different scenario in mind: the system freaks out for
a reason unrelated to the disk. The resulting shutdown (which was a
manual power off) leaves the arrays and their components in a funky
state. When the system comes back, it fixes things up.

Even if this did happen, in RAID 1 wouldn't some of the components
(partitions in my case) be deemed good and others bad, with the latter
resynced to match the former? And if that is happening, why can't I
tell which partition(s) are master (considered good) and which are not
(being overwritten with contents of the master)?
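
One possible way to look for a "master" is to compare the event counter
that each member's md superblock records; a member with a lower count
than its siblings would be the stale one. The sketch below assumes that
`mdadm --examine` prints a line of the form "Events : N". The helper name
events_of and the partition names in the usage comment are placeholders,
not taken from this system. If the counts match on every member, md would
presumably regard all of them as equally current, which may be why no
master is visible during a resync after an unclean shutdown.

```shell
#!/bin/sh
# events_of: hypothetical helper, reads `mdadm --examine` output on stdin
# and prints whatever follows "Events :" on the first such line.
events_of() {
    awk -F: '/^[[:space:]]*Events[[:space:]]*:/ {
        gsub(/[[:space:]]/, "", $2)   # strip padding around the counter
        print $2
        exit
    }'
}

# Usage sketch (as root); /dev/sda1 and /dev/sdb1 are placeholder members:
#   for part in /dev/sda1 /dev/sdb1; do
#       printf '%s %s\n' "$part" "$(mdadm --examine "$part" | events_of)"
#   done
#
# While a resync is running, `cat /proc/mdstat` shows its progress.
```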

The sync just completed, so I can no longer poke around while the
rebuild is in progress. Bad for learning and diagnosis, but good for
almost every other purpose.

Ross
