Re: Need help recovering RAID5 array

From: NeilBrown <hidden>
Date: 2011-08-09 02:55:49

On Mon, 8 Aug 2011 22:29:10 -0400 Stephen Muskiewicz
[off-list ref] wrote:

Well it looks like the first try didn't work, but adding the --force 
seems to have done the trick!  Here's the results:

snip

So it looks like I'm in business again!  Many thanks!

Great!

This does lead to a question: Do you recommend (and is it safe on CentOS 
5.5?) for me to use the updated (3.2.2 with your patch) version of mdadm 
going forward in place of the CentOS version (2.6.9)?

I wouldn't keep that patch.  It was a little hack to get your array working
again.  I wouldn't recommend using it without expert advice...

Other than that ... 3.2.2 certainly fixes bugs and adds features over 2.6.9,
but maybe it adds some bugs too...  I would say that it is safe, but probably
not really necessary.
i.e. up to you :-)

quoted

I wonder how the event count got that high.  There aren't enough seconds
since the birth of the universe for it to have happened naturally...
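
As a rough check on that remark, the arithmetic can be sketched in C. The
13.8-billion-year age of the universe and the Julian year length are
assumptions supplied here, not figures from this thread, and the function
names are made up for illustration. The sketch only compares a 64-bit count
against that number of seconds; md can bump the event count more than once
per second, so this is a plausibility bound and not a hard limit.

```c
#include <stdint.h>

/* Assumption: the universe is roughly 13.8 billion years old. */
#define UNIVERSE_AGE_YEARS 13800000000ULL
/* Julian year: 365.25 days * 86400 seconds per day. */
#define SECONDS_PER_YEAR   31557600ULL

/* Approximate number of seconds since the birth of the universe. */
uint64_t universe_age_seconds(void)
{
    return UNIVERSE_AGE_YEARS * SECONDS_PER_YEAR;
}

/* Returns 1 if a counter bumped at most once per second since the
 * birth of the universe could not have reached `events`, else 0. */
int exceeds_one_per_second(uint64_t events)
{
    return events > universe_age_seconds();
}
```

Under these assumptions the bound is about 4.35e17, which fits comfortably
in 64 bits, so a corrupted 64-bit event count can easily exceed it.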

Any chance it might be related to these kernel messages? I just noticed 
(guess I should be paying more attention to my logs) that there are tons 
of these messages repeated in my /var/log/messages file.  However as far 
as the RAID arrays themselves, we haven't seen any problems while they 
are running so I'm not sure what's causing these or whether they are 
insignificant.  Again, speculation on my part but given the huge event 
count from mdadm and the number of these messages it might seem that 
they are somehow related....

Jul 31 04:02:13 libthumper1 kernel: program diskmond is using a deprecated SCSI ioctl, please convert it to SG_IO
Jul 31 04:02:26 libthumper1 last message repeated 47 times
Jul 31 04:12:11 libthumper1 kernel: md: bug in file drivers/md/md.c, line 1659

I need to know the exact kernel version to find out what this line is.... I
could guess but I would probably be wrong.

Jul 31 04:12:11 libthumper1 kernel:
Jul 31 04:12:11 libthumper1 kernel: md: **********************************
Jul 31 04:12:11 libthumper1 kernel: md: * <COMPLETE RAID STATE PRINTOUT> *
Jul 31 04:12:11 libthumper1 kernel: md: **********************************
Jul 31 04:12:11 libthumper1 kernel: md53: <sdk1><sdai1><sds1><sdam1><sdo1><sdau1><sdaq1><sdw1><sdaa1><sdae1>
Jul 31 04:12:11 libthumper1 kernel: md: rdev sdk1, SZ:488383744 F:0 S:1 DN:10
Jul 31 04:12:11 libthumper1 kernel: md: rdev superblock:
Jul 31 04:12:11 libthumper1 kernel: md:  SB: (V:1.0.0) ID:<be475f67.00000000.00000000.00000000> CT:81f4e22f
Jul 31 04:12:11 libthumper1 kernel: md:     L-2009873429 S1801675106 ND:1834971253 RD:1869771369 md114 LO:65536 CS:196610
Jul 31 04:12:11 libthumper1 kernel: md:     UT:00000000 ST:0 AD:976767728 WD:0 FD:976767984 SD:0 CSUM:00000000 E:00000000
Jul 31 04:12:11 libthumper1 kernel:      D  0:  DISK<N:-1,(-1,-1),R:-1,S:-1>
Jul 31 04:12:11 libthumper1 kernel:      D  1:  DISK<N:-1,(-1,-1),R:-1,S:-1>
Jul 31 04:12:11 libthumper1 kernel:      D  2:  DISK<N:-1,(-1,-1),R:-1,S:-1>
Jul 31 04:12:11 libthumper1 kernel:      D  3:  DISK<N:-1,(-1,-1),R:-1,S:-1>
Jul 31 04:12:11 libthumper1 kernel: md:     THIS:  DISK<N:0,(0,0),R:0,S:0>
Jul 31 04:12:11 libthumper1 kernel: md: rdev superblock:
Jul 31 04:12:11 libthumper1 kernel: md:  SB: (V:1.0.0) ID:<be475f67.00000000.00000000.00000000> CT:81f4e22f
Jul 31 04:12:11 libthumper1 kernel: md:     L-2009873429 S1801675106 ND:1834971253 RD:1869771369 md114 LO:65536 CS:196610
Jul 31 04:12:11 libthumper1 kernel: md:     UT:00000000 ST:0 AD:976767728 WD:0 FD:976767984 SD:0 CSUM:00000000 E:00000000

<snip...and on and on>

Did it really start repeating at this point?  I would have expected a bit
more first.

So if you get me the kernel version and confirm that this really is all in the
logs except for identical repeats, I'll see if I can figure out what might
have caused it - and then if it could be related to your original problem.

Of course given how old the CentOS mdadm is, maybe by updating it I'll 
be fixing this problem as well?

In general running newer code should be safer and easier to support.  Don't
know if it would fix this problem yet though.


NeilBrown

If not, I'd be willing to help delve deeper if it's something worth 
investigating.

Again, Thanks a ton for all your help and quick replies!

Cheers!
-steve

quoted

Thanks,
NeilBrown

diff --git a/super1.c b/super1.c
index 35e92a3..4a3341a 100644
--- a/super1.c
+++ b/super1.c

@@ -803,6 +803,8 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
  		       __le64_to_cpu(sb->data_size));
  	} else if (strcmp(update, "_reshape_progress")==0)
  		sb->reshape_position = __cpu_to_le64(info->reshape_progress);
+	else if (strcmp(update, "summaries") == 0)
+		sb->events = __cpu_to_le64(4);
  	else
  		rv = -1;
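
For readers who want to see what the added branch in the diff above does in
isolation, here is a minimal standalone sketch. It is not mdadm code:
`struct mock_sb`, `store_le64`, `load_le64` and `mock_update` are
hypothetical stand-ins, and the byte-wise store is only assumed to be
equivalent to what `__cpu_to_le64` achieves. It shows an update named
"summaries" overwriting a 64-bit event count, held in little-endian byte
order, with the value 4, while any other update name is rejected.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the superblock: only the field the patch touches. */
struct mock_sb {
    uint8_t events[8];  /* 64-bit event count, little-endian byte order */
};

/* Store a 64-bit value in little-endian order regardless of host endianness. */
void store_le64(uint8_t out[8], uint64_t v)
{
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(v >> (8 * i));
}

/* Read back a 64-bit little-endian value. */
uint64_t load_le64(const uint8_t in[8])
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v |= (uint64_t)in[i] << (8 * i);
    return v;
}

/* Mirrors the shape of the patched branch: returns 0 when the update name
 * is handled, -1 otherwise (like rv = -1 in the diff). */
int mock_update(struct mock_sb *sb, const char *update)
{
    assert(sb != NULL && update != NULL);
    if (strcmp(update, "summaries") == 0) {
        store_le64(sb->events, 4);  /* overwrite the event count with 4 */
        return 0;
    }
    return -1;
}
```

Starting from an all-ones event count, calling `mock_update(&sb, "summaries")`
leaves `load_le64(sb.events)` equal to 4; an unrecognised name returns -1 and
leaves the field untouched.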

--

To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
