Thread (9 messages), 2 authors, 2011-12-12

Re: raid6 rebuild not starting

From: NeilBrown <hidden>
Date: 2011-12-12 03:01:19

On Sun, 11 Dec 2011 09:03:14 +0200 Anssi Hannula [off-list ref] wrote:
> Hi!
>
> After I rebooted during a raid6 rebuild, the rebuild didn't start again.
> Instead, there is a flood of "RAID conf printout"s that seemingly happen
> on array activity.
>
> All the devices show up properly in --detail and two devices are marked
> as "spare rebuilding", and I can access the contents of the array just
> fine, but the rebuild doesn't actually start. Is this a bug or am I
> missing something? :)
>
> I was initially on 2.6.38.8, but also tried 3.1.4 which seems to have
> the same issue. mdadm is 3.1.5.
>
> I'm not using start_ro and writing to the array doesn't trigger a
> rebuild either.
>
> Attached are --examine outputs before assembly, kernel log output on
> assembly, /proc/mdstat and --detail after assembly (on 3.1.4).

Thank you for the very detailed problem report.

Unfortunately it is a complete mystery to me what is happening.

The repeated "RAID conf printout" messages are almost certainly coming from
the end of raid5_remove_disk.
It is being called from remove_and_add_spares for each of the two devices
that are being rebuilt.  raid5_remove_disk declines to remove them because it
can keep rebuilding them.

remove_and_add_spares then counts them and notes there are 2.
md_check_recovery notes that this is > 0, so it should create a thread to run
md_do_sync.

md_do_sync should then print out a message like
  md: recovery of RAID array md0

but it doesn't.  So something went wrong.
There are three reasons that md_do_sync might not print a message:

1/ MD_RECOVERY_DONE is set.  As only md_do_sync ever sets it, that is
   unlikely, and in any case md_check_recovery clears it.
2/ mddev->ro != 0.  It is only ever set to 0, 1, or 2.  If it is 1 or 2
   then we would be able to see that in /proc/mdstat as a "(readonly)"
   status.  But we don't.
3/ MD_RECOVERY_INTR is set. Again, md_check_recovery clears this.  It does
   get set if kthread_should_stop() returns 'true', but that should only
   happen if kthread_stop() was called.  That is only called by
   md_unregister_thread and I cannot see any way that could be called.
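
To make the three cases easier to compare against what the printk's report, here is a rough user-space model of the checks. It is only a sketch of the reasoning above, not the kernel source: struct mddev_model, the MODEL_* bit positions, the skip_reason enum and why_no_message() are all hypothetical names made up for illustration, and the order of the checks is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit positions for this model only; the real definitions
 * are in drivers/md/md.h and should be checked there. */
enum { MODEL_RECOVERY_INTR = 3, MODEL_RECOVERY_DONE = 4 };

/* Stand-in for the two mddev fields discussed above. */
struct mddev_model {
	unsigned long recovery;	/* models mddev->recovery */
	int ro;			/* models mddev->ro: 0, 1 or 2 */
};

enum skip_reason { SKIP_NONE, SKIP_DONE, SKIP_RO, SKIP_INTR };

/* Return which of the three conditions, if any, would stop the
 * "md: recovery of RAID array ..." message from being printed. */
static enum skip_reason why_no_message(const struct mddev_model *m)
{
	assert(m != NULL);

	if (m->recovery & (1UL << MODEL_RECOVERY_DONE))
		return SKIP_DONE;	/* case 1 */
	if (m->ro != 0)
		return SKIP_RO;		/* case 2 */
	if (m->recovery & (1UL << MODEL_RECOVERY_INTR))
		return SKIP_INTR;	/* case 3 */
	return SKIP_NONE;		/* message should appear */
}
```

Feeding the values of mddev->recovery and mddev->ro reported by a printk into why_no_message() would say which of the three cases applies, or SKIP_NONE if none of them does and the cause lies elsewhere.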

So.  No idea.

Are you compiling these kernels yourself?
If so, could you:
 - put a printk in the top of md_do_sync to report the values of
   mddev->recovery and mddev->ro
 - print a message whenever md_unregister_thread is called
 - in md_check_recovery, in the
		if (mddev->ro) {
			/* Only thing we do on a ro array is remove
			 * failed devices.
			 */
			mdk_rdev_t *rdev;

   if statement, print the value of mddev->ro.

Then see which of those printk's fire, and what they tell us.
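
The value of mddev->recovery is a bitmask, so a raw number in the log is awkward to read. A small user-space helper along the following lines could turn it into flag names. This is a sketch under assumptions: decode_recovery() and the recovery_names table are hypothetical, and the bit order is what I believe drivers/md/md.h uses in kernels of this era, so it should be checked against the tree being built.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed bit order of the MD_RECOVERY_* flags (bit 0 first).
 * Verify against drivers/md/md.h in the kernel being tested. */
static const char *const recovery_names[] = {
	"RUNNING", "SYNC", "RECOVER", "INTR", "DONE",
	"NEEDED", "REQUESTED", "CHECK", "RESHAPE", "FROZEN",
};

/* Write the names of the set bits into buf, separated by '|'.
 * buf is left empty if no known bit is set; decoding stops early
 * if buf is too small. */
static void decode_recovery(unsigned long recovery, char *buf, size_t len)
{
	size_t used = 0;
	size_t count = sizeof(recovery_names) / sizeof(recovery_names[0]);

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	for (size_t bit = 0; bit < count; bit++) {
		if (!(recovery & (1UL << bit)))
			continue;
		int n = snprintf(buf + used, len - used, "%s%s",
				 used ? "|" : "", recovery_names[bit]);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* out of space: stop decoding */
		used += (size_t)n;
	}
}
```

With that table, a logged value of 0x21 would decode to bits 0 and 5, i.e. "RUNNING|NEEDED".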

NeilBrown
