Re: raid6 rebuild not starting

From: Anssi Hannula <hidden>
Date: 2011-12-12 05:22:17

On Mon, Dec 12, 2011 at 5:01 AM, NeilBrown [off-list ref] wrote:

On Sun, 11 Dec 2011 09:03:14 +0200 Anssi Hannula [off-list ref] wrote:

Hi!

After I rebooted during a raid6 rebuild, the rebuild didn't start again.
Instead, there is a flood of "RAID conf printout"s that seemingly happen
on array activity.

All the devices show up properly in --detail and two devices are marked
as "spare rebuilding", and I can access the contents of the array just
fine, but the rebuild doesn't actually start. Is this a bug or am I
missing something? :)

I was initially on 2.6.38.8, but also tried 3.1.4 which seems to have
the same issue. mdadm is 3.1.5.

I'm not using start_ro and writing to the array doesn't trigger a
rebuild either.

Attached are --examine outputs before assembly, kernel log output on
assembly, /proc/mdstat and --detail after assembly (on 3.1.4).

Thank you for the very detailed problem report.

Thanks for the quick response :)

Unfortunately it is a complete mystery to me what is happening.

The repeated "RAID conf printout" messages are almost certainly coming from
the end of raid5_remove_disk.
It is being called from remove_and_add_spares for each of the two devices
that are being rebuilt.  raid5_remove_disk declines to remove them because it
can keep rebuilding them.

remove_and_add_spares then counts them and notes there are 2.
md_check_recovery notes that this is > 0, so it should create a thread to run
md_do_sync.

md_do_sync should then print out a message like
 md: recovery of RAID array md0

but it doesn't.  So something went wrong.
There are three reasons that md_do_sync might not print a message:

1/ MD_RECOVERY_DONE is set.  As only md_do_sync ever sets it, that is
   unlikely, and in any case md_check_recovery clears it.
2/ mddev->ro != 0.  It is only ever set to 0, 1, or 2.  If it is 1 or 2
  then we would be able to see that in /proc/mdstat as a "(readonly)"
  status.  But we don't.
3/ MD_RECOVERY_INTR is set. Again, md_check_recovery clears this.  It does
  get set if kthread_should_stop() returns 'true', but that should only
  happen if kthread_stop() was called.  That is only called by
  md_unregister_thread and I cannot see any way that could be called.
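
For illustration, the three early exits listed above can be modelled in user space. This is only a minimal sketch, not the kernel source: the function name md_do_sync_model, the enum names and the bit positions are hypothetical stand-ins, and the real md_do_sync tests flags in mddev->recovery with test_bit() and does far more than this.

```c
#include <stdbool.h>

/* Bit positions are illustrative assumptions, not copied from md.h. */
enum { REC_INTR_BIT = 3, REC_DONE_BIT = 4 };

/* Why the (modelled) sync thread would return before printing anything. */
enum skip_reason { SYNC_STARTS, SKIP_DONE, SKIP_READONLY, SKIP_INTR };

/* User-space stand-in for the kernel's test_bit(). */
static bool test_flag(unsigned long flags, int bit)
{
    return ((flags >> bit) & 1UL) != 0;
}

/*
 * Hypothetical model of the checks described in 1/, 2/ and 3/:
 * 'recovery' plays the role of mddev->recovery, 'ro' of mddev->ro.
 */
enum skip_reason md_do_sync_model(unsigned long recovery, int ro)
{
    if (test_flag(recovery, REC_DONE_BIT))
        return SKIP_DONE;       /* 1/ the DONE flag is already set */
    if (ro != 0)
        return SKIP_READONLY;   /* 2/ array is read-only (ro is 1 or 2) */
    if (test_flag(recovery, REC_INTR_BIT))
        return SKIP_INTR;       /* 3/ recovery was interrupted */
    return SYNC_STARTS;         /* none apply: the message would be printed */
}
```

In this model, calling md_do_sync_model(0, 0) returns SYNC_STARTS, which is the case where the "md: recovery of RAID array" message would be expected.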

So.  No idea.

Are you compiling these kernels yourself?

Nope (used Mageia kernels), but I did now (3.1.5).

If so, could you:
 - put a printk in the top of md_do_sync to report the values of
  mddev->recovery and mddev->ro
 - print a message whenever md_unregister_thread is called
 - in md_check_recovery, in the
               if (mddev->ro) {
                       /* Only thing we do on a ro array is remove
                        * failed devices.
                        */
                       mdk_rdev_t *rdev;

 if statement, print the value of mddev->ro.

Then see which of those printk's fire, and what they tell us.

Only the last one does, and mddev->ro == 0.

For reference, attached is the used patch and resulting log output.

-- 
Anssi Hannula

Attachments

dmesg-dbg.txt [text/plain] 7847 bytes
dbg.patch [text/x-patch] 1013 bytes
