Thread (7 messages), 2 authors, 2017-03-13

Re: interesting case of a hung 'recovery'

From: Jack Wang <hidden>
Date: 2017-03-09 09:13:01

2017-03-09 8:39 GMT+01:00 Eyal Lebedinsky [off-list ref]:
Bump.

On 18/02/17 23:14, Eyal Lebedinsky wrote:
I should start by saying that this is an old Fedora 19 system

Executive summary: after '--add'ing a new member a 'recovery' starts but
'sync_max' is not reset.

$ uname -a
Linux e7.eyal.emu.id.au 3.14.27-100.fc19.x86_64 #1 SMP Wed Dec 17 19:36:34 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

$ sudo mdadm --version
mdadm - v4.0 - 2017-01-09
so the issue may have been fixed since.

I had a disk fail in a raid6. After some 'pending' sectors were logged I
decided to do a 'check' around that location (set sync_min/max and echo
'check'). Sure enough it elicited disk errors, but the disk did not
recover and it was kicked out of the array. Moreover it became
unresponsive. It needed a power cycle so I shut down and rebooted the
machine.
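
A minimal sketch of the ranged check described above, assuming the md
sysfs files sync_min, sync_max and sync_action accept the values shown
(sector counts, 'max', and 'check'). The function name ranged_check and
its arguments are hypothetical; the original message does not give the
exact commands or sector values that were used.

```shell
# Sketch only: run a 'check' over a limited sector range of an md array.
# $1 = md sysfs directory (e.g. /sys/block/md127/md)
# $2 = first sector, $3 = sector at which the check should pause
#      (both are caller-chosen placeholders, not values from the report)
ranged_check() {
    md_dir="$1"
    start_sector="$2"
    end_sector="$3"
    # Lift any earlier upper limit first so the new lower bound is accepted.
    echo max > "$md_dir/sync_max" || return 1
    echo "$start_sector" > "$md_dir/sync_min" || return 1
    echo "$end_sector" > "$md_dir/sync_max" || return 1
    # Start the check; it pauses when it reaches sync_max.
    echo check > "$md_dir/sync_action"
}
```

Usage, as root: ranged_check /sys/block/md127/md "$start" "$end".
According to this report, the value written to sync_max was still in
place when a later recovery ran.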

Not one to give up easily I tried the check again, with the same result.
It was time to '--remove' this array member. I then '--add'ed a new disk
which started a recovery.

A few hours later I noticed that it slowed down. A lot. It actually did
not progress at all for a few hours (I was away from the machine).

As I was staring at the screen (for a long while) I realised that it
stopped at 55.5%, and this number is exactly where the original 'check'
failed (I still do not understand why with my bad memory I remembered
this number).

I checked 'sync_completed' and it was proper.
I then examined 'sync_max' and it was wrong - it had the location where
the very early 'check' failed earlier in the day. It was the same sector
where the recovery is now paused - looks related.

I decided to take a (small) risk and do
    # echo 'max' >/sys/block/md127/md/sync_max
at which point the recovery moved on. It should be finished in about 5
hours.
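
A minimal sketch of how the stall described above might be detected
before resetting the limit. It assumes that sync_max reads back either a
sector number or 'max', that sync_completed reads "done / total" in
sectors (or a word such as 'none'), and that a paused recovery reports a
completed count at or past the limit, as this report suggests. The
function name stuck_at_sync_max is hypothetical.

```shell
# Sketch only: succeed (exit 0) when progress has reached a numeric sync_max.
# $1 = md sysfs directory (e.g. /sys/block/md127/md)
stuck_at_sync_max() {
    md_dir="$1"
    limit=$(cat "$md_dir/sync_max") || return 1
    # First field of sync_completed is the number of sectors done.
    done_sectors=$(cut -d' ' -f1 "$md_dir/sync_completed") || return 1
    case "$limit" in
        ""|*[!0-9]*) return 1 ;;   # 'max' or empty: no numeric limit set
    esac
    case "$done_sectors" in
        ""|*[!0-9]*) return 1 ;;   # 'none' or similar: nothing in progress
    esac
    [ "$done_sectors" -ge "$limit" ]
}
```

Usage, as root:
    stuck_at_sync_max /sys/block/md127/md && echo max >/sys/block/md127/md/sync_max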

I do not think that it is correct for 'sync_max' to not be set to 'max'
when a new member is added - it surely requires a full recovery.

I really hope (and expect) that this was actually fixed, but this note may
help others facing the same predicament.

cheers
--
Eyal Lebedinsky (eyal@eyal.emu.id.au)
You'd better attach much more detailed information so that people can help.

e.g.:
https://raid.wiki.kernel.org/index.php/Asking_for_help

For the problem you reported, it would also help to provide the kernel
dmesg and the output for blocked tasks via
"echo w > /proc/sysrq-trigger", and maybe also
"echo t > /proc/sysrq-trigger".
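
A minimal sketch of capturing that output to a file, assuming sysrq is
enabled on the machine and the commands run as root. The function name
dump_blocked_tasks and the overridable trigger path are hypothetical and
exist only to keep the sketch self-contained.

```shell
# Sketch only: ask the kernel to log blocked tasks, then save the kernel log.
# $1 = file to write the kernel log to
# $2 = sysrq trigger file (optional; defaults to /proc/sysrq-trigger)
dump_blocked_tasks() {
    out_file="$1"
    trigger="${2:-/proc/sysrq-trigger}"
    # 'w' asks for a dump of tasks in uninterruptible (blocked) state.
    echo w > "$trigger" || return 1
    # The dump lands in the kernel ring buffer, so save dmesg afterwards.
    dmesg > "$out_file"
}
```

Usage, as root: dump_blocked_tasks /tmp/md-blocked.txt, then attach the
resulting file to the report.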

Cheers,
Jack