Re: interesting case of a hung 'recovery'
From: Eyal Lebedinsky <hidden>
Date: 2017-03-13 04:12:20
Bump again... All the data I have is at the bottom. On 09/03/17 22:00, Eyal Lebedinsky wrote:
On 09/03/17 20:13, Jack Wang wrote:quoted
2017-03-09 8:39 GMT+01:00 Eyal Lebedinsky [off-list ref]:quoted
Bump. On 18/02/17 23:14, Eyal Lebedinsky wrote:quoted
I should start by saying that this is an old fedora 19 system Executive summary: after '--add'ing a new member a 'recovery' starts but 'sync_max' is not reset. $ uname -a Linux e7.eyal.emu.id.au 3.14.27-100.fc19.x86_64 #1 SMP Wed Dec 17 19:36:34 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux$ sudo mdadm --version mdadm - v4.0 - 2017-01-09quoted
so the issue may have been fixed since. I had a disk fail in a raid6. After some 'pending' sectors were logged I decided to do a 'check' around that location (set sync_min/max and echo 'check'). Sure enough it elicited disk errors, but the disk did not recover and it was kicked out of the array. Moreover it became unresponsive. It needed a power cycle so I shutdown and rebooted the machine. Not one to give up easily I tried the check again, with the same result. It was time to '--remove' this array member. I then '--add'ed a new disk which started a recovery. A few hours later I noticed that it slowed down. A lot. It actually did not progress at all for a few hours (I was away from the machine). As I was staring at the screen (for a long while) I realised that it stopped at 55.5%, and this number is exactly where the original 'check' failed (I still do not understand why with my bad memory I remembered this number). I checked 'sync_completed' and it was proper. I then examined 'sync_max' and it was wrong - it had the location where the very early 'check' failed earlier in the day. It was the same sector where it is now paused at - looks related. I decided to take a (small) risk and do # echo 'max' >/sys/block/md127/md/sync_max at which point the recovery moved on. It should be finished in about 5 hours. I do not think that it is correct for 'sync_max' to not be set to 'max' when a new member is added - it surely requires a full recovery. I really hope (and expect) that this was actually fixed, but this note may help others facing same predicament. cheers-- Eyal Lebedinsky (eyal@eyal.emu.id.au)You'd better offer attach much detailed information, then people can help.I know, but this is a live system and I cannot experiment with it. There were no unusual messages (see below) at all. The 'recovery' simply stopped. This is the usual behaviour when one sets sync_min/max and starts a (for example) 'check'. As I described above, I found that after the --add 'sync_max' was NOT set to 'max' and not to the full RAID size but was left at the value where the earlier 'check' failed. It failed when the disk completely disappeared. I guess the original 'check' was left pending, and when a replacement disk was added to the RAID it was automatically recovered (as it should) but the original 'check' somehow left something behind that should have been reset for the recovery.quoted
eg: https://raid.wiki.kernel.org/index.php/Asking_for_help For the problem you reported, better offer also kernel dmesg, output of blocking tasks via "echo w > /proc/sysrq-trigger" maybe also "echo t > /proc/sysrq-trigger"There was no blocked task, just that the recovery stopped progressing when it reached sync_max, as it should. The problem is that sync_max had the wrong value, the full (newly added) disk was supposed to be synced.quoted
Cheers,In case it helps, here is all the info I have of the event: ### I have a script that checks the RAID in a small region. I had an earlier error in ### on the disk and to check it my script effectively did: echo 4336657408 >sys/block/md127/md/sync_min echo 4339803136 >sys/block/md127/md/sync_max echo check >sys/block/md127/md/sync_action ### messages: Feb 18 13:46:31 e7 kernel: [ 976.688691] md: data-check of RAID array md127 Feb 18 13:46:31 e7 kernel: [ 976.693254] md: minimum _guaranteed_ speed: 150000 KB/sec/disk. Feb 18 13:46:31 e7 kernel: [ 976.699479] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check. Feb 18 13:46:31 e7 kernel: [ 976.709420] md: using 128k window, over a total of 3906885120k. Feb 18 13:46:31 e7 kernel: [ 976.715457] md: resuming data-check of md127 from checkpoint. ### the failure of the 'check' ... many i/o errors then sdf completely disappeared ... errors at sectors 4337414{000,040,168} Feb 18 13:47:08 e7 kernel: [ 1014.334781] md: super_written gets error=-5, uptodate=0 Feb 18 13:47:08 e7 kernel: [ 1014.340024] md/raid:md127: Disk failure on sdf1, disabling device. Feb 18 13:47:08 e7 kernel: [ 1014.340024] md/raid:md127: Operation continuing on 6 devices. Feb 18 13:47:08 e7 kernel: [ 1014.417307] md: md127: data-check interrupted. ### a new disk (sdj) is plugged in and partitioned. $ sudo mdadm /dev/md127 --add /dev/sdj1 $ cat /proc/mdstat Personalities : [raid6] [raid5] [raid4] md127 : active raid6 sdj1[11] sdf1[7](F) sdi1[8] sde1[9] sdh1[12] sdc1[0] sdg1[13] sdd1[10] 19534425600 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [UUU_UUU] [>....................] recovery = 0.7% (29805572/3906885120) finish=509.2min speed=126880K/sec bitmap: 7/30 pages [28KB], 65536KB chunk ### messages: Feb 18 14:23:10 e7 kernel: [ 3177.183250] md: bind<sdj1> Feb 18 14:23:10 e7 kernel: [ 3177.255529] md: recovery of RAID array md127 Feb 18 14:23:10 e7 kernel: [ 3177.259894] md: minimum _guaranteed_ speed: 150000 KB/sec/disk. Feb 18 14:23:10 e7 kernel: [ 3177.265994] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery. Feb 18 14:23:10 e7 kernel: [ 3177.275736] md: using 128k window, over a total of 3906885120k. ### the point when the recovery paused: 2017-02-18 20:02:48 [===========>.........] recovery = 55.4% (2166229192/3906885120) finish=372.8min speed=77803K/sec 2017-02-18 20:02:58 [===========>.........] recovery = 55.4% (2167083344/3906885120) finish=366.2min speed=79159K/sec 2017-02-18 20:03:08 [===========>.........] recovery = 55.4% (2167819876/3906885120) finish=374.8min speed=77316K/sec 2017-02-18 20:03:18 [===========>.........] recovery = 55.5% (2168520428/3906885120) finish=375.4min speed=77157K/sec 2017-02-18 20:03:28 [===========>.........] recovery = 55.5% (2168590848/3906885120) finish=489.4min speed=59194K/sec 2017-02-18 20:03:38 [===========>.........] recovery = 55.5% (2168590848/3906885120) finish=608.7min speed=47588K/sec 2017-02-18 20:03:48 [===========>.........] recovery = 55.5% (2168590848/3906885120) finish=728.1min speed=39786K/sec 2017-02-18 20:03:58 [===========>.........] recovery = 55.5% (2168590848/3906885120) finish=847.5min speed=34182K/sec ... 2017-02-18 22:36:44 0 [===========>.........] recovery = 55.5% (2168590848/3906885120) finish=110261.8min speed=262K/sec 2017-02-18 22:36:54 0 [===========>.........] recovery = 55.5% (2168590848/3906885120) finish=110381.2min speed=262K/sec 2017-02-18 22:37:04 0 [===========>.........] recovery = 55.5% (2168590848/3906885120) finish=110500.6min speed=262K/sec 2017-02-18 22:37:14 0 [===========>.........] recovery = 55.5% (2168590848/3906885120) finish=110619.9min speed=261K/sec # echo 'max' >/sys/block/md127/md/sync_max ### recovery now moves on: 2017-02-18 22:37:24 0 [===========>.........] recovery = 55.5% (2168938500/3906885120) finish=117500.2min speed=246K/sec 2017-02-18 22:37:34 0 [===========>.........] recovery = 55.5% (2169997568/3906885120) finish=105201.7min speed=275K/sec 2017-02-18 22:37:44 0 [===========>.........] recovery = 55.5% (2171066120/3906885120) finish=90962.0min speed=318K/sec 2017-02-18 22:37:54 0 [===========>.........] recovery = 55.5% (2172125192/3906885120) finish=269.9min speed=107101K/sec 2017-02-18 22:38:04 0 [===========>.........] recovery = 55.6% (2173114372/3906885120) finish=272.1min speed=106165K/sec 2017-02-18 22:38:14 0 [===========>.........] recovery = 55.6% (2174004224/3906885120) finish=287.3min speed=100492K/sec ### and it completed over six hours later: Feb 19 04:49:16 e7 kernel: [55167.633100] md: md127: recovery done. cheers
-- Eyal Lebedinsky (eyal@eyal.emu.id.au)