Re: Bisected: Kernel 4.14 + has 3 times higher write IO latency than Kernel 4.4 with raid1
From: Jinpu Wang <jinpu.wang@cloud.ionos.com>
Date: 2019-08-06 11:58:10
Also in:
lkml
On Tue, Aug 6, 2019 at 9:54 AM Jinpu Wang [off-list ref] wrote:
On Tue, Aug 6, 2019 at 1:46 AM NeilBrown [off-list ref] wrote:quoted
On Mon, Aug 05 2019, Jinpu Wang wrote:quoted
Hi Neil, For the md higher write IO latency problem, I bisected it to these commits: 4ad23a97 MD: use per-cpu counter for writes_pending 210f7cd percpu-refcount: support synchronous switch to atomic mode. Do you maybe have an idea? How can we fix it?Hmmm.... not sure.Hi Neil, Thanks for reply, detailed result in line.quoted
My guess is that the set_in_sync() call from md_check_recovery() is taking a long time, and is being called too often. Could you try two experiments please.Baseline on 5.3-rc3: root@ib2:/home/jwang# cat md_lat_ib2_5.3.0-rc3-1-storage_2019_0806_092003.log write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 fio-2.2.10 Starting 1 process write-test: (groupid=0, jobs=1): err= 0: pid=2621: Tue Aug 6 09:20:44 2019 write: io=800004KB, bw=20000KB/s, iops=4999, runt= 40001msec slat (usec): min=2, max=69992, avg= 5.37, stdev=374.95 clat (usec): min=0, max=147, avg= 2.42, stdev=13.57 lat (usec): min=2, max=70079, avg= 7.84, stdev=376.07 clat percentiles (usec): | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 1], | 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1], | 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 1], | 99.00th=[ 96], 99.50th=[ 125], 99.90th=[ 137], 99.95th=[ 139], | 99.99th=[ 141] bw (KB /s): min=18454, max=21608, per=100.00%, avg=20005.15, stdev=352.24 lat (usec) : 2=98.52%, 4=0.01%, 10=0.01%, 20=0.02%, 50=0.06% lat (usec) : 100=0.46%, 250=0.94% cpu : usr=4.64%, sys=0.00%, ctx=197118, majf=0, minf=11 IO depths : 1=98.5%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=1.3%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=200001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=32 Run status group 0 (all jobs): WRITE: io=800004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s, mint=40001msec, maxt=40001msec Disk stats (read/write): md0: ios=60/199436, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00% ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%quoted
1/ set /sys/block/md0/md/safe_mode_delay to 20 or more. It defaults to about 0.2.only set 20 to safe_mode_delay, give a nice improvement. root@ib2:/home/jwang# cat md_lat_ib2_5.3.0-rc3-1-storage_2019_0806_092144.log write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 fio-2.2.10 Starting 1 process write-test: (groupid=0, jobs=1): err= 0: pid=2676: Tue Aug 6 09:22:25 2019 write: io=800004KB, bw=20000KB/s, iops=4999, runt= 40001msec slat (usec): min=2, max=99490, avg= 2.98, stdev=222.46 clat (usec): min=0, max=103, avg= 0.96, stdev= 4.51 lat (usec): min=2, max=99581, avg= 3.99, stdev=222.71 clat percentiles (usec): | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0], | 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1], | 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 1], | 99.00th=[ 1], 99.50th=[ 1], 99.90th=[ 90], 99.95th=[ 91], | 99.99th=[ 95] bw (KB /s): min=20000, max=20008, per=100.00%, avg=20001.82, stdev= 3.38 lat (usec) : 2=99.72%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01% lat (usec) : 100=0.25%, 250=0.01% cpu : usr=3.17%, sys=1.48%, ctx=199470, majf=0, minf=11 IO depths : 1=99.7%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=200001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=32 Run status group 0 (all jobs): WRITE: io=800004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s, mint=40001msec, maxt=40001msec Disk stats (read/write): md0: ios=60/199461, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00% ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%quoted
2/ comment out the call the set_in_sync() in md_check_recovery().Only commented out set_in_sync get a better improvement root@ib2:/home/jwang# cat md_lat_ib2_5.3.0-rc3-1-storage+_2019_0806_093340.log write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 fio-2.2.10 Starting 1 process write-test: (groupid=0, jobs=1): err= 0: pid=2626: Tue Aug 6 09:34:20 2019 write: io=800004KB, bw=20000KB/s, iops=4999, runt= 40001msec slat (usec): min=2, max=29, avg= 2.49, stdev= 0.72 clat (usec): min=0, max=101, avg= 0.78, stdev= 1.17 lat (usec): min=2, max=117, avg= 3.34, stdev= 1.25 clat percentiles (usec): | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0], | 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1], | 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 1], | 99.00th=[ 1], 99.50th=[ 1], 99.90th=[ 1], 99.95th=[ 1], | 99.99th=[ 72] bw (KB /s): min=20000, max=20008, per=100.00%, avg=20002.03, stdev= 3.50 lat (usec) : 2=99.96%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01% lat (usec) : 100=0.02%, 250=0.01% cpu : usr=4.17%, sys=0.00%, ctx=199951, majf=0, minf=12 IO depths : 1=100.0%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=200001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=32 Run status group 0 (all jobs): WRITE: io=800004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s, mint=40001msec, maxt=40001msec Disk stats (read/write): md0: ios=60/199435, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00% ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00% With both applied root@ib2:/home/jwang# cat md_lat_ib2_5.3.0-rc3-1-storage+_2019_0806_093916.log write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32 fio-2.2.10 Starting 1 process write-test: (groupid=0, jobs=1): err= 0: pid=2709: Tue Aug 6 09:39:57 2019 write: io=800004KB, bw=20000KB/s, iops=4999, runt= 40001msec slat (usec): min=2, max=16, avg= 2.46, stdev= 0.69 clat (usec): min=0, max=100, avg= 0.61, stdev= 1.18 lat (usec): min=2, max=104, avg= 3.12, stdev= 1.33 clat percentiles (usec): | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0], | 30.00th=[ 0], 40.00th=[ 0], 50.00th=[ 1], 60.00th=[ 1], | 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 1], | 99.00th=[ 1], 99.50th=[ 1], 99.90th=[ 1], 99.95th=[ 1], | 99.99th=[ 70] bw (KB /s): min=20000, max=20008, per=100.00%, avg=20002.73, stdev= 3.82 lat (usec) : 2=99.96%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01% lat (usec) : 100=0.02%, 250=0.01% cpu : usr=3.33%, sys=1.31%, ctx=199941, majf=0, minf=12 IO depths : 1=100.0%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=200001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=32 Run status group 0 (all jobs): WRITE: io=800004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s, mint=40001msec, maxt=40001msec Disk stats (read/write): md0: ios=60/199460, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00% ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00% ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%quoted
Then run the least separately after each of these changes.quoted
I the second one makes a difference, I'd like to know how often it gets called - and why. The test if ( ! ( (mddev->sb_flags & ~ (1<<MD_SB_CHANGE_PENDING)) || test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) || test_bit(MD_RECOVERY_DONE, &mddev->recovery) || (mddev->external == 0 && mddev->safemode == 1) || (mddev->safemode == 2 && !mddev->in_sync && mddev->recovery_cp == MaxSector) )) return;
I added following debug message:
--- a/drivers/md/md.c
+++ b/drivers/md/md.c@@ -2404,6 +2404,8 @@ static void export_array(struct mddev *mddev) static bool set_in_sync(struct mddev *mddev) { lockdep_assert_held(&mddev->lock); + pr_info("md %s in_sync is %d, sb_flags %ld, recovery %ld,
external %d, safemode %d, recovery_cp %llu\n",mdname(mddev),
mddev->in_sync, mddev->sb_flags, mddev->recovery, mddev->external,
mddev->safemode, (unsigned long long)mddev->recovery_cp);
+
if (!mddev->in_sync) {
mddev->sync_checkers++;
spin_unlock(&mddev->lock);@@ -2411,6 +2413,7 @@ static bool set_in_sync(struct mddev *mddev) spin_lock(&mddev->lock); if (!mddev->in_sync && percpu_ref_is_zero(&mddev->writes_pending)) { + pr_info("set_in_sync\n"); mddev->in_sync = 1; /* * Ensure ->in_sync is visible before we clear
The relevant dmesg: [ 103.976374] md/raid1:md0: not clean -- starting background reconstruction [ 103.976479] md/raid1:md0: active with 2 out of 2 mirrors [ 103.976597] md0: detected capacity change from 0 to 535822336 [ 103.976721] md: resync of RAID array md0 [ 104.427369] md md0 in_sync is 0, sb_flags 6, recovery 3, external 0, safemode 0, recovery_cp 393216 [ 105.052186] md md0 in_sync is 0, sb_flags 2, recovery 3, external 0, safemode 0, recovery_cp 458752 [ 105.127718] set_in_sync [ 105.133299] md md0 in_sync is 0, sb_flags 2, recovery 3, external 0, safemode 0, recovery_cp 524288 [ 105.207723] set_in_sync [ 105.213318] md md0 in_sync is 0, sb_flags 2, recovery 3, external 0, safemode 0, recovery_cp 589824 [ 105.287717] set_in_sync [ 105.293299] md md0 in_sync is 0, sb_flags 2, recovery 3, external 0, safemode 0, recovery_cp 655360 [ 105.347718] set_in_sync [ 105.353267] md md0 in_sync is 0, sb_flags 2, recovery 3, external 0, safemode 0, recovery_cp 720896 [ 105.397717] set_in_sync [ 105.403367] md md0 in_sync is 0, sb_flags 2, recovery 3, external 0, safemode 0, recovery_cp 786432 [ 105.457718] set_in_sync [ 105.981324] md md0 in_sync is 0, sb_flags 2, recovery 3, external 0, safemode 0, recovery_cp 851968 [ 106.017718] set_in_sync [ 106.022784] md md0 in_sync is 0, sb_flags 2, recovery 3, external 0, safemode 0, recovery_cp 917504 [ 106.057718] set_in_sync [ 106.063357] md md0 in_sync is 0, sb_flags 2, recovery 3, external 0, safemode 0, recovery_cp 983040 [ 106.117718] set_in_sync [ 106.122851] md: md0: resync done. [ 106.123261] md md0 in_sync is 0, sb_flags 5, recovery 19, external 0, safemode 0, recovery_cp 18446744073709551615 [ 106.157741] md md0 in_sync is 0, sb_flags 0, recovery 32, external 0, safemode 0, recovery_cp 18446744073709551615 [ 106.217718] set_in_sync [ 144.707722] md md0 in_sync is 0, sb_flags 0, recovery 0, external 0, safemode 0, recovery_cp 18446744073709551615 [ 144.767720] set_in_sync Please let me know if you need further information, I can also test any draft patch if needed. Thanks, Jack Wang
quoted
should normally return when doing lots of IO - I'd like to know which condition causes it to not return.I will check, and report later today. Thanks again! Jack Wangquoted
Thanks, NeilBrown