Thread (14 messages) 14 messages, 4 authors, 2019-08-20

Re: Bisected: Kernel 4.14 + has 3 times higher write IO latency than Kernel 4.4 with raid1

From: Jinpu Wang <jinpu.wang@cloud.ionos.com>
Date: 2019-08-06 07:54:35
Also in: lkml

On Tue, Aug 6, 2019 at 1:46 AM NeilBrown [off-list ref] wrote:
On Mon, Aug 05 2019, Jinpu Wang wrote:
quoted
Hi Neil,

For the md higher write IO latency problem, I bisected it to these commits:

4ad23a97 MD: use per-cpu counter for writes_pending
210f7cd percpu-refcount: support synchronous switch to atomic mode.

Do you maybe have an idea? How can we fix it?
Hmmm.... not sure.
Hi Neil,

Thanks for reply, detailed result in line.
My guess is that the set_in_sync() call from md_check_recovery()
is taking a long time, and is being called too often.

Could you try two experiments please.
Baseline on 5.3-rc3:
root@ib2:/home/jwang# cat md_lat_ib2_5.3.0-rc3-1-storage_2019_0806_092003.log
write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.2.10
Starting 1 process

write-test: (groupid=0, jobs=1): err= 0: pid=2621: Tue Aug  6 09:20:44 2019
  write: io=800004KB, bw=20000KB/s, iops=4999, runt= 40001msec
    slat (usec): min=2, max=69992, avg= 5.37, stdev=374.95
    clat (usec): min=0, max=147, avg= 2.42, stdev=13.57
     lat (usec): min=2, max=70079, avg= 7.84, stdev=376.07
    clat percentiles (usec):
     |  1.00th=[    0],  5.00th=[    0], 10.00th=[    0], 20.00th=[    1],
     | 30.00th=[    1], 40.00th=[    1], 50.00th=[    1], 60.00th=[    1],
     | 70.00th=[    1], 80.00th=[    1], 90.00th=[    1], 95.00th=[    1],
     | 99.00th=[   96], 99.50th=[  125], 99.90th=[  137], 99.95th=[  139],
     | 99.99th=[  141]
    bw (KB  /s): min=18454, max=21608, per=100.00%, avg=20005.15, stdev=352.24
    lat (usec) : 2=98.52%, 4=0.01%, 10=0.01%, 20=0.02%, 50=0.06%
    lat (usec) : 100=0.46%, 250=0.94%
  cpu          : usr=4.64%, sys=0.00%, ctx=197118, majf=0, minf=11
  IO depths    : 1=98.5%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=1.3%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=200001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: io=800004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s,
mint=40001msec, maxt=40001msec

Disk stats (read/write):
    md0: ios=60/199436, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0,
aggrutil=0.00%
  ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

1/ set  /sys/block/md0/md/safe_mode_delay
   to 20 or more.  It defaults to about 0.2.
only set 20 to safe_mode_delay,  give a nice improvement.
root@ib2:/home/jwang# cat md_lat_ib2_5.3.0-rc3-1-storage_2019_0806_092144.log
write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.2.10
Starting 1 process

write-test: (groupid=0, jobs=1): err= 0: pid=2676: Tue Aug  6 09:22:25 2019
  write: io=800004KB, bw=20000KB/s, iops=4999, runt= 40001msec
    slat (usec): min=2, max=99490, avg= 2.98, stdev=222.46
    clat (usec): min=0, max=103, avg= 0.96, stdev= 4.51
     lat (usec): min=2, max=99581, avg= 3.99, stdev=222.71
    clat percentiles (usec):
     |  1.00th=[    0],  5.00th=[    0], 10.00th=[    0], 20.00th=[    0],
     | 30.00th=[    1], 40.00th=[    1], 50.00th=[    1], 60.00th=[    1],
     | 70.00th=[    1], 80.00th=[    1], 90.00th=[    1], 95.00th=[    1],
     | 99.00th=[    1], 99.50th=[    1], 99.90th=[   90], 99.95th=[   91],
     | 99.99th=[   95]
    bw (KB  /s): min=20000, max=20008, per=100.00%, avg=20001.82, stdev= 3.38
    lat (usec) : 2=99.72%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
    lat (usec) : 100=0.25%, 250=0.01%
  cpu          : usr=3.17%, sys=1.48%, ctx=199470, majf=0, minf=11
  IO depths    : 1=99.7%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=200001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: io=800004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s,
mint=40001msec, maxt=40001msec

Disk stats (read/write):
    md0: ios=60/199461, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0,
aggrutil=0.00%
  ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

2/ comment out the call the set_in_sync() in md_check_recovery().
Only commented out set_in_sync get a better improvement
root@ib2:/home/jwang# cat md_lat_ib2_5.3.0-rc3-1-storage+_2019_0806_093340.log
write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.2.10
Starting 1 process

write-test: (groupid=0, jobs=1): err= 0: pid=2626: Tue Aug  6 09:34:20 2019
  write: io=800004KB, bw=20000KB/s, iops=4999, runt= 40001msec
    slat (usec): min=2, max=29, avg= 2.49, stdev= 0.72
    clat (usec): min=0, max=101, avg= 0.78, stdev= 1.17
     lat (usec): min=2, max=117, avg= 3.34, stdev= 1.25
    clat percentiles (usec):
     |  1.00th=[    0],  5.00th=[    0], 10.00th=[    0], 20.00th=[    0],
     | 30.00th=[    1], 40.00th=[    1], 50.00th=[    1], 60.00th=[    1],
     | 70.00th=[    1], 80.00th=[    1], 90.00th=[    1], 95.00th=[    1],
     | 99.00th=[    1], 99.50th=[    1], 99.90th=[    1], 99.95th=[    1],
     | 99.99th=[   72]
    bw (KB  /s): min=20000, max=20008, per=100.00%, avg=20002.03, stdev= 3.50
    lat (usec) : 2=99.96%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
    lat (usec) : 100=0.02%, 250=0.01%
  cpu          : usr=4.17%, sys=0.00%, ctx=199951, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=200001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: io=800004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s,
mint=40001msec, maxt=40001msec

Disk stats (read/write):
    md0: ios=60/199435, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0,
aggrutil=0.00%
  ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

With both applied
root@ib2:/home/jwang# cat md_lat_ib2_5.3.0-rc3-1-storage+_2019_0806_093916.log
write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.2.10
Starting 1 process

write-test: (groupid=0, jobs=1): err= 0: pid=2709: Tue Aug  6 09:39:57 2019
  write: io=800004KB, bw=20000KB/s, iops=4999, runt= 40001msec
    slat (usec): min=2, max=16, avg= 2.46, stdev= 0.69
    clat (usec): min=0, max=100, avg= 0.61, stdev= 1.18
     lat (usec): min=2, max=104, avg= 3.12, stdev= 1.33
    clat percentiles (usec):
     |  1.00th=[    0],  5.00th=[    0], 10.00th=[    0], 20.00th=[    0],
     | 30.00th=[    0], 40.00th=[    0], 50.00th=[    1], 60.00th=[    1],
     | 70.00th=[    1], 80.00th=[    1], 90.00th=[    1], 95.00th=[    1],
     | 99.00th=[    1], 99.50th=[    1], 99.90th=[    1], 99.95th=[    1],
     | 99.99th=[   70]
    bw (KB  /s): min=20000, max=20008, per=100.00%, avg=20002.73, stdev= 3.82
    lat (usec) : 2=99.96%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
    lat (usec) : 100=0.02%, 250=0.01%
  cpu          : usr=3.33%, sys=1.31%, ctx=199941, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=200001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: io=800004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s,
mint=40001msec, maxt=40001msec

Disk stats (read/write):
    md0: ios=60/199460, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0,
aggrutil=0.00%
  ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
Then run the least separately after each of these changes.
I the second one makes a difference, I'd like to know how often it gets
called - and why.  The test
        if ( ! (
                (mddev->sb_flags & ~ (1<<MD_SB_CHANGE_PENDING)) ||
                test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
                test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
                (mddev->external == 0 && mddev->safemode == 1) ||
                (mddev->safemode == 2
                 && !mddev->in_sync && mddev->recovery_cp == MaxSector)
                ))
                return;

should normally return when doing lots of IO - I'd like to know
which condition causes it to not return.
I will check, and report later today.
Thanks again!

Jack Wang
Thanks,
NeilBrown
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help