Re: [PATCHSET block/for-next] IO cost model based work-conserving proportional controller

From: Paolo Valente <hidden>
Date: 2019-08-31 07:10:34
Also in: bpf, linux-block, lkml

Hi Tejun,
thank you very much for this extra information; I'll try the
configuration you suggest.  On that note, is this still the branch
to use

https://kernel.googlesource.com/pub/scm/linux/kernel/git/tj/cgroup/+/refs/heads/review-iocost-v2

even after the issue spotted two days ago [1]?

Thanks,
Paolo

[1] https://lkml.org/lkml/2019/8/29/910

On 31 Aug 2019, at 08:53, Tejun Heo [off-list ref] wrote:

Hello, Paolo.

On Thu, Aug 22, 2019 at 10:58:22AM +0200, Paolo Valente wrote:

> Ok, I tried with the parameters reported for a SATA SSD:
>
> rpct=95.00 rlat=10000 wpct=95.00 wlat=20000 min=50.00 max=400.00

Sorry, I should have explained it in a lot more detail.

There are two things - the cost model and qos params.  The default SSD
cost model parameters are derived by averaging a number of mainstream
SSD parameters.  As a ballpark, this can be good enough because while
the overall performance varied quite a bit from one SSD to another,
the relative cost of different types of IOs wasn't drastically
different.

However, this means that the performance baseline can easily be way
off from 100% depending on the specific device in use.  In the above,
you're specifying min/max which limits how far the controller is
allowed to adjust the overall cost estimation.  50% and 400% are
numbers which may make sense if the cost model parameter is expected
to fall somewhere around 100% - i.e. if the parameters are for that
specific device.

In your script, you're using default model params but limiting vrate
range.  It's likely that your device is significantly slower than what
the default parameters are expecting.  However, because min vrate is
limited to 50%, it doesn't throttle below 50% of the estimated cost,
so if the device is significantly slower than that, nothing gets
controlled.
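
The effect of the min clamp can be sketched with a toy model.  This is
not the kernel's algorithm: clamp_vrate() and can_throttle() are
hypothetical helpers, and the 34.74% figure is the vrate observed in the
tighter-latency run on the Micron drive later in this message, reused
here only as an example of a rate the controller may want to reach.  The
1%..10000% range stands in for "no restriction" and is an illustrative
assumption.

```python
def clamp_vrate(wanted_pct: float, vmin_pct: float, vmax_pct: float) -> float:
    """Clamp the vrate the controller would like to use into [vmin, vmax]."""
    return max(vmin_pct, min(vmax_pct, wanted_pct))


def can_throttle(wanted_pct: float, vmin_pct: float, vmax_pct: float) -> bool:
    """True if the clamped vrate can go as low as the rate that is wanted."""
    return clamp_vrate(wanted_pct, vmin_pct, vmax_pct) <= wanted_pct


if __name__ == "__main__":
    wanted = 34.74  # vrate (%) seen in a later run of this thread, as an example
    # With min=50 max=400 the controller is stuck above what it wants.
    print(clamp_vrate(wanted, 50.0, 400.0), can_throttle(wanted, 50.0, 400.0))
    # With a wide (illustrative) range the wanted vrate is reachable.
    print(clamp_vrate(wanted, 1.0, 10000.0), can_throttle(wanted, 1.0, 10000.0))
```

In this model, any device that needs a vrate below min is issued more
IO than the latency targets allow, which matches the "nothing gets
controlled" outcome described above.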

> and with a simpler configuration [1]: one target doing random reads

And without QoS latency targets, the controller is purely going by
queue depth depletion which works fine for many usual workloads such
as larger reads and writes but isn't likely to serve low-concurrency
latency-sensitive IOs well.

> and only four interferers doing sequential reads, with all the
> processes (groups) having the same weight.
>
> But there seemed to be little or no control on I/O, because the target
> got only 1.84 MB/s, against 1.15 MB/s without any control.
>
> So I tried with rlat=1000 and rlat=100.

And this won't do anything, as all rlat/wlat do is regulate how the
overall vrate should be adjusted, and vrate is being min'd at 50%.

> Control did improve, with the same results for both values of rlat.  The
> problem is that these results still seem rather bad, both in terms of
> throughput guaranteed to the target and in terms of total throughput.
> Here are results compared with BFQ (throughputs measured in MB/s):
>
>                         io.weight    BFQ
> target's throughput     3.415        6.224
> total throughput        159.14       321.375

So, what should have been configured is something like

$ echo '8:0 enable=1 rpct=95 rlat=10000 wpct=95 wlat=20000' > /sys/fs/cgroup/io.cost.qos

which just says "target 10ms p(95) read latency and 20ms p(95) write
latency" without putting any restrictions on vrate range.
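
As a sketch of how such a line is put together, the hypothetical helper
below (qos_line() is not part of any kernel tooling) builds the string
from its fields.  It assumes rlat/wlat are in microseconds, which is
what "rlat=10000" meaning 10ms implies, and it only emits min/max when
the caller asks for a vrate restriction.  It prints the line rather
than writing it; writing to /sys/fs/cgroup/io.cost.qos needs root.

```python
from typing import Optional


def qos_line(dev: str, rpct: float, rlat_us: int, wpct: float, wlat_us: int,
             vmin_pct: Optional[float] = None,
             vmax_pct: Optional[float] = None) -> str:
    """Build one io.cost.qos configuration line for device `dev` (MAJ:MIN)."""
    fields = [dev, "enable=1",
              f"rpct={rpct}", f"rlat={rlat_us}",
              f"wpct={wpct}", f"wlat={wlat_us}"]
    # Only restrict the vrate range when explicitly requested.
    if vmin_pct is not None:
        fields.append(f"min={vmin_pct}")
    if vmax_pct is not None:
        fields.append(f"max={vmax_pct}")
    return " ".join(fields)


if __name__ == "__main__":
    # Latency targets only, no vrate restriction (the line suggested above).
    print(qos_line("8:0", 95, 10000, 95, 20000))
```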

With that, I got the following on Micron_1100_MTFDDAV256TBN which is a
pretty old 256GB SATA drive.

 Aggregated throughput:
	   min         max         avg     std_dev     conf99%
	266.73      275.71      271.38     4.05144     45.7635
 Interfered total throughput:
	   min         max         avg     std_dev
	 9.608      13.008      10.941    0.664938

During the run, iocost-monitor.py looked like the following.

 sda RUN  per=40ms cur_per=2074.351:v1008.844 busy= +0 vrate= 59.85% params=ssd_dfl(CQ)
			    active    weight      hweight% inflt% del_ms usages%
 InterfererGroup0             *   100/  100  22.94/ 20.00   0.00  0*000 023:023:023
 InterfererGroup1             *   100/  100  22.94/ 20.00   0.00  0*000 023:023:023
 InterfererGroup2             *   100/  100  22.94/ 20.00   0.00  0*000 025:023:021
 InterfererGroup3             *   100/  100  22.94/ 20.00   0.00  0*000 023:023:023
 interfered                   *    36/  100   8.26/ 20.00   0.42  0*000 003:004:004
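
The hweight% column above can be reproduced with simple arithmetic.
The sketch below assumes a flat hierarchy (all five groups are siblings)
and reads each weight pair as "weight currently in use / configured
weight"; both are assumptions based on the output, not on the tool's
source.  hweight_pct() is a hypothetical helper.

```python
from typing import Dict


def hweight_pct(weights: Dict[str, int]) -> Dict[str, float]:
    """Share of the device, in percent, for sibling groups with given weights."""
    total = sum(weights.values())
    return {name: 100.0 * w / total for name, w in weights.items()}


if __name__ == "__main__":
    names = [f"InterfererGroup{i}" for i in range(4)] + ["interfered"]

    # Configured weights: every group has 100.
    configured = hweight_pct({n: 100 for n in names})

    # Weights in use: interfered is shown as 36 because it cannot use its share.
    in_use = hweight_pct({**{n: 100 for n in names}, "interfered": 36})

    for n in names:
        print(f"{n:18s} {in_use[n]:6.2f}/{configured[n]:6.2f}")
```

With interfered at 36, the total weight is 436, so each interferer gets
100/436 and interfered gets 36/436 of the device.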

Note that interfered is reported to only use 3-4% of the disk capacity
while configured to consume 20%.  This is because, with a
single-concurrency 4k randread job, its ability to consume IO capacity
is limited by the completion latency.

10ms is a pretty generous (i.e. more work-conserving) target for SSDs.
Let's say we're willing to tighten it to trade off total work for
tighter latency.

$ echo '8:0 enable=1 rpct=95 rlat=2500 wpct=95 wlat=5000' > /sys/fs/cgroup/io.cost.qos

 Aggregated throughput:
	   min         max         avg     std_dev     conf99%
	147.06      172.18     154.608      11.783     133.096
 Interfered total throughput:
	   min         max         avg     std_dev
	17.992       19.32      18.698    0.313105

and the monitoring output

 sda RUN  per=10ms cur_per=2927.152:v1556.138 busy= -2 vrate= 34.74% params=ssd_dfl(CQ)
			    active    weight      hweight% inflt% del_ms usages%
 InterfererGroup0             *   100/  100  20.00/ 20.00 386.11  0*000 070:020:020
 InterfererGroup1             *   100/  100  20.00/ 20.00 386.11  0*000 070:020:020
 InterfererGroup2             *   100/  100  20.00/ 20.00 386.11  0*000 070:020:020
 InterfererGroup3             *   100/  100  20.00/ 20.00   0.00  0*000 020:020:020
 interfered                   *   100/  100  20.00/ 20.00   1.21  0*000 010:014:017

The following happened.

* The vrate is now hovering way lower.  The device is now doing less
 total work to achieve tighter completion latencies.

* The overall throughput dropped but interfered's utilization is now
 significantly higher along with its bandwidth from lower completion
 latencies.
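
The link between completion latency and bandwidth for a job with one IO
in flight can be made concrete.  The sketch assumes that "interfered" is
a single 4k randread job at queue depth 1, that throughput is reported
in MB/s with 1 MB = 10^6 bytes, and that the job is never idle between
IOs; the resulting latencies are therefore rough estimates, not
measurements.  implied_latency_ms() is a hypothetical helper.

```python
def implied_latency_ms(throughput_mb_s: float, block_bytes: int = 4096) -> float:
    """Mean per-IO latency implied by the throughput of a queue-depth-1 job."""
    iops = throughput_mb_s * 1_000_000 / block_bytes
    return 1000.0 / iops


if __name__ == "__main__":
    # Average "interfered" throughput from the two QoS runs above.
    for label, mb_s in (("rlat=10000", 10.941), ("rlat=2500", 18.698)):
        print(f"{label}: {mb_s} MB/s -> ~{implied_latency_ms(mb_s):.3f} ms per IO")
```

Under these assumptions, the higher bandwidth of the second run
corresponds directly to a shorter mean time per IO.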

For reference:

[Disabled]

 Aggregated throughput:
	   min         max         avg     std_dev     conf99%
	493.98      511.37     502.808     9.52773     107.621
 Interfered total throughput:
	   min         max         avg     std_dev
	 0.056       0.304       0.107   0.0691052

[Enabled, no QoS config]

 Aggregated throughput:
	   min         max         avg     std_dev     conf99%
	429.07      449.59     437.597     8.64952     97.7015
 Interfered total throughput:
	   min         max         avg     std_dev
	 0.456        3.12        1.08    0.774318
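
To put the four runs side by side, the sketch below takes the average
figures reported in this message and computes the ratio of interfered
throughput to aggregated throughput.  Whether the aggregated figure
includes the interfered job is not stated here, so the ratio is only a
rough indicator.  RUNS and interfered_ratio_pct() are names introduced
for this illustration.

```python
# (label, aggregated avg, interfered avg), as reported in this message
RUNS = [
    ("disabled",              502.808, 0.107),
    ("enabled, no QoS",       437.597, 1.08),
    ("rlat=10000 wlat=20000", 271.38,  10.941),
    ("rlat=2500 wlat=5000",   154.608, 18.698),
]


def interfered_ratio_pct(aggregate: float, interfered: float) -> float:
    """Interfered throughput as a percentage of aggregated throughput."""
    return 100.0 * interfered / aggregate


if __name__ == "__main__":
    for label, agg, intf in RUNS:
        ratio = interfered_ratio_pct(agg, intf)
        print(f"{label:24s} agg={agg:8.3f} interfered={intf:7.3f} ratio={ratio:6.2f}%")
```

Each step from "disabled" to the tightest latency target lowers the
aggregated throughput and raises the interfered throughput, which is
the trade-off described above.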

Thanks.

-- 
tejun
