Re: [PATCH V6 00/18] blk-throttle: add .low limit
From: Shaohua Li <shli@kernel.org>
Date: 2017-09-05 21:02:33
Also in:
lkml
On Thu, Aug 31, 2017 at 09:24:23AM +0200, Paolo VALENTE wrote:
On 15 Jan 2017, at 04:42, Shaohua Li [off-list ref] wrote:

Hi,

cgroup still lacks a good I/O controller. CFQ works well for hard disks, but not so well for SSDs. This patch set tries to add a conservative limit for blk-throttle. It isn't proportional scheduling, but it can help prioritize cgroups. There are several advantages to choosing blk-throttle:
- blk-throttle resides early in the block stack. It works for both bio and request based queues.
- blk-throttle is lightweight in general. It still takes the queue lock, but it's not hard to implement a per-cpu cache and remove the lock contention.
- blk-throttle doesn't use the 'idle disk' mechanism, which is used by CFQ/BFQ. That mechanism has been shown to harm performance on fast SSDs.

The patch set adds a new io.low limit for blk-throttle. It's only for cgroup2. The existing io.max is a hard throttling limit: a cgroup with a max limit never dispatches more IO than its max limit. io.low, in contrast, is best-effort throttling: cgroups with a 'low' limit can run above their 'low' limit at appropriate times. Specifically, if all cgroups reach their 'low' limit, all cgroups can run above their 'low' limit. If any cgroup runs under its 'low' limit, all other cgroups will run according to their 'low' limit. So the 'low' limit plays two roles: it lets cgroups use free bandwidth, and it protects each cgroup up to its 'low' limit.

An example usage: we have a high prio cgroup with a high 'low' limit and a low prio cgroup with a low 'low' limit. If the high prio cgroup isn't running, the low prio cgroup can run above its 'low' limit, so we don't waste the bandwidth. When the high prio cgroup runs and is below its 'low' limit, the low prio cgroup will run under its 'low' limit. This protects the high prio cgroup so that it can get more resources.

Hi Shaohua,
Hi, sorry for the late response.
I would like to ask you some questions, to make sure I fully understand how the 'low' limit and the idle-group detection work in your above scenario. Suppose that the drive has a random-I/O peak rate of 100 MB/s, the high prio group has a 'low' limit of 90 MB/s, and the low prio group has a 'low' limit of 10 MB/s. If
- the high prio process happens to do, say, only 5 MB/s for a given long time
- the low prio process constantly does greedy I/O
- the idle-group detection is not being used
then the low prio process is limited to 10 MB/s during this whole time interval, and only 10% of the device bandwidth is utilized.

To recover lost bandwidth through idle-group detection, we need to set a target IO latency for the high prio group. The high prio group should happen to be below the threshold, and thus be detected as idle, leaving the low prio group free to use all the bandwidth.

Here are my questions:

1) Is all I wrote above correct?
Yes
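The scenario above can be sketched as a toy model. This is only an illustration of the rule as worded in the cover letter, not the blk-throttle code: the `Group` class, `allowed_rates()`, `scenario()` and the "smallest demand first" sharing policy used once groups may exceed their 'low' limit are all assumptions made for the example.

```python
from dataclasses import dataclass

INF = float("inf")  # stands in for a greedy group's demand


@dataclass
class Group:
    name: str
    low: float          # 'low' limit in MB/s
    demand: float       # MB/s the group would issue if unthrottled
    idle: bool = False  # set when idle-group detection flags the group


def allowed_rates(groups, peak):
    """Return {name: MB/s} under a toy version of the 'low' rule."""
    # A group holds the others back if it runs under its 'low' limit
    # and has not been detected as idle.
    blocking = [g for g in groups if g.demand < g.low and not g.idle]
    if blocking:
        # Every group is held to its 'low' limit.
        return {g.name: min(g.demand, g.low) for g in groups}
    # Otherwise groups may exceed 'low'. Hypothetical sharing policy:
    # serve the smallest demands first, give the rest to greedy groups.
    rates, remaining = {}, peak
    for g in sorted(groups, key=lambda g: g.demand):
        rates[g.name] = min(g.demand, remaining)
        remaining -= rates[g.name]
    return rates


def scenario(idle_detected):
    """The 100 MB/s drive with 'low' limits of 90 and 10 MB/s."""
    groups = [
        Group("high", low=90, demand=5, idle=idle_detected),
        Group("low", low=10, demand=INF),
    ]
    return allowed_rates(groups, peak=100)
```

In this model `scenario(False)` holds the low prio group to 10 MB/s, that is 10% of the 100 MB/s peak (the high prio group's own 5 MB/s comes on top of that), while `scenario(True)` lets it take the remaining 95 MB/s.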
2) In particular, are there other, better mechanisms to saturate the bandwidth in the above scenario?
I assume this is the same question as 4) below.
If what I wrote above is correct:

3) Doesn't fluctuation occur? I mean: when the low prio group gets full bandwidth, the latency threshold of the high prio group may be exceeded, causing the high prio group to no longer be considered idle, and thus the low prio group to be limited again; this in turn will cause the threshold to no longer be exceeded, and so on.
That's true. We try to mitigate the fluctuation by increasing the low prio cgroup's bandwidth gradually, though.
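A minimal sketch of why a gradual increase helps, under loudly stated assumptions: the linear latency model in `high_prio_latency_us()`, the 4000 us target, the 10 MB/s step and the "drop back to the 'low' limit" downgrade are all hypothetical and are not taken from the patch set. The sketch only compares raising the low prio limit to the peak in one jump against raising it step by step.

```python
def high_prio_latency_us(low_prio_mbps):
    # Hypothetical model: high prio latency grows linearly with the
    # bandwidth the low prio group is allowed to use.
    return 1000 + 50 * low_prio_mbps


def simulate(policy, ticks, low=10, peak=100, step=10, target_us=4000):
    """Return the low prio limit (MB/s) in effect at each tick."""
    limit = low
    trace = []
    for _ in range(ticks):
        trace.append(limit)
        if high_prio_latency_us(limit) <= target_us:
            # High prio group looks idle: let the low prio group run faster.
            if policy == "jump":
                limit = peak
            else:  # "ramp": raise the limit one step at a time
                limit = min(peak, limit + step)
        else:
            # Target latency exceeded: fall back to the 'low' limit.
            limit = low
    return trace


def ticks_over_target(trace, target_us=4000):
    """Count the ticks in which the high prio target latency is exceeded."""
    return sum(high_prio_latency_us(l) > target_us for l in trace)
```

With these made-up numbers, `simulate("jump", 6)` alternates between 10 and 100 MB/s on every tick, while `simulate("ramp", 8)` climbs from 10 to 70 MB/s before falling back once. The fluctuation does not disappear, but the high prio target is exceeded less often and by a smaller margin.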
4) Is there a way to compute an appropriate target latency of the high prio group, if it is a generic group, for which the latency requirements of the processes it contains are only partially known or completely unknown? By appropriate target latency, I mean a target latency that enables the framework to fully utilize the device bandwidth while the high prio group is doing less I/O than its limit.
I'm not sure how we can do this. The device's max bandwidth varies with request size and read/write ratio, and we don't know when the max bandwidth is reached. I also think we must consider the case where the workloads never use the full bandwidth of a disk, which is pretty common for SSDs (at least in our environment).

Thanks,
Shaohua