Thread: 52 messages, 8 authors, 2019-06-13

Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

From: Srivatsa S. Bhat <hidden>
Date: 2019-05-30 08:39:11
Also in: cgroups, linux-ext4, linux-fsdevel, lkml

On 5/23/19 4:32 PM, Srivatsa S. Bhat wrote:
On 5/22/19 7:30 PM, Srivatsa S. Bhat wrote:
On 5/22/19 3:54 AM, Paolo Valente wrote:
On 22 May 2019, at 12:01, Srivatsa S. Bhat [off-list ref] wrote:

On 5/22/19 2:09 AM, Paolo Valente wrote:
First, thank you very much for testing my patches, and, above all, for
sharing those huge traces!

According to your traces, the residual 20% lower throughput that you
record is due to the fact that the BFQ injection mechanism takes a few
hundredths of a second to stabilize, at the beginning of the workload.
During that setup time, the throughput is equal to the dreadful ~60-90 KB/s
that you see without this new patch.  After that time, there
seems to be no loss according to the trace.

The problem is that a loss lasting only a few hundredths of a second is
nevertheless not negligible for a write workload that lasts only 3-4
seconds.  Could you please try writing a larger file?
I tried running dd for longer (about 100 seconds), but still saw around
1.4 MB/s throughput with BFQ, and between 1.5 MB/s - 1.6 MB/s with
mq-deadline and noop.
Ok, then now the cause is the periodic reset of the mechanism.

It would be super easy to fill this gap, by just gearing the mechanism
toward a very aggressive injection.  The problem is maintaining
control.  As you can imagine from the performance gap between CFQ (or
BFQ with malfunctioning injection) and BFQ with this fix, it is very
hard to succeed in maximizing the throughput while at the same time
preserving control over per-group I/O.
Ah, I see. Just to make sure that this fix doesn't overly optimize for
total throughput (because of the testcase we've been using) and end up
causing regressions in per-group I/O control, I ran a test with
multiple simultaneous dd instances, each writing to a different
portion of the filesystem (well separated, to induce seeks), and each
dd task bound to its own blkio cgroup. I saw similar results with and
without this patch, and the throughput was equally distributed among
all the dd tasks.
Actually, it turns out that I ran the dd tasks directly on the block
device for this experiment, and not on top of ext4. I'll redo this on
ext4 and report back soon.
With all your patches applied (including waker detection for the low
latency case), I ran four simultaneous dd instances, each writing to a
different ext4 partition, and each dd task bound to its own blkio
cgroup.  The throughput continued to be well distributed among the dd
tasks, as shown below (I increased dd's block size from 512B to 8KB
for these experiments to get double-digit throughput numbers, so as to
make comparisons easier).

bfq with low_latency = 1:

819200000 bytes (819 MB, 781 MiB) copied, 16452.6 s, 49.8 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17139.6 s, 47.8 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17251.7 s, 47.5 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17384 s, 47.1 kB/s

bfq with low_latency = 0:

819200000 bytes (819 MB, 781 MiB) copied, 16257.9 s, 50.4 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17204.5 s, 47.6 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17220.6 s, 47.6 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17348.1 s, 47.2 kB/s
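One way to quantify "well distributed" is to recompute each task's rate from the byte count and elapsed time in the dd summaries above, then look at each task's share of the aggregate and the spread between the fastest and slowest task. The following is only a minimal sketch: the helper names (`rates`, `summarize`) and the spread metric, (max - min) / mean, are illustrative assumptions and were not part of the original test procedure.

```python
import re

# dd summary lines copied verbatim from the results above.
LOW_LATENCY_1 = """\
819200000 bytes (819 MB, 781 MiB) copied, 16452.6 s, 49.8 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17139.6 s, 47.8 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17251.7 s, 47.5 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17384 s, 47.1 kB/s
"""

LOW_LATENCY_0 = """\
819200000 bytes (819 MB, 781 MiB) copied, 16257.9 s, 50.4 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17204.5 s, 47.6 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17220.6 s, 47.6 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17348.1 s, 47.2 kB/s
"""

# Captures the byte count and the elapsed seconds of a dd summary line.
DD_LINE = re.compile(r"^(\d+) bytes .* copied, ([\d.]+) s, ")


def rates(report):
    """Return per-task throughput in bytes/s, recomputed as bytes / seconds."""
    result = []
    for line in report.splitlines():
        match = DD_LINE.match(line.strip())
        if match:
            result.append(int(match.group(1)) / float(match.group(2)))
    return result


def summarize(report):
    """Return each task's share of the total and the relative spread.

    The spread is (max - min) / mean, a hypothetical fairness indicator
    chosen for this sketch; 0 would mean perfectly equal throughput.
    """
    per_task = rates(report)
    total = sum(per_task)
    mean = total / len(per_task)
    return {
        "shares": [rate / total for rate in per_task],
        "spread": (max(per_task) - min(per_task)) / mean,
        "total_kBps": total / 1000,
    }


if __name__ == "__main__":
    for name, report in (("low_latency = 1", LOW_LATENCY_1),
                         ("low_latency = 0", LOW_LATENCY_0)):
        summary = summarize(report)
        shares = ", ".join(f"{share:.1%}" for share in summary["shares"])
        print(f"{name}: total {summary['total_kBps']:.1f} kB/s, "
              f"shares [{shares}], spread {summary['spread']:.1%}")
```

With four tasks, a perfectly even split would give each a 25% share, so the printed shares can be read directly against that baseline for both low_latency settings.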
 
Regards,
Srivatsa
VMware Photon OS