Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
From: Paolo Valente <hidden>
Date: 2019-05-21 06:23:15
Also in: cgroups, linux-ext4, linux-fsdevel, lkml
On 21 May 2019, at 00:45, Srivatsa S. Bhat [off-list ref] wrote:

> On 5/20/19 3:19 AM, Paolo Valente wrote:
>> On 18 May 2019, at 22:50, Srivatsa S. Bhat [off-list ref] wrote:
>>> On 5/18/19 11:39 AM, Paolo Valente wrote:
>>>> I've addressed these issues in my last batch of improvements for
>>>> BFQ, which landed in the upcoming 5.2. If you give it a try, and
>>>> still see the problem, then I'll be glad to reproduce it, and
>>>> hopefully fix it for you.
>>>
>>> Hi Paolo,
>>>
>>> Thank you for looking into this!
>>>
>>> I just tried current mainline at commit 72cf0b07, but unfortunately
>>> didn't see any improvement:
>>>
>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>>
>>> With mq-deadline, I get:
>>>
>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.90981 s, 1.3 MB/s
>>>
>>> With bfq, I get:
>>>
>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 84.8216 s, 60.4 kB/s
>>
>> Hi Srivatsa,
>> thanks for reproducing this on mainline. I seem to have reproduced a
>> bonsai-tree version of this issue. Before digging into the block
>> trace, I'd like to ask you for some feedback.
>>
>> First, in my test, the total throughput of the disk happens to be
>> about 20 times as high as that enjoyed by dd, regardless of the I/O
>> scheduler. I guess this massive overhead is normal with dsync, but
>> I'd like to know whether it is about the same on your side. This will
>> help me understand whether I'll actually be analyzing about the same
>> problem as yours.
>
> Do you mean to say the throughput obtained by dd'ing directly to the
> block device (bypassing the filesystem)?

No no, I mean simply what follows.

1) In one terminal:

[root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
10000+0 records in
10000+0 records out
5120000 bytes (5,1 MB, 4,9 MiB) copied, 14,6892 s, 349 kB/s

2) In a second terminal, while the dd is in progress in the first
terminal:

$ iostat -tmd /dev/sda 3
Linux 5.1.0+ (localhost.localdomain)  20/05/2019  _x86_64_  (2 CPU)

...
20/05/2019 11:40:17
Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda            2288,00         0,00         9,77          0         29

20/05/2019 11:40:20
Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda            2325,33         0,00         9,93          0         29

20/05/2019 11:40:23
Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda            2351,33         0,00        10,05          0         30
...

As you can see, the overall throughput (~10 MB/s) is more than 20 times
as high as the dd throughput (~350 KB/s). But the dd is the only source
of I/O.

Do you also see such a huge difference?

Thanks,
Paolo

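The comparison above boils down to one division. As a minimal sketch, the ratio between the device-level write rate reported by iostat and the rate reported by dd can be computed as follows. The function name `throughput_ratio` is hypothetical, both arguments are assumed to be in MB/s with a dot as decimal separator, and the small difference between decimal and binary megabytes is ignored.

```shell
#!/bin/sh
# Ratio of device-level write throughput (iostat MB_wrtn/s) to the
# throughput reported by dd. Both arguments are in MB/s.
# LC_ALL=C forces a dot as the decimal separator in the output.
throughput_ratio() {
    LC_ALL=C awk -v dev="$1" -v dd="$2" 'BEGIN { printf "%.1f\n", dev / dd }'
}

# Figures from the outputs above: 9.93 MB/s on sda, 349 kB/s for dd.
throughput_ratio 9.93 0.349
```

With the figures from this run the ratio comes out at roughly 28, consistent with "more than 20 times".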
> That does give me a 20x speedup with bs=512, but much more with a
> bigger block size (achieving a max throughput of about 110 MB/s).
>
> dd if=/dev/zero of=/dev/sdc bs=512 count=10000 conv=fsync
> 10000+0 records in
> 10000+0 records out
> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 0.15257 s, 33.6 MB/s
>
> dd if=/dev/zero of=/dev/sdc bs=4k count=10000 conv=fsync
> 10000+0 records in
> 10000+0 records out
> 40960000 bytes (41 MB, 39 MiB) copied, 0.395081 s, 104 MB/s
>
> I'm testing this on a Toshiba MG03ACA1 (1TB) hard disk.
>
>> Second, the commands I used follow. Do they implement your test case
>> correctly?
>>
>> [root@localhost tmp]# mkdir /sys/fs/cgroup/blkio/testgrp
>> [root@localhost tmp]# echo $BASHPID > /sys/fs/cgroup/blkio/testgrp/cgroup.procs
>> [root@localhost tmp]# cat /sys/block/sda/queue/scheduler
>> [mq-deadline] bfq none
>> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>> 10000+0 records in
>> 10000+0 records out
>> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 14,6892 s, 349 kB/s
>> [root@localhost tmp]# echo bfq > /sys/block/sda/queue/scheduler
>> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>> 10000+0 records in
>> 10000+0 records out
>> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 20,1953 s, 254 kB/s
>
> Yes, this is indeed the testcase, although I see a much bigger drop in
> performance with bfq, compared to the results from your setup.
>
> Regards,
> Srivatsa
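
The test case quoted above could be wrapped in a small script so that the same dd run is repeated under each scheduler. This is only a sketch under stated assumptions: the function names `active_scheduler` and `run_dsync_test` are hypothetical, the device name and file path are the ones used in the thread, and `run_dsync_test` has to be run as root from a shell that has already been placed in the test cgroup.

```shell
#!/bin/sh
# Print the active scheduler from a line such as "[mq-deadline] bfq none"
# (the format shown by /sys/block/<dev>/queue/scheduler). The active
# entry is the one enclosed in square brackets.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Run the dsync dd test once under the given scheduler (needs root).
# $1 = block device name (e.g. sda), $2 = scheduler, $3 = output file
run_dsync_test() {
    echo "$2" > "/sys/block/$1/queue/scheduler" || return 1
    printf 'scheduler: %s\n' \
        "$(active_scheduler < "/sys/block/$1/queue/scheduler")"
    dd if=/dev/zero of="$3" bs=512 count=10000 oflag=dsync
}

# Harmless demonstration of the parser on the line shown in the thread:
echo "[mq-deadline] bfq none" | active_scheduler

# Example invocation, left commented out because it changes system state:
# for s in mq-deadline bfq; do run_dsync_test sda "$s" /root/test.img; done
```

Printing the scheduler read back from sysfs before each run guards against comparing two runs that silently used the same scheduler.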