Re: [PATCH] workqueue: Add WQ_SCHED_FIFO

From: Nathan Huckleberry <hidden>
Date: 2023-01-19 02:01:24
Also in: lkml

On Fri, Jan 13, 2023 at 1:11 PM Tejun Heo [off-list ref] wrote:

On Fri, Jan 13, 2023 at 01:07:02PM -0800, Nathan Huckleberry wrote:

quoted

Add a WQ flag that allows workqueues to use SCHED_FIFO with the least
imporant RT priority.  This can reduce scheduler latency for IO
post-processing when the CPU is under load without impacting other RT
workloads.  This has been shown to improve app startup time on Android
[1].

Scheduler latency affects several drivers as evidenced by [1], [2], [3],
[4].  Some of these drivers have moved post-processing into IRQ context.
However, this can cause latency spikes for real-time threads and jitter
related jank on Android.  Using a workqueue with SCHED_FIFO improves
scheduler latency without causing latency problems for RT threads.

[1]:
https://lore.kernel.org/linux-erofs/20230106073502.4017276-1-dhavale@google.com/ (local)
[2]:
https://lore.kernel.org/linux-f2fs-devel/20220802192437.1895492-1-daeho43@gmail.com/ (local)
[3]:
https://lore.kernel.org/dm-devel/20220722093823.4158756-4-nhuck@google.com/ (local)
[4]:
https://lore.kernel.org/dm-crypt/20200706173731.3734-1-ignat@cloudflare.com/ (local)

This change has been tested on dm-verity with the following fio config:

[global]
time_based
runtime=120

[do-verify]
ioengine=sync
filename=/dev/testing
rw=randread
direct=1

[burn_8x90%_qsort]
ioengine=cpuio
cpuload=90
numjobs=8
cpumode=qsort

Before:
clat (usec): min=13, max=23882, avg=29.56, stdev=113.29 READ:
bw=122MiB/s (128MB/s), 122MiB/s-122MiB/s (128MB/s-128MB/s), io=14.3GiB
(15.3GB), run=120001-120001msec

After:
clat (usec): min=13, max=23137, avg=19.96, stdev=105.71 READ:
bw=180MiB/s (189MB/s), 180MiB/s-180MiB/s (189MB/s-189MB/s), io=21.1GiB
(22.7GB), run=120012-120012msec

Hi Tejun,

Given that its use case mostly intersects with WQ_HIGHPRI, would it make
more sense to add a switch to alter its behavior instead? I don't really
like the idea of pushing the decision between WQ_HIGHPRI and WQ_SCHED_FIFO
to each user.

Do you think something similar should be done for WQ_UNBOUND? In most
places where WQ_HIGHPRI is used, WQ_UNBOUND is also used because it
boosts performance. However, I suspect that most of these benchmarks
were done on x86-64. I've found that WQ_UNBOUND significantly reduces
performance on arm64/Android.

From the documentation, using WQ_UNBOUND for performance doesn't seem
correct. It's only supposed to be used for long-running work. It might
make more sense to get rid of WQ_UNBOUND altogether and only move work
to unbound worker pools once it has stuck around for long enough.

Android will probably need to remove WQ_UNBOUND from all of these
performance critical users.

If there are performance benefits to using unbinding workqueues from
CPUs on x86-64, that should probably be a config flag, not controlled
by every user.

Thanks,
Huck

Thanks.

--
tejun

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help