Re: [PATCH] mm: reduce spinlock contention in release_pages()

From: Michal Hocko <hidden>
Date: 2021-11-26 10:48:35
Also in: linux-mm, lkml

On Fri 26-11-21 14:50:44, Hao Lee wrote:

On Thu, Nov 25, 2021 at 10:18 PM Michal Hocko [off-list ref] wrote:

[...]

Could you share more about the requirements for those? Why is unmapping in
any of their hot paths that really require low latencies? As long as
unmapping requires a shared resource, like the lru lock, you have a
bottleneck.

We deploy best-effort (BE) jobs (e.g. bigdata, machine learning) and
latency-critical (LC) jobs (e.g. map navigation, payment services) on the
same servers to improve resource utilization. BE jobs run only briefly but
consume a lot of memory, and they run periodically. The LC jobs are
long-running services and are sensitive to delays because jitter may cause
customer churn.

Have you tried to isolate those workloads by memory cgroups? That could
help with the lru lock at least. You are likely going to hit other locks on
the way though, e.g. the zone lock in the page allocator, but that might be
less problematic in the end. If you isolate your long running services
to a different NUMA node then you can get even less interaction.
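
A minimal sketch of the cgroup isolation idea, assuming cgroup v2 mounted at
/sys/fs/cgroup and the memory controller enabled in the parent's
cgroup.subtree_control. The helper name and the group name "be-jobs" are
hypothetical, used only for illustration:

```python
import os


def move_to_cgroup(pid, name, root="/sys/fs/cgroup"):
    """Create cgroup `name` under `root` if missing and move `pid` into it.

    Assumes a cgroup v2 hierarchy at `root`. On a real system this needs
    sufficient privileges, and the memory controller must be enabled for
    the child group to get its own memory accounting.
    """
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)
    # Writing a PID to cgroup.procs migrates that process into the group.
    with open(os.path.join(path, "cgroup.procs"), "w") as f:
        f.write(str(pid))
    return path
```

A BE job launcher could call move_to_cgroup(os.getpid(), "be-jobs") before
starting the workload, with the LC services placed in a separate group.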

If a batch of BE jobs finishes simultaneously, a lot of memory is freed and
spinlock contention occurs. BE jobs don't care about this contention, but it
causes them to spend more time in kernel mode, so LC jobs running on the
same CPU cores are delayed and jitter occurs. (Kernel preemption is disabled
on our servers, and we try not to separate LC/BE using cpuset in order to
achieve "complete mixture deployment".) Then the LC service owners complain
about poor service stability. This scenario has occurred several times, so
we want to find a way to avoid it.

It will be hard and a constant fight to get reasonably low latencies on
a non-preemptible kernel. It would likely be better to partition CPUs
between latency sensitive and BE jobs. I can see how that might not be
really practical, but especially with non-preemptible kernels you have a
large space for priority inversions that are hard to foresee or contain.
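
For illustration only, a rough sketch of what CPU partitioning could look
like from userspace, assuming Linux and Python's os.sched_setaffinity. The
function names and the split policy (lowest-numbered CPUs go to LC) are
assumptions, not something prescribed in this thread:

```python
import os


def partition_cpus(cpus, lc_count):
    """Split CPU ids into disjoint latency-critical (LC) and BE sets.

    The lowest-numbered `lc_count` CPUs go to LC; the rest go to BE.
    """
    ordered = sorted(cpus)
    if not 0 < lc_count < len(ordered):
        raise ValueError("need at least one CPU on each side")
    return set(ordered[:lc_count]), set(ordered[lc_count:])


def pin_current_process(cpus):
    """Restrict the calling process to the given CPUs (Linux only)."""
    os.sched_setaffinity(0, cpus)
```

A BE launcher could compute the split from os.sched_getaffinity(0) and call
pin_current_process() with the BE set, so that time spent spinning on a lock
in kernel mode stays off the CPUs reserved for LC services.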
-- 
Michal Hocko
SUSE Labs
