Thread: 181 messages, 12 authors, 2023-11-22

Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

From: "Paul E. McKenney" <paulmck@kernel.org>
Date: 2023-01-20 16:08:37
Also in: linux-arm-kernel, linux-mm, lkml

On Fri, Jan 20, 2023 at 09:57:05AM +0100, Michal Hocko wrote:
On Thu 19-01-23 11:17:07, Paul E. McKenney wrote:
On Thu, Jan 19, 2023 at 01:52:14PM +0100, Michal Hocko wrote:
On Wed 18-01-23 11:01:08, Suren Baghdasaryan wrote:
On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney [off-list ref] wrote:
[...]
There are a couple of possibilities here.

First, if I am remembering correctly, the time between the call_rcu()
and invocation of the corresponding callback was taking multiple seconds,
but that was because the kernel was built with CONFIG_LAZY_RCU=y in
order to save power by batching RCU work over multiple call_rcu()
invocations.  If this is causing a problem for a given call site, the
shiny new call_rcu_hurry() can be used instead.  Doing this gets back
to the old-school non-laziness, but can of course consume more power.
That would not be the case because CONFIG_LAZY_RCU was not an option
at the time I was profiling this issue.
Lazy RCU would be a great option to replace this patch but
unfortunately it's not the default behavior, so I would still have to
implement this batching in case lazy RCU is not enabled.

Second, there is a much shorter one-jiffy delay between the call_rcu()
and the invocation of the corresponding callback in kernels built with
either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
on CPUs mentioned in the rcu_nocbs kernel boot parameters).  The purpose
of this delay is to avoid lock contention, and so this delay is incurred
only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
invoking call_rcu() at least 16 times within a given jiffy will incur
the added delay.  The reason for this delay is the use of a separate
->nocb_bypass list.  As Suren says, this bypass list is used to reduce
lock contention on the main ->cblist.  This is not needed in old-school
kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
(including most datacenter kernels) because in that case the callbacks
enqueued by call_rcu() are touched only by the corresponding CPU, so
that there is no need for locks.
I believe this is the reason in my profiled case.
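
To make the per-jiffy arithmetic above concrete, here is a minimal userspace sketch. It is not the kernel's implementation: the names `nobypass_limit_per_jiffy`, `cb_rate_model`, and `model_enqueue_uses_bypass` are hypothetical, and reading "16K/second" as 16,000 is an assumption. It only models the threshold described above: count the enqueues within the current jiffy and report when that count reaches the per-jiffy limit.

```c
#include <stdbool.h>

/* Hypothetical userspace model of the rate check described in the text.
 * Assumption: "16K/second" is taken to mean 16,000 callbacks per second. */
#define NOBYPASS_RATE_PER_SEC 16000UL

/* Convert the per-second threshold into a per-jiffy limit for a given HZ. */
static unsigned long nobypass_limit_per_jiffy(unsigned long hz)
{
	return NOBYPASS_RATE_PER_SEC / hz;
}

/* Per-CPU bookkeeping in this model: which jiffy we are counting in,
 * and how many enqueues have been seen during it. */
struct cb_rate_model {
	unsigned long last_jiffy;
	unsigned long count;
};

/* Record one modeled call_rcu() at time `jiffy`.  Returns true once the
 * number of calls within that jiffy has reached the limit, which is the
 * point where the text says the added delay would be incurred. */
static bool model_enqueue_uses_bypass(struct cb_rate_model *m,
				      unsigned long jiffy, unsigned long hz)
{
	if (jiffy != m->last_jiffy) {
		/* New jiffy: restart the count. */
		m->last_jiffy = jiffy;
		m->count = 0;
	}
	m->count++;
	return m->count >= nobypass_limit_per_jiffy(hz);
}
```

Under these assumptions, HZ=1000 gives a limit of 16 per jiffy, matching the figure above, and HZ=250 would give 64. The first 15 calls within one jiffy stay under the limit, the 16th reaches it, and the count starts over on the next jiffy.
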
Third, if you are instead seeing multiple milliseconds of CPU consumed by
call_rcu() in the common case (for example, without the aid of interrupts,
NMIs, or SMIs), please do let me know.  That sounds to me like a bug.
I don't think I've seen such a case.
Thanks for the clarifications, Paul!
Thanks for the explanation, Paul. I have to say this has caught me by
surprise. There are just not enough details about the benchmark to
understand what is going on but I find it rather surprising that
call_rcu can induce a higher overhead than the actual kmem_cache_free
which is the callback. My naive understanding has been that call_rcu is
a really fast way to defer the execution to the RCU safe context to do the
final cleanup.
If I am following along correctly (ha!), then your "induce a higher
overhead" should be something like "induce a higher to-kfree() latency".
Yes, this is expected.
Of course, there already is a higher latency-to-kfree via call_rcu()
than via a direct call to kfree(), and callback-offload CPUs that are
being flooded with callbacks raise that latency a jiffy or so more in
order to avoid lock contention.

If this becomes a problem, the callback-offloading code can be a bit
smarter about avoiding lock contention, but I need to see a real problem
before I make that change.  But if there is a real problem I will of
course fix it.
I believe that Suren claims that the call_rcu is really visible in the
exit_mmap case. Time-to-free actual vmas shouldn't really be material
for that path. If that happens much later on, there could be some
side effects from increased memory consumption, but that should be
marginal. How fast exit_mmap really is should only depend on direct
calls from that path.

But I guess we need some specific numbers from Suren to be sure what is
going on here.
Actually, Suren did discuss these (perhaps offlist) back in August.
I was just being forgetful.  :-/

							Thanx, Paul