Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
From: "Paul E. McKenney" <paulmck@kernel.org>
Date: 2023-01-20 16:08:37
Also in:
linux-arm-kernel, linux-mm, lkml
On Fri, Jan 20, 2023 at 09:57:05AM +0100, Michal Hocko wrote:
> On Thu 19-01-23 11:17:07, Paul E. McKenney wrote:
> > On Thu, Jan 19, 2023 at 01:52:14PM +0100, Michal Hocko wrote:
> > > On Wed 18-01-23 11:01:08, Suren Baghdasaryan wrote:
> > > > On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney [off-list ref] wrote:
> > > > > [...]
> > > > > There are a couple of possibilities here. First, if I am remembering correctly, the time between the call_rcu() and invocation of the corresponding callback was taking multiple seconds, but that was because the kernel was built with CONFIG_LAZY_RCU=y in order to save power by batching RCU work over multiple call_rcu() invocations. If this is causing a problem for a given call site, the shiny new call_rcu_hurry() can be used instead. Doing this gets back to the old-school non-laziness, but can of course consume more power.
> > > >
> > > > That would not be the case because CONFIG_LAZY_RCU was not an option at the time I was profiling this issue. Lazy RCU would be a great option to replace this patch but unfortunately it's not the default behavior, so I would still have to implement this batching in case lazy RCU is not enabled.
> > > >
> > > > > Second, there is a much shorter one-jiffy delay between the call_rcu() and the invocation of the corresponding callback in kernels built with either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only on CPUs mentioned in the rcu_nocbs kernel boot parameters). The purpose of this delay is to avoid lock contention, and so this delay is incurred only on CPUs that are queuing callbacks at a rate exceeding 16K/second. This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU invoking call_rcu() at least 16 times within a given jiffy will incur the added delay. The reason for this delay is the use of a separate ->nocb_bypass list. As Suren says, this bypass list is used to reduce lock contention on the main ->cblist. This is not needed in old-school kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y (including most datacenter kernels) because in that case the callbacks enqueued by call_rcu() are touched only by the corresponding CPU, so that there is no need for locks.
> > > >
> > > > I believe this is the reason in my profiled case.
> > > >
> > > > > Third, if you are instead seeing multiple milliseconds of CPU consumed by call_rcu() in the common case (for example, without the aid of interrupts, NMIs, or SMIs), please do let me know. That sounds to me like a bug.
> > > >
> > > > I don't think I've seen such a case. Thanks for clarifications, Paul!
> > >
> > > Thanks for the explanation Paul. I have to say this has caught me as a surprise. There are just not enough details about the benchmark to understand what is going on but I find it rather surprising that call_rcu can induce a higher overhead than the actual kmem_cache_free which is the callback. My naive understanding has been that call_rcu is a really fast way to defer the execution to the RCU safe context to do the final cleanup.
> >
> > If I am following along correctly (ha!), then your "induce a higher overhead" should be something like "induce a higher to-kfree() latency".
>
> Yes, this is expected.
>
> > Of course, there already is a higher latency-to-kfree via call_rcu() than via a direct call to kfree(), and callback-offload CPUs that are being flooded with callbacks raise that latency a jiffy or so more in order to avoid lock contention. If this becomes a problem, the callback-offloading code can be a bit smarter about avoiding lock contention, but need to see a real problem before I make that change. But if there is a real problem I will of course fix it.
>
> I believe that Suren claims that the call_rcu is really visible in the exit_mmap case. Time-to-free actual vmas shouldn't really be material for that path. If that happens much later on there could be some side effects by an increased memory consumption but that should be marginal. How fast exit_mmap really is should only depend on direct calls from that path. But I guess we need some specific numbers from Suren to be sure what is going on here.

Actually, Suren did discuss these (perhaps offlist) back in August.
I was just being forgetful. :-/

							Thanx, Paul
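
For readers following the arithmetic in the quoted text: the 16K/second threshold, reduced to a per-jiffy limit, works out to 16 * 1000 / HZ enqueues per jiffy, which is 16 on a HZ=1000 system. The userspace sketch below only models that counting rule. It is not the kernel's implementation, and the names `bypass_model`, `limit_per_jiffy()` and `enqueue_uses_bypass()` are hypothetical, introduced purely for illustration. As a simplifying assumption, the model resets its counter on every new jiffy.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, not kernel code: per-CPU state for the rate check. */
struct bypass_model {
	unsigned long last_jiffy;	/* jiffy in which count was last updated */
	unsigned long count;		/* enqueues seen so far in that jiffy */
};

/* 16K callbacks/second expressed as a per-jiffy limit: 16 * 1000 / HZ. */
static unsigned long limit_per_jiffy(unsigned long hz)
{
	assert(hz > 0 && hz <= 16000);	/* keep the limit at 1 or more */
	return 16UL * 1000UL / hz;
}

/*
 * Record one enqueue at time "now" (in jiffies).  Returns true once the
 * number of enqueues within the current jiffy reaches the limit, which
 * is the point at which the quoted text says the added delay applies.
 * Assumption: the counter simply restarts on each new jiffy.
 */
static bool enqueue_uses_bypass(struct bypass_model *m, unsigned long now,
				unsigned long hz)
{
	if (now != m->last_jiffy) {
		m->last_jiffy = now;
		m->count = 0;
	}
	m->count++;
	return m->count >= limit_per_jiffy(hz);
}
```

In this model, with HZ=1000 the first 15 enqueues in a jiffy take the direct path and the 16th crosses the threshold. With HZ=250 the per-jiffy limit is 64, which is the same 16K/second rate.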