Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free
From: "Paul E. McKenney" <paulmck@kernel.org>
Date: 2023-01-18 18:35:11
Also in: linux-arm-kernel, linux-mm, lkml
On Wed, Jan 18, 2023 at 10:04:39AM -0800, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko wrote:
> >
> > On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko wrote:
> > > >
> > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > call_rcu() can take a long time when callback offloading is enabled. Its use in the vm_area_free can cause regressions in the exit path when multiple VMAs are being freed.
> > > >
> > > > What kind of regressions.
> > > >
> > > > > To minimize that impact, place VMAs into a list and free them in groups using one call_rcu() call per group.
> > > >
> > > > Please add some data to justify this additional complexity.
> > >
> > > Sorry, should have done that in the first place. A 4.3% regression was noticed when running execl test from unixbench suite. spawn test also showed 1.6% regression. Profiling revealed that vma freeing was taking longer due to call_rcu() which is slow when RCU callback offloading is enabled.
> >
> > Could you be more specific? vma freeing is async with the RCU so how come this has resulted in a regression? Is there any heavy rcu_synchronize in the exec path? That would be an interesting information.
>
> No, there is no heavy rcu_synchronize() or any other additional synchronous load in the exit path. It's the call_rcu() which can block the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of other call_rcu()'s going on in parallel. Note that call_rcu() calls rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling revealed that this function was taking multiple ms (don't recall the actual number, sorry). Paul's explanation implied that this happens due to contention on the locks taken in this function. For more in-depth details I'll have to ask Paul for help :) This code is quite complex and I don't know all the details of RCU implementation.
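For readers following the quoted patch description, the idea of freeing in groups with one deferred-free call per group can be sketched in userspace C. This is only an illustration under stated assumptions, not the patch's code: `BATCH_SIZE`, `struct free_batch`, `batch_add()`, `batch_flush()`, and `submit_deferred_free()` are hypothetical names, and `submit_deferred_free()` merely counts invocations where the kernel would issue a single call_rcu() for the whole group.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_SIZE 32	/* hypothetical group size, not taken from the patch */

struct item {
	struct item *next;
};

struct free_batch {
	struct item *head;	/* items waiting to be handed off as one group */
	int count;		/* number of items currently in the group */
	int deferred_calls;	/* how many times the group hand-off was invoked */
};

/* Stand-in for one call_rcu() covering the whole group (assumption). */
static void submit_deferred_free(struct free_batch *b)
{
	b->deferred_calls++;
	b->head = NULL;
	b->count = 0;
}

/* Queue one item; hand the group off only when it is full. */
static void batch_add(struct free_batch *b, struct item *it)
{
	assert(b != NULL && it != NULL);
	it->next = b->head;
	b->head = it;
	if (++b->count >= BATCH_SIZE)
		submit_deferred_free(b);
}

/* Hand off a partially filled group, for example at the end of the exit path. */
static void batch_flush(struct free_batch *b)
{
	if (b->count > 0)
		submit_deferred_free(b);
}
```

With these assumed values, queuing 100 items and then flushing results in 4 hand-offs instead of 100, which is the reduction in call_rcu() invocations that the patch description is after.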
There are a couple of possibilities here.

First, if I am remembering correctly, the time between the call_rcu() and invocation of the corresponding callback was taking multiple seconds, but that was because the kernel was built with CONFIG_RCU_LAZY=y in order to save power by batching RCU work over multiple call_rcu() invocations. If this is causing a problem for a given call site, the shiny new call_rcu_hurry() can be used instead. Doing this gets back to the old-school non-laziness, but can of course consume more power.

Second, there is a much shorter one-jiffy delay between the call_rcu() and the invocation of the corresponding callback in kernels built with either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only on CPUs mentioned in the rcu_nocbs kernel boot parameters). The purpose of this delay is to avoid lock contention, and so this delay is incurred only on CPUs that are queuing callbacks at a rate exceeding 16K/second. This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU invoking call_rcu() at least 16 times within a given jiffy will incur the added delay.

The reason for this delay is the use of a separate ->nocb_bypass list. As Suren says, this bypass list is used to reduce lock contention on the main ->cblist. This is not needed in old-school kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y (including most datacenter kernels) because in that case the callbacks enqueued by call_rcu() are touched only by the corresponding CPU, so that there is no need for locks.

Third, if you are instead seeing multiple milliseconds of CPU consumed by call_rcu() in the common case (for example, without the aid of interrupts, NMIs, or SMIs), please do let me know. That sounds to me like a bug.

Or have I lost track of some other slow case?

Thanx, Paul
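The per-jiffy arithmetic in the second case above can be worked through with a small userspace C sketch. It assumes the 16K/second threshold means 16 * 1000 invocations per second, as the HZ=1000 example implies. The helper names `nobypass_limit_per_jiffy()` and `would_incur_delay()` are hypothetical and are not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Rate threshold from the mail, assumed here to be 16 * 1000 per second. */
#define NOBYPASS_LIMIT_PER_SEC (16 * 1000)

/* Hypothetical helper: reduce the per-second threshold to a per-jiffy limit. */
static int nobypass_limit_per_jiffy(int hz)
{
	assert(hz > 0);
	return NOBYPASS_LIMIT_PER_SEC / hz;
}

/*
 * Hypothetical helper: true when a CPU has invoked call_rcu() at least
 * the per-jiffy limit number of times within the current jiffy, which is
 * when the mail says the added one-jiffy delay is incurred.
 */
static bool would_incur_delay(int calls_this_jiffy, int hz)
{
	return calls_this_jiffy >= nobypass_limit_per_jiffy(hz);
}
```

Under this assumption the limit is 16 calls per jiffy at HZ=1000, 64 at HZ=250, and 160 at HZ=100, so a lower HZ tolerates proportionally more call_rcu() invocations per jiffy before the delay applies.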