Re: [RFC 4/4] net/ipv4/fib: Don't synchronise_rcu() every 512Kb
From: Dmitry Safonov <hidden>
Date: 2019-03-26 23:14:52
Also in:
lkml
On 3/26/19 3:39 PM, David Ahern wrote:
On 3/26/19 9:30 AM, Dmitry Safonov wrote:quoted
Fib trie has a hard-coded sync_pages limit to call synchronise_rcu(). The limit is 128 pages or 512Kb (considering common case with 4Kb pages). Unfortunately, at Arista we have use-scenarios with full view software forwarding. At the scale of 100K and more routes even on 2 core boxes the hard-coded limit starts actively shooting in the leg: lockup detector notices that rtnl_lock is held for seconds. First reason is previously broken MAX_WORK, that didn't limit pending balancing work. While fixing it, I've noticed that the bottle-neck is actually in the number of synchronise_rcu() calls. I've tried to fix it with a patch to decrement number of tnodes in rcu callback, but it hasn't much affected performance. One possible way to "fix" it - provide another sysctl to control sync_pages, but in my POV it's nasty - exposing another realisation detail into user-space.well, that was accepted last week. ;-) commit 9ab948a91b2c2abc8e82845c0e61f4b1683e3a4f Author: David Ahern [off-list ref] Date: Wed Mar 20 09:18:59 2019 -0700 ipv4: Allow amount of dirty memory from fib resizing to be controllable Can you see how that change (should backport easily) affects your test case? From my perspective 16MB was the sweet spot.
FWIW, I would like to +Cc Paul here.
TLDR; we're looking with David into ways to improve a hardcoded limit
tnode_free_size at net/ipv4/fib_trie.c: currently it's way too low
(512Kb). David created a patch to provide sysctl that controls the limit
and it would solve a problem for both of us. In parallel, I thought that
exposing this to userspace is not much fun and added a shrinker with
synchronize_rcu(). I'm not any sure that the latter is actually a sane
solution..
Is there any guarantee that memory to-be freed by call_rcu() will get
freed in OOM conditions? Might there be a chance that we don't need any
limit here at all?
Worth to mention that I don't argue David's patch as I pointed that it
would (will) solve the problem for us both, but with good intentions
wondering if we can do something here rather a new sysctl knob.
Thanks,
Dmitry