Re: [PATCH net-next v2 5/9] tun: use bulk NAPI cache allocation in tun_xdp_one
From: Jon Kohler <hidden>
Date: 2025-12-03 15:36:00
Also in:
bpf, lkml
On Dec 3, 2025, at 3:47 AM, Sebastian Andrzej Siewior [off-list ref] wrote: !-------------------------------------------------------------------| CAUTION: External Email |-------------------------------------------------------------------! On 2025-12-02 18:32:23 [+0100], Jesper Dangaard Brouer wrote:quoted
quoted
quoted
quoted
+ napi_consume_skb(skb, 1);I wonder if this would have any side effects since tun_xdp_one() is not called by a NAPI.As far as I can tell, this napi_consume_skb is really just an artifact of how it was named and how it was traditionally used. Now this is really just a napi_consume_skb within a bh disable/enable section, which should meet the requirements of how that interface should be used (again, AFAICT)Yicks - this sounds super ugly. Just wrapping napi_consume_skb() in bh disable/enable section and then assuming you get the same protection as NAPI is really dubious. Cc Sebastian as he is trying to cleanup these kind of use-case, to make kernel preemption work.I am actually done with this. Wrapping napi_consume_skb(, 1) in bh-disable basically does the trick if called from outside-bh section as long as it is not an IRQ section. The reason is that the skb-head is cached in a per-CPU cache which accessed only within softirq/ NAPI context. So you can "return" skbs in NET_TX and have some around in NET_RX. Otherwise skb is returned directly to the slab allocator. If this about skb recycling, you using page_pool might help. This however also expects NAPI/ BH disabled context.quoted
quoted
quoted
quoted
@@ -2576,13 +2583,24 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len) rcu_read_lock(); bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx); - for (i = 0; i < n; i++) { + num_skbs = napi_skb_cache_get_bulk(skbs, n);Its document said: """ * Must be called *only* from the BH context. “"”We’re in a bh_disable section here, is that not good enough?Again this feels very ugly and prone to painting ourselves into a corner, assuming BH-disabled sections have same protection as NAPI. (The napi_skb_cache_get/put function are operating on per CPU arrays without any locking.)This is okay. NAPI means BH is disabled. Nothing more. There are a few implications to it. The default path is process-context (kernel or userland) * IRQ * -> irq is handled via its handler with disabled interrupts -> handler raises NET_RX aka NAPI -> irq core is done with IRQ handling and notices softirqs have been raised. Disables BH and starts handling softirqs with enabled interrupts before returning back before the interruption. -> softirqs are handled with with BH disabled. * IRQ * fires again. -> irq is handled as previously and NET_RX is set again. -> irq core returns back to previously handled softirqs -> Once NET_RX is done, softirq core would be done and return back but since it noticed that NET_RX is pending (again) it does another round. This is how it normally works. If you disable-bh in process context (either manually via local_bh_disable() or via spin_lock_bh()) then you enter BH context. There is hardly a difference (in_serving_softirq() will report a different value but this should not matter to anyone outside the core code). Any IRQ that raises NET_RX here will not lead to handling softirqs because BH is disabled (this maps the "IRQ fires again" case from above). This is delayed until local_bh_enable(). Therefore protecting the per-CPU array with local_bh_disable() is okay but for PREEMPT_RT reasons, per-CPU data needs this local_lock_nested_bh() around it (as napi_skb_cache_get/put does).
Thanks, Sebastian - so if I’m reading this correct, it *is* fine to do the two following patterns, outside of NAPI: local_bh_disable(); skb = napi_build_skb(buf, len); local_bh_enable(); local_bh_disable(); napi_consume_skb(skb, 1); local_bh_enable(); If so, I wonder if it would be cleaner to have something like build_skb_bh(buf, len); consume_skb_bh(skb, 1); Then have those methods handle the local_bh enable/disable, so that the toggle was a property of a call, not a requirement of the call? Similar in concept to ieee80211_rx_ni() defined in net/mac80211.h (and there are a few others with this sort of pattern, like netif_tx_disable())