Re: [PATCH RFC net-next 2/2] xsk: introduce a cached cq to temporarily store... | netdev

Re: [PATCH RFC net-next 2/2] xsk: introduce a cached cq to temporarily store descriptor addrs

From: Jason Xing <hidden>
Date: 2025-11-01 00:00:13
Also in: bpf

On Fri, Oct 31, 2025 at 10:02 PM Maciej Fijalkowski
[off-list ref] wrote:

On Fri, Oct 31, 2025 at 05:32:30PM +0800, Jason Xing wrote:

quoted

From: Jason Xing <kernelxing@tencent.com>

Before the commit 30f241fcf52a ("xsk: Fix immature cq descriptor
production"), there is one issue[1] which causes the wrong publish
of descriptors in race condidtion. The above commit fixes the issue
but adds more memory operations in the xmit hot path and interrupt
context, which can cause side effect in performance.

This patch tries to propose a new solution to fix the problem
without manipulating the allocation and deallocation of memory. One
of the key points is that I borrowed the idea from the above commit
that postpones updating the ring->descs in xsk_destruct_skb()
instead of in __xsk_generic_xmit().

The core logics are as show below:
1. allocate a new local queue. Only its cached_prod member is used.
2. write the descriptors into the local queue in the xmit path. And
   record the cached_prod as @start_addr that reflects the
   start position of this queue so that later the skb can easily
   find where its addrs are written in the destruction phase.
3. initialize the upper 24 bits of destructor_arg to store @start_addr
   in xsk_skb_init_misc().
4. Initialize the lower 8 bits of destructor_arg to store how many
   descriptors the skb owns in xsk_update_num_desc().
5. write the desc addr(s) from the @start_addr from the cached cq
   one by one into the real cq in xsk_destruct_skb(). In turn sync
   the global state of the cq.

The format of destructor_arg is designed as:
 ------------------------ --------
|       start_addr       |  num   |
 ------------------------ --------
Using upper 24 bits is enough to keep the temporary descriptors. And
it's also enough to use lower 8 bits to show the number of descriptors
that one skb owns.

[1]: https://lore.kernel.org/all/20250530095957.43248-1-e.kubanski@partner.samsung.com/ (local)

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
I posted the series as an RFC because I'd like to hear more opinions on
the current rought approach so that the fix[2] can be avoided and
mitigate the impact of performance. This patch might have bugs because
I decided to spend more time on it after we come to an agreement. Please
review the overall concepts. Thanks!

Maciej, could you share with me the way you tested jumbo frame? I used
./xdpsock -i enp2s0f1 -t -q 1 -S -s 9728 but the xdpsock utilizes the
nic more than 90%, which means I cannot see the performance impact.

Could you provide the command you used? Thanks :)

quoted

[2]:https://lore.kernel.org/all/20251030140355.4059-1-fmancera@suse.de/ (local)
---
 include/net/xdp_sock.h      |   1 +
 include/net/xsk_buff_pool.h |   1 +
 net/xdp/xsk.c               | 104 ++++++++++++++++++++++++++++--------
 net/xdp/xsk_buff_pool.c     |   1 +
 4 files changed, 84 insertions(+), 23 deletions(-)

(...)

quoted

diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index aa9788f20d0d..6e170107dec7 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c

@@ -99,6 +99,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,

      pool->fq = xs->fq_tmp;
      pool->cq = xs->cq_tmp;
+     pool->cached_cq = xs->cached_cq;

Jason,

pool can be shared between multiple sockets that bind to same <netdev,qid>
tuple. I believe here you're opening up for the very same issue Eryk
initially reported.

Actually it shouldn't happen because the cached_cq is more of the
temporary array that helps the skb store its start position. The
cached_prod of cached_cq can only be increased, not decreased. In the
skb destruction phase, only those skbs that go to the end of life need
to sync its desc from cached_cq to cq. For some skbs that are released
before the tx completion, we don't need to clear its record in
cached_cq at all and cq remains untouched.

To put it in a simple way, the patch you proposed uses kmem_cached*
helpers to store the addr and write the addr into cq at the end of
lifecycle while the current patch uses a pre-allocated memory to
store. So it avoids the allocation and deallocation.

Unless I'm missing something important. If so, I'm still convinced
this temporary queue can solve the problem since essentially it's a
better substitute for kmem cache to retain high performance.

Thanks,
Jason

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help