Re: [PATCH v6 bpf] xsk: fix immature cq descriptor production

From: Jason Xing <hidden>
Date: 2025-08-27 00:26:03
Also in: bpf

On Wed, Aug 27, 2025 at 3:13 AM Maciej Fijalkowski
[off-list ref] wrote:

On Tue, Aug 26, 2025 at 09:03:45PM +0200, Maciej Fijalkowski wrote:

quoted

On Tue, Aug 26, 2025 at 08:23:04PM +0200, Magnus Karlsson wrote:

quoted

On Tue, 26 Aug 2025 at 18:07, Jason Xing [off-list ref] wrote:

quoted

On Wed, Aug 20, 2025 at 11:49 PM Maciej Fijalkowski
[off-list ref] wrote:

quoted

Eryk reported an issue that I have put under Closes: tag, related to
umem addrs being prematurely produced onto pool's completion queue.
Let us make the skb's destructor responsible for producing all addrs
that given skb used.

Introduce struct xsk_addrs which will carry descriptor count with array
of addresses taken from processed descriptors that will be carried via
skb_shared_info::destructor_arg. This way we can refer to it within
xsk_destruct_skb(). In order to mitigate the overhead that will be
coming from memory allocations, let us introduce kmem_cache of
xsk_addrs. There will be a single kmem_cache for xsk generic xmit on the
system.

Commit from fixes tag introduced the buggy behavior, it was not broken
from day 1, but rather when xsk multi-buffer got introduced.

Fixes: b7f72a30e9ac ("xsk: introduce wrappers and helpers for supporting multi-buffer in Tx path")
Reported-by: Eryk Kubanski <redacted>
Closes: https://lore.kernel.org/netdev/20250530103456.53564-1-e.kubanski@partner.samsung.com/ (local)
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---

v1:
https://lore.kernel.org/bpf/20250702101648.1942562-1-maciej.fijalkowski@intel.com/ (local)
v2:
https://lore.kernel.org/bpf/20250705135512.1963216-1-maciej.fijalkowski@intel.com/ (local)
v3:
https://lore.kernel.org/bpf/20250806154127.2161434-1-maciej.fijalkowski@intel.com/ (local)
v4:
https://lore.kernel.org/bpf/20250813171210.2205259-1-maciej.fijalkowski@intel.com/ (local)
v5:
https://lore.kernel.org/bpf/aKXBHGPxjpBDKOHq@boxer/T/ (local)

v1->v2:
* store addrs in array carried via destructor_arg instead having them
  stored in skb headroom; cleaner and less hacky approach;
v2->v3:
* use kmem_cache for xsk_addrs allocation (Stan/Olek)
* set err when xsk_addrs allocation fails (Dan)
* change xsk_addrs layout to avoid holes
* free xsk_addrs on error path
* rebase
v3->v4:
* have kmem_cache as percpu vars
* don't drop unnecessary braces (unrelated) (Stan)
* use idx + i in xskq_prod_write_addr (Stan)
* alloc kmem_cache on bind (Stan)
* keep num_descs as first member in xsk_addrs (Magnus)
* add ack from Magnus
v4->v5:
* have a single kmem_cache per xsk subsystem (Stan)
v5->v6:
* free skb in xsk_build_skb_zerocopy() when xsk_addrs allocation fails
  (Stan)
* unregister netdev notifier if creating kmem_cache fails (Stan)

---
 net/xdp/xsk.c       | 95 +++++++++++++++++++++++++++++++++++++--------
 net/xdp/xsk_queue.h | 12 ++++++
 2 files changed, 91 insertions(+), 16 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 9c3acecc14b1..989d5ffb4273 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c

@@ -36,6 +36,13 @@
 #define TX_BATCH_SIZE 32
 #define MAX_PER_SOCKET_BUDGET 32

+struct xsk_addrs {
+       u32 num_descs;
+       u64 addrs[MAX_SKB_FRAGS + 1];
+};
+
+static struct kmem_cache *xsk_tx_generic_cache;

IMHO, adding a few heavy operations of allocating and freeing from
cache in the hot path is not a good choice. What I've been trying so
hard lately is to minimize the times of manipulating memory as much as
possible :( Memory hotspot can be easily captured by perf.

We might provide an new option in setsockopt() to let users
specifically support this use case since it does harm to normal cases?

Agree with you that we should not harm the normal case here. Instead
of introducing a setsockopt, how about we detect the case when this
can happen in the code? If I remember correctly, it can only occur in
the XDP_SHARED_UMEM mode were the xsk pool is shared between
processes. If this can be tested (by introducing a new bit in the xsk
pool if that is necessary), we could have two potential skb
destructors: the old one for the "normal" case and the new one with
the list of addresses to complete (using the expensive allocations and
deallocations) when it is strictly required i.e., when the xsk pool is
shared. Maciej, you are more in to the details of this, so what do you
think? Would something like this be a potential path forward?

Meh, i was focused on 9k mtu impact, it was about 5% on my machine but now
i checked small packets and indeed i see 12-14% perf regression.

I'll look into this so Daniel, for now let's drop this unfortunate
patch...

One more thing - Jason, you still need to focus your work on this approach
where we produce cq entries from destructor. I just need to come up with
smarter way of producing descs to be consumed by destructor :<

No problem. Before getting to that batch feature, I'm dealing with
AF_PACKET implementation first which probably takes much time than
needed.

Thanks,
Jason

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help