RE: [PATCH net 1/7] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
From: "Fijalkowski, Maciej" <maciej.fijalkowski@intel.com>
Date: 2026-06-26 13:43:14
Also in: bpf
On Fri, Jun 26, 2026 at 7:12 PM Larysa Zaremba [off-list ref] wrote:
On Tue, Jun 23, 2026 at 03:32:34PM +0200, Maciej Fijalkowski wrote:
From: Jason Xing <kernelxing@tencent.com>

This patch is inspired by the check[1] from sashiko. It says that when overflow happens, the address of cq to be published is invalid. Actually, the more severe problem is that the whole process of publishing the address of cq in this particular case is not right: it should truly publish the address and advance the cached_prod in cq as long as it reads descriptors from txq. The following is the full analysis.

xsk_drop_skb() is called in three places, which all discard a partially built multi-buffer skb:

1) xsk_build_skb() -EOVERFLOW error path: packet exceeds MAX_SKB_FRAGS
2) __xsk_generic_xmit() post-loop cleanup: an invalid descriptor in the TX ring prevents the partial packet from completing
3) xsk_release(): socket close while xs->skb holds an incomplete packet

In all three cases, the TX descriptors for the already-processed frags have been consumed from the TX ring (xskq_cons_release), and CQ slots have been reserved. However, xsk_drop_skb() calls xsk_consume_skb(), which cancels the CQ reservations via xsk_cq_cancel_locked(). Since the buffer addresses never appear in the completion queue, userspace permanently loses track of these buffers.

Fix this by letting consume_skb() trigger the existing xsk_destruct_skb destructor, which already submits buffer addresses to the CQ via xsk_cq_submit_addr_locked().

Note that cancelling the descriptors back to the TX ring (via xskq_cons_cancel_n) is not an appropriate option because an oversized packet that always exceeds MAX_SKB_FRAGS would be retried indefinitely, which is an obvious deadlock bug in the TX path.

Also move the desc->addr assignment in xsk_build_skb() above the overflow check so that the current descriptor's address is recorded before a potential -EOVERFLOW jump to free_err, consistent with the zerocopy path in xsk_build_skb_zerocopy().

[1]: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
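The accounting argument in the commit message can be sketched with a small userspace model. This is only an illustration under stated assumptions, not the kernel implementation: struct toy_cq and every toy_* helper are hypothetical names, the ring is reduced to three counters, and userspace never consumes entries. It compares cancelling the reservations on drop with submitting the recorded addresses on drop.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_RING_SIZE 8u

/* Hypothetical stand-in for a completion queue: cached_prod counts
 * reserved slots, producer counts entries visible to userspace, and
 * consumer counts entries userspace has read (never advanced here).
 */
struct toy_cq {
	uint64_t addrs[TOY_RING_SIZE];
	uint32_t cached_prod;
	uint32_t producer;
	uint32_t consumer;
};

/* Reserve one slot; fails when the ring has no free space. */
static int toy_cq_reserve(struct toy_cq *cq)
{
	if (cq->cached_prod - cq->consumer >= TOY_RING_SIZE)
		return -1;
	cq->cached_prod++;
	return 0;
}

/* Give back n reservations; nothing becomes visible to userspace. */
static void toy_cq_cancel(struct toy_cq *cq, uint32_t n)
{
	cq->cached_prod -= n;
}

/* Write n addresses into already-reserved slots and publish them. */
static void toy_cq_submit_addrs(struct toy_cq *cq, const uint64_t *addrs,
				uint32_t n)
{
	for (uint32_t i = 0; i < n; i++)
		cq->addrs[(cq->producer + i) % TOY_RING_SIZE] = addrs[i];
	cq->producer += n;
}

/* Drop a partially built 3-frag packet and return how many buffer
 * addresses userspace can still reclaim from the toy CQ afterwards.
 */
static uint32_t toy_drop_partial_packet(int submit_on_drop)
{
	static const uint64_t frag_addrs[3] = { 0x1000, 0x2000, 0x3000 };
	struct toy_cq cq = { 0 };

	for (uint32_t i = 0; i < 3; i++) {
		int err = toy_cq_reserve(&cq); /* cannot fail: 3 < ring size */

		assert(err == 0);
		(void)err;
	}

	if (submit_on_drop)
		toy_cq_submit_addrs(&cq, frag_addrs, 3);
	else
		toy_cq_cancel(&cq, 3);

	return cq.producer - cq.consumer;
}
```

In this model toy_drop_partial_packet(0) returns 0, so all three buffers are lost to userspace, while toy_drop_partial_packet(1) returns 3, which matches the behavior the patch is aiming for.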
This change looks good, but the overflow case with only 1 descriptor worries me.

I presume you referred to xsk_build_skb_zerocopy()?

In such cases, once we get to the following code, kfree_skb() has already happened:

	if (err == -EOVERFLOW) {
		if (xs->skb) {
			/* Drop the packet */
			xsk_inc_num_desc(xs->skb);
			xsk_drop_skb(xs->skb);
		} else {
			xsk_cq_cancel_locked(xs->pool, 1);
			xs->tx->invalid_descs++;
		}
		xskq_cons_release(xs->tx);
	}

kfree_skb() should have resulted in submission of the single fat descriptor to xsk_cq_submit_addr_locked() via xsk_destruct_skb(), so far consistent with the multi-descriptor behavior you are proposing here. But what happens when we cancel a submitted CQ slot via xsk_cq_cancel_locked(xs->pool, 1) in the above code?

At least, in the NO_LINEAR case, xsk_skb_init_misc() is not called since the OVERFLOW skips this function, which means kfree_skb() doesn't invoke xsk_destruct_skb() to publish it in the CQ. So it's safe to cancel the cq reservation (in xsk_cq_cancel_locked(xs->pool, 1)).

(responding from outlook so apologies for any broken formatting)

Yes, I have the same understanding here. However, how technically possible would it be to produce > MAX_SKB_FRAGS from a single AF_XDP descriptor? I know Sashiko has pointed this out and you came up with the previous fix, but for a valid descriptor it is simply not possible. And invalid descs wouldn't reach this function.

I wouldn't like to stir the pot too much, so let us keep this code, but is there any way to give Sashiko additional context? I mean, for a case where we would say *this can't happen*, will it be able to carry this information onwards?

Thanks,
Jason
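The single-descriptor question (is it safe to cancel one CQ slot after the skb has already been freed?) can be illustrated with a toy model. This is a sketch under explicit assumptions, not kernel code: struct toy_cq_acct, struct toy_skb and the toy_* functions are hypothetical, and the model assumes that freeing the skb publishes its descriptors only when a destructor had been installed beforehand.

```c
#include <stdbool.h>

/* Hypothetical CQ bookkeeping: 'reserved' counts slots that are
 * reserved but not yet published, 'published' counts slots that
 * userspace can see. Signed so an over-cancel shows up as -1.
 */
struct toy_cq_acct {
	int reserved;
	int published;
};

/* Hypothetical skb: only what the accounting depends on. */
struct toy_skb {
	bool destructor_set;
	int num_descs;
};

/* Freeing publishes the descriptors only if a destructor is set. */
static void toy_kfree_skb(struct toy_cq_acct *cq, const struct toy_skb *skb)
{
	if (!skb->destructor_set)
		return;
	cq->reserved -= skb->num_descs;
	cq->published += skb->num_descs;
}

/* Single-descriptor overflow: reserve one slot, free the skb in the
 * build helper, then unconditionally cancel one slot in the caller.
 */
static struct toy_cq_acct toy_single_desc_overflow(bool destructor_set)
{
	struct toy_cq_acct cq = { 0, 0 };
	struct toy_skb skb = { destructor_set, 1 };

	cq.reserved += 1;         /* slot reserved before building */
	toy_kfree_skb(&cq, &skb); /* error path frees the skb */
	cq.reserved -= 1;         /* caller cancels one reservation */
	return cq;
}
```

With destructor_set == false the model ends at reserved == 0 and published == 0, so the cancel is balanced. With destructor_set == true it ends at reserved == -1 and published == 1, which is the double accounting the question is about. Whether the real overflow path can ever reach the second state depends on where the destructor is installed, as discussed in the thread for the NO_LINEAR case.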
Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index b970f30ea9b9..a7a83dc4546a 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -794,8 +794,11 @@ static void xsk_consume_skb(struct sk_buff *skb)
 
 static void xsk_drop_skb(struct sk_buff *skb)
 {
-	xdp_sk(skb->sk)->tx->invalid_descs += xsk_get_num_desc(skb);
-	xsk_consume_skb(skb);
+	struct xdp_sock *xs = xdp_sk(skb->sk);
+
+	xs->tx->invalid_descs += xsk_get_num_desc(skb);
+	consume_skb(skb);
+	xs->skb = NULL;
 }
 
 static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
@@ -877,7 +880,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
 			return ERR_PTR(-ENOMEM);
 
 		/* in case of -EOVERFLOW that could happen below,
-		 * xsk_consume_skb() will release this node as whole skb
+		 * xsk_drop_skb() will release this node as whole skb
 		 * would be dropped, which implies freeing all list elements
 		 */
 		xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
@@ -969,6 +972,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 				goto free_err;
 			}
 
+			xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
+
 			if (unlikely(nr_frags == (MAX_SKB_FRAGS - 1) && xp_mb_desc(desc))) {
 				err = -EOVERFLOW;
 				goto free_err;
@@ -986,8 +991,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 
 			skb_add_rx_frag(skb, nr_frags, page, 0, len, PAGE_SIZE);
 			refcount_add(PAGE_SIZE, &xs->sk.sk_wmem_alloc);
-
-			xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
 		}
 	}
 
-- 
2.43.0