RE: [PATCH net 1/7] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx

From: "Fijalkowski, Maciej" <maciej.fijalkowski@intel.com>
Date: 2026-06-26 13:43:14
Also in: bpf

On Fri, Jun 26, 2026 at 7:12 PM Larysa Zaremba [off-list ref]
wrote:

On Tue, Jun 23, 2026 at 03:32:34PM +0200, Maciej Fijalkowski wrote:

From: Jason Xing <kernelxing@tencent.com>

This patch is inspired by the check[1] from sashiko. It says when
overflow happens, the address of cq to be published is invalid.
Actually the more severe problem is that the whole process of publishing
the address of cq in this particular case is not right: it should truly
publish the address and advance the cached_prod in cq as long as it
reads descriptors from txq.

The following is the full analysis.
xsk_drop_skb() is called in three places, which all discard a partially
built multi-buffer skb:
1) xsk_build_skb() -EOVERFLOW error path: packet exceeds
   MAX_SKB_FRAGS
2) __xsk_generic_xmit() post-loop cleanup: an invalid descriptor in
   the TX ring prevents the partial packet from completing
3) xsk_release(): socket close while xs->skb holds an incomplete packet

In all three cases, the TX descriptors for the already-processed frags
have been consumed from the TX ring (xskq_cons_release), and CQ slots
have been reserved. However, xsk_drop_skb() calls xsk_consume_skb()
which cancels the CQ reservations via xsk_cq_cancel_locked(). Since
the buffer addresses never appear in the completion queue, userspace
permanently loses track of these buffers.

Fix this by letting consume_skb() trigger the existing xsk_destruct_skb
destructor, which already submits buffer addresses to the CQ via
xsk_cq_submit_addr_locked().
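
The leak and the fix described above can be sketched with a toy accounting
model. This is not kernel code: struct toy_pool and every toy_* function are
hypothetical names invented for illustration, and the model only assumes that
each consumed TX descriptor holds one reserved CQ slot until the slot is
either cancelled or published.

```c
#include <assert.h>

/* Toy model only: none of these types or helpers exist in the kernel. */
#define TOY_RING_SIZE 8

struct toy_pool {
	unsigned int cq_reserved;              /* CQ slots reserved, not yet published */
	unsigned int cq_published;             /* addresses visible to userspace */
	unsigned long long cq[TOY_RING_SIZE];  /* published buffer addresses */
};

/* Reserve one CQ slot for a descriptor taken from the TX ring. */
static int toy_cq_reserve(struct toy_pool *p)
{
	if (p->cq_reserved + p->cq_published >= TOY_RING_SIZE)
		return -1;
	p->cq_reserved++;
	return 0;
}

/* Old drop behaviour: hand the reservations back and publish nothing. */
static void toy_drop_cancel(struct toy_pool *p, unsigned int n)
{
	assert(n <= p->cq_reserved);
	p->cq_reserved -= n;
}

/* New drop behaviour: publish every recorded address, as a destructor would. */
static void toy_drop_submit(struct toy_pool *p,
			    const unsigned long long *addrs, unsigned int n)
{
	assert(n <= p->cq_reserved);
	for (unsigned int i = 0; i < n; i++)
		p->cq[p->cq_published++] = addrs[i];
	p->cq_reserved -= n;
}

/* Buffers that were consumed from TX but can never reach userspace again. */
static unsigned int toy_leaked(const struct toy_pool *p, unsigned int consumed)
{
	return consumed - p->cq_published - p->cq_reserved;
}
```

In this model, reserving three slots and then calling toy_drop_cancel() leaves
three consumed buffers unaccounted for, while toy_drop_submit() leaves none.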

Note that cancelling the descriptors back to the TX ring (via
xskq_cons_cancel_n) is not an appropriate option because an oversized
packet that always exceeds MAX_SKB_FRAGS would be retried indefinitely,
which is an obvious deadlock bug in the TX path.
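
Why requeueing cannot work may be easier to see in a small simulation. The
sketch below is a toy, not the kernel transmit loop: toy_xmit(), enum
toy_policy and TOY_MAX_FRAGS are hypothetical, and the frag limit is an
arbitrary stand-in for MAX_SKB_FRAGS. It assumes the oversized packet sits at
the head of the TX ring and that each attempt starts from the current
consumer index.

```c
#include <assert.h>

#define TOY_MAX_FRAGS 4	/* arbitrary stand-in for MAX_SKB_FRAGS */

enum toy_policy { TOY_REQUEUE, TOY_DROP };

/*
 * Return the TX consumer index after at most `attempts` transmit attempts
 * on a ring whose head packet spans `pkt_descs` descriptors.
 */
static unsigned int toy_xmit(unsigned int pkt_descs, enum toy_policy policy,
			     unsigned int attempts)
{
	unsigned int cons = 0;

	while (attempts--) {
		if (cons >= pkt_descs)
			break;			/* ring drained */
		if (pkt_descs <= TOY_MAX_FRAGS) {
			cons = pkt_descs;	/* packet fits and is sent */
			continue;
		}
		if (policy == TOY_DROP)
			cons = pkt_descs;	/* consume and count as invalid */
		/* TOY_REQUEUE: cons is unchanged, so the same packet is seen again */
	}
	return cons;
}
```

With the requeue policy the consumer index never moves for an oversized
packet, no matter how many attempts are allowed; with the drop policy it
advances past the packet on the first attempt.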

Also move the desc->addr assignment in xsk_build_skb() above the
overflow check so that the current descriptor's address is recorded
before a potential -EOVERFLOW jump to free_err, consistent with the
zerocopy path in xsk_build_skb_zerocopy().
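
The effect of the reordering can be illustrated with a toy model. All names
here (struct toy_addrs, toy_add_frag(), toy_build(), TOY_MAX_FRAGS) are
hypothetical, and the model assumes, based on the error path quoted later in
this thread, that the descriptor which triggers the overflow is still counted
before the packet is dropped, so its slot in the address array gets published.

```c
#include <assert.h>

#define TOY_MAX_FRAGS 4	/* arbitrary stand-in for MAX_SKB_FRAGS */

struct toy_addrs {
	unsigned long long addrs[TOY_MAX_FRAGS + 1];
	unsigned int num_descs;
};

/* Add one frag; returns -1 on overflow. record_first selects the ordering. */
static int toy_add_frag(struct toy_addrs *a, unsigned int nr_frags,
			unsigned long long addr, int more, int record_first)
{
	if (record_first)
		a->addrs[a->num_descs] = addr;	/* patched order */
	if (nr_frags == TOY_MAX_FRAGS - 1 && more)
		return -1;			/* caller drops the packet */
	if (!record_first)
		a->addrs[a->num_descs] = addr;	/* original order */
	return 0;
}

/*
 * Feed n descriptors with addresses base, base + 1, ...; every descriptor
 * except the last one has the "more" flag set. Returns how many addresses
 * would be published when the packet completes or is dropped.
 */
static unsigned int toy_build(struct toy_addrs *a, unsigned int n,
			      unsigned long long base, int record_first)
{
	a->num_descs = 0;
	for (unsigned int i = 0; i <= TOY_MAX_FRAGS; i++)
		a->addrs[i] = 0;	/* 0 marks "never recorded" */

	for (unsigned int i = 0; i < n; i++) {
		int err = toy_add_frag(a, i, base + i, i + 1 < n, record_first);

		a->num_descs++;	/* counted on both the success and error path */
		if (err)
			break;
	}
	return a->num_descs;
}
```

With the original ordering the slot for the overflowing descriptor is counted
but never written, so a stale value would be published; with the patched
ordering it holds the descriptor's real address.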

[1]: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/

This change looks good, but the overflow case with only 1 descriptor worries
me.

I presume you referred to xsk_build_skb_zerocopy()?

In such cases, once we get to the following code, kfree_skb() has already
happened:

        if (err == -EOVERFLOW) {
                if (xs->skb) {
                        /* Drop the packet */
                        xsk_inc_num_desc(xs->skb);
                        xsk_drop_skb(xs->skb);
                } else {
                        xsk_cq_cancel_locked(xs->pool, 1);
                        xs->tx->invalid_descs++;
                }
                xskq_cons_release(xs->tx);
        }

kfree_skb() should have resulted in submission of the single fat descriptor to
xsk_cq_submit_addr_locked() via xsk_destruct_skb(), so far consistent with
the

At least, in the NO_LINEAR case, xsk_skb_init_misc() is not called
since the OVERFLOW skips this function, which means kfree_skb()
doesn't invoke xsk_destruct_skb() to publish it in the CQ. So it's
safe to cancel the cq reservation (in xsk_cq_cancel_locked(xs->pool,
1)).

(responding from outlook so apologies for any broken formatting)

Yes, I have the same understanding here. However, how would it be
technically possible to produce more than MAX_SKB_FRAGS frags from a
single AF_XDP descriptor?

I know Sashiko has pointed this out and you came up with the previous
fix, but for a valid descriptor it is simply not possible. And invalid
descs wouldn't reach this function.

I wouldn't like to stir the pot too much so let us keep this
code, but is there any way to give Sashiko additional context?
I mean, for a case where we would say *this can't happen*, will
it be able to carry this information onwards?

Thanks,
Jason

multi-descriptor behavior you are proposing here.

But what happens when we cancel a submitted CQ slot via
xsk_cq_cancel_locked(xs->pool, 1) in the above code?

Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index b970f30ea9b9..a7a83dc4546a 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c

@@ -794,8 +794,11 @@ static void xsk_consume_skb(struct sk_buff *skb)
 static void xsk_drop_skb(struct sk_buff *skb)
 {
-     xdp_sk(skb->sk)->tx->invalid_descs += xsk_get_num_desc(skb);
-     xsk_consume_skb(skb);
+     struct xdp_sock *xs = xdp_sk(skb->sk);
+
+     xs->tx->invalid_descs += xsk_get_num_desc(skb);
+     consume_skb(skb);
+     xs->skb = NULL;
 }

 static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,

@@ -877,7 +880,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
                      return ERR_PTR(-ENOMEM);

              /* in case of -EOVERFLOW that could happen below,
-              * xsk_consume_skb() will release this node as whole skb
+              * xsk_drop_skb() will release this node as whole skb
               * would be dropped, which implies freeing all list elements
               */
              xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;

@@ -969,6 +972,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
                              goto free_err;
                      }

+                     xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
+
                      if (unlikely(nr_frags == (MAX_SKB_FRAGS - 1) && xp_mb_desc(desc))) {
                              err = -EOVERFLOW;
                              goto free_err;

@@ -986,8 +991,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
                      skb_add_rx_frag(skb, nr_frags, page, 0, len, PAGE_SIZE);
                      refcount_add(PAGE_SIZE, &xs->sk.sk_wmem_alloc);
-
-                     xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
              }
      }

--
2.43.0
