Re: [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim

From: Stanislav Fomichev <hidden>
Date: 2026-06-25 16:05:46
Also in: bpf

On 06/25, Jason Xing wrote:

On Thu, Jun 25, 2026 at 12:37 AM Maciej Fijalkowski
[off-list ref] wrote:

On Wed, Jun 24, 2026 at 08:38:20AM -0700, Stanislav Fomichev wrote:

On 06/23, Maciej Fijalkowski wrote:

Hi,

This series fixes several AF_XDP multi-buffer Tx paths where descriptors
consumed from the Tx ring are not consistently returned to userspace
through the completion ring when the packet is later dropped as invalid.

The affected cases are invalid or oversized multi-buffer Tx packets in
both the generic and zero-copy paths. In these cases, the kernel can
consume one or more Tx descriptors while building or validating a
multi-buffer packet, then drop the packet before it reaches the device.
Userspace regains ownership of the UMEM buffers only after the
corresponding addresses are returned through the CQ. Missing completions
therefore make userspace lose track of those buffers.

The generic path fixes cover three related cases:
* partially built multi-buffer skbs dropped by xsk_drop_skb();
* continuation descriptors left in the Tx ring after xsk_build_skb()
  reports overflow;
* invalid descriptors encountered in the middle of a multi-buffer
  packet, including the offending invalid descriptor itself.

The zero-copy path is handled separately. The batched Tx parser now
distinguishes descriptors that can be passed to the driver from
descriptors that are consumed only because they belong to an invalid
multi-buffer packet. Reclaim-only descriptors are written to the CQ
address area and published in completion order, after any earlier
driver-visible Tx descriptors.
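
A rough illustration of the split described above, as a standalone toy
model rather than the code from the series. Every name here
(toy_zc_desc, toy_zc_classify, TOY_ZC_PKT_CONT, the per-descriptor
"valid" flag) is hypothetical, and the sketch assumes the continuation
flag can still be trusted on an invalid descriptor in order to find the
end of the packet. A packet whose end is not yet in the batch is marked
incomplete, which roughly corresponds to the drain state discussed
below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ZC_PKT_CONT 0x1u /* hypothetical stand-in for the continuation flag */

enum toy_zc_class {
	TOY_ZC_XMIT,         /* may be handed to the driver */
	TOY_ZC_RECLAIM_ONLY, /* part of an invalid packet, goes straight to the CQ */
	TOY_ZC_INCOMPLETE,   /* end of packet not provided by userspace yet */
};

struct toy_zc_desc {
	uint64_t addr;
	uint32_t options;
	bool valid; /* result of per-descriptor validation (assumed given) */
};

/*
 * Classify a batch packet by packet. One invalid descriptor makes every
 * descriptor of its packet reclaim-only. Returns the number of
 * reclaim-only descriptors.
 */
static size_t toy_zc_classify(const struct toy_zc_desc *descs, size_t n,
			      enum toy_zc_class *cls)
{
	size_t reclaim = 0;
	size_t start = 0;

	assert(n == 0 || (descs && cls));

	while (start < n) {
		size_t end = start;
		bool pkt_valid = true;
		bool complete = false;
		enum toy_zc_class c;

		/* Walk to the end of this packet or to the end of the batch. */
		while (end < n && !complete) {
			if (!descs[end].valid)
				pkt_valid = false;
			complete = !(descs[end].options & TOY_ZC_PKT_CONT);
			end++;
		}

		if (!complete)
			c = TOY_ZC_INCOMPLETE;
		else
			c = pkt_valid ? TOY_ZC_XMIT : TOY_ZC_RECLAIM_ONLY;

		for (size_t i = start; i < end; i++)
			cls[i] = c;
		if (c == TOY_ZC_RECLAIM_ONLY)
			reclaim += end - start;

		start = end;
	}

	return reclaim;
}
```

The sketch only covers classification. Ordering the reclaim-only
addresses behind earlier driver-visible completions in the CQ is left
out, since it depends on details of the series that are not shown in
this thread.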

The ZC batching path can also retain drain state when userspace has not
yet provided the end of an invalid multi-buffer packet. To keep this
state local to the singular batched path, the series prevents a second
Tx socket from joining the same pool while such drain state exists.
During the singular-to-shared transition, Tx batching is gated,
pre-existing readers are waited out, and bind fails with -EAGAIN if the
existing socket still has pending drain state. This avoids adding
multi-buffer drain handling to the shared-UMEM fallback path.

The last two patches update xskxceiver so the tests account invalid
multi-buffer Tx packets as descriptors that must be reclaimed, while
still not expecting those invalid packets on the Rx side.

This is a follow-up to Jason's changes [0], which addressed generic
xmit only; this set allows me to pass a full xskxceiver test suite run
against the ice driver.

There is a fair amount of feedback from sashiko already :-( So the meta
question from me is: is it time to scrap our current approach where
we parse descriptor by descriptor? (and maintain half-baked skb and
half-consumed descriptor queues)

Should we:

1. do desc[MAX_SKB_FRAGS] and xskq_cons_peek_desc until we exhaust
PKT_CONT (if the last packet has PKT_CONT, return EOVERFLOW to userspace
and do a full stop here)
2. now that we really know the number of valid descriptors -> reserve
the cq space (if not -> EAGAIN)
3. pre-allocate everything here (if at any point we have ENOMEM -> cleanup
locally, don't ever create semi-initialized skb)
4. construct the skb
5. xmit
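
To make the proposed ordering concrete, here is a minimal userspace
sketch of steps 1 and 2 (peek a whole packet, then reserve CQ space)
with steps 3 to 5 left as a comment. It is a toy model, not kernel code:
toy_desc, toy_txq, toy_cq, toy_gather_packet, TOY_MAX_FRAGS and
TOY_PKT_CONT are all hypothetical names, the ring is a flat array with
no wraparound, and it reads "the last packet has PKT_CONT" as "the last
gathered descriptor still has PKT_CONT". Returning -EAGAIN when the ring
runs out before the end of the packet is an assumption, not something
stated above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_FRAGS 4    /* hypothetical stand-in for MAX_SKB_FRAGS */
#define TOY_PKT_CONT  0x1u /* hypothetical stand-in for PKT_CONT */

struct toy_desc { uint64_t addr; uint32_t len; uint32_t options; };
struct toy_txq  { const struct toy_desc *ring; size_t cons; size_t prod; };
struct toy_cq   { size_t free_slots; };

/*
 * Gather one full packet without consuming anything until it is known
 * to be complete and CQ space is available. Returns the number of
 * descriptors committed, 0 if the ring is empty, or a negative errno.
 * On any error the Tx ring and the CQ are left untouched.
 */
static int toy_gather_packet(struct toy_txq *txq, struct toy_cq *cq,
			     struct toy_desc out[TOY_MAX_FRAGS])
{
	size_t avail, n = 0;
	bool complete = false;

	assert(txq->prod >= txq->cons);
	avail = txq->prod - txq->cons;

	/* Step 1: peek descriptors until the continuation flag is clear. */
	while (n < avail && n < TOY_MAX_FRAGS && !complete) {
		out[n] = txq->ring[txq->cons + n];
		complete = !(out[n].options & TOY_PKT_CONT);
		n++;
	}

	if (n == 0)
		return 0;
	if (!complete)
		return n == TOY_MAX_FRAGS ? -EOVERFLOW : -EAGAIN;

	/* Step 2: the descriptor count is known, so reserve CQ space now. */
	if (cq->free_slots < n)
		return -EAGAIN;

	/*
	 * Steps 3-5 (allocate, build the skb, xmit) would go here. A
	 * failure at this point would return before the commit below.
	 */

	/* Commit: only now do the consumer index and the CQ change. */
	cq->free_slots -= n;
	txq->cons += n;
	return (int)n;
}
```

The point of the shape is that every failure path returns before the
commit at the bottom, so there is never a half-consumed packet to clean
up afterwards.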

Yeah, generic xmit became utterly horrible. I haven't gone through the
sashiko reviews yet, but bear in mind this set also aligns the zc side
with what was previously addressed by Jason.

I believe the planned logistics were to get these fixes into net; Jason
then had an implementation of batching on generic xmit, directed towards
-next, and that's where we could address the current flow.

Agreed. That's what I'm hoping for. There would be much more
discussion on how to do batch xmit in an elegant way, I believe.

This doesn't have to depend on the batch rewrite; we should be able to
rewrite the non-zc path in net. These are still technically fixes, not
feature work.

There have already been a couple of revisions with this drain_cont
approach, and every time I look at it, it feels like the cure is worse
than the disease :-( Obviously not gonna stop you from going with the
current approach, but these fixes feel like a bit of a wasted effort to
me (since the bugs keep coming and we are piling on more complexity).
