Re: [PATCH net v4 0/5] xsk: fix meta and publish of cq issues

From: Jason Xing <hidden>
Date: 2026-05-26 23:27:38
Also in: bpf

On Wed, May 27, 2026 at 3:43 AM Maciej Fijalkowski
[off-list ref] wrote:

On Sat, May 23, 2026 at 07:49:00AM +0800, Jason Xing wrote:

quoted

On Sat, May 23, 2026 at 2:34 AM Maciej Fijalkowski
[off-list ref] wrote:

quoted

On Fri, May 22, 2026 at 09:48:39PM +0800, Jason Xing wrote:

quoted

On Fri, May 22, 2026 at 4:55 PM Jason Xing [off-list ref] wrote:

quoted

On Thu, May 21, 2026 at 10:24 PM Maciej Fijalkowski
[off-list ref] wrote:

quoted

On Thu, May 21, 2026 at 09:07:30PM +0800, Jason Xing wrote:

quoted

On Thu, May 21, 2026 at 9:00 PM Maciej Fijalkowski
[off-list ref] wrote:

quoted

On Thu, May 21, 2026 at 08:41:08PM +0800, Jason Xing wrote:

quoted

On Thu, May 21, 2026 at 8:24 PM Maciej Fijalkowski
[off-list ref] wrote:

quoted

On Wed, May 20, 2026 at 08:42:39AM +0800, Jason Xing wrote:

quoted

From: Jason Xing <kernelxing@tencent.com>

This series is the product of a previous review from sashiko[1].

1) META
patch 1: address TOCTOU around metadata.

2) PUBLISH of CQ
patch 2: make sure xsk_addr->addrs[] can be published to cq when
         overflow occurs.
patch 3: keep cleaning up the continuation descs (more than 17) and
         publish their addresses when overflow occurs.
patch 4: like patch 3, but only handles the invalid desc case.

[1]: https://lore.kernel.org/all/20260502200722.53960-1-kerneljasonxing@gmail.com/ (local)
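
For readers unfamiliar with the TOCTOU issue that patch 1 targets, here is a minimal userspace sketch of the general pattern: read each user-writable field exactly once, validate the local copy, and use only that copy afterwards. The struct layout, the names `tx_meta`, `csum_req`, `snapshot_csum` and the `READ_ONCE_U16` macro are assumptions made for illustration, not the code from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's READ_ONCE(): force a single load. */
#define READ_ONCE_U16(x) (*(const volatile uint16_t *)&(x))

/* Hypothetical layout of Tx metadata living in user-writable memory. */
struct tx_meta {
	uint16_t csum_start;
	uint16_t csum_offset;
};

/* Validated copy that later code uses instead of rereading shared memory. */
struct csum_req {
	uint16_t start;
	uint16_t offset;
};

/*
 * Snapshot both fields once, validate the snapshot, and hand back only the
 * snapshot. Returns false if the 2-byte checksum field would fall outside
 * the packet.
 */
static bool snapshot_csum(const struct tx_meta *shared, uint32_t pkt_len,
			  struct csum_req *out)
{
	uint16_t start = READ_ONCE_U16(shared->csum_start);
	uint16_t offset = READ_ONCE_U16(shared->csum_offset);

	if ((uint32_t)start + offset + 2u > pkt_len)
		return false;

	out->start = start;
	out->offset = offset;
	return true;
}
```

Because the caller only ever sees `out`, a concurrent write to the shared metadata after the check cannot change the values that were validated.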

---
V4
Link: https://lore.kernel.org/all/20260517063311.28921-1-kerneljasonxing@gmail.com/ (local)
1. correct the description of xmit path in patch 3 (sashiko)
2. move set logic into xmit path in patch 3 (Stan)

V3
Link: https://lore.kernel.org/all/20260515123018.80147-1-kerneljasonxing@gmail.com/ (local)
1. avoid breaking previous usage of sendto, and silently handle the
overflow case (Stan, sashiko)
2. add one particular exception process in patch 4 (sashiko)
3. adjust the selftest to make sure it passes on either virtual or
physical machines, which includes adding usleep to support physical machines.

V2
Link: https://lore.kernel.org/all/20260510012310.88570-1-kerneljasonxing@gmail.com/ (local)
1. adjust selftests (Jakub)
2. add READ_ONCE in patch 1 (Stan)

FWIW I still get test failures (yes with patch 5 applied). PTAL.

Thanks for the test. But I've tried with ixgbe driver...

I noticed there are some flaky tests which have nothing to do with the
series. Can you confirm that the failures are not caused by the series?

That explains the different results as i am using i40e/ice which have
multi-buffer support whereas ixgbe does not even support mbuf at XDP.
Broken tests are from mbuf cases.

That's weird. I never expected the failed tests to be about multi-buffer.

Are they the same as the output you attached last time? Or something
new? Could you please share it so that I can investigate the root
cause?

[...]

quoted

Sorry, Maciej. I managed to get one server with i40e nic but still
couldn't reproduce it. Can you try the attachment (that is the
replacement for v4-0005) instead? I removed those nasty CONT test
cases...

Ah, I think I eventually figured out a solution. Maciej, could you
please test the 2nd patch instead?

This patch reworks the CONTD test cases. Fingers crossed.

Please don't rush things here, I believe we need to think a bit more here.
I have second thoughts about overall approach.

My understanding wrt CQ was that it is a container that holds descriptors
which have been successfully transmitted. Now we want to add also leftover
descriptors from broken packets, which might confuse user space sides in
case they were relying on behavior described above.

The intent is right of course as we don't want to lose UMEM descs, but I
feel like we need a separate mechanism for that rather than putting
invalid descs to CQ.

I don't sense anything strange here if we stick to putting those
invalid/overflowed descriptors into the CQ. AF_XDP is only a tunnel that
transfers the data. That's it. A bit like how the physical link works,
which means it possibly drops data because of congestion.

The upper protocol is what guarantees when to (re)transmit a packet;
in TCP's case the mechanism is ACK-driven. TCP is absolutely
capable of detecting such an abnormal event by checking the
seq of the incoming ack. My takeaway from this is that we don't need to
deliberately design new stuff to fulfill direct and immediate
communication.

AF_XDP is often used for L2/L3 forwarding, UDP, custom transports, so I
was afraid some existing solutions might be relying on CQ entries implying
successful Tx.

Generally this issue is highly unlikely yet a thing we need to address so
let's follow your approach, but for that we need to update documentation
and align ZC side so that we would not have to deviate test cases.

Sorry, I don't see why we need to align the ZC side rather than
simply skipping them in ZC mode.

quoted

CQ works somehow as a notification that tells user space whether the
kernel receives the data from the app and handles them. Without
putting them in the CQ, the only thing for userspace to do is simply
wait.

I don't follow the last part of the sentence but let's disregard it.
Userspace gets errno/retcode in generic xmit so it is aware of underlying
issues and then it's the app's job to act upon it.

Right.

quoted

With that said, IMHO, I cannot figure out why we need a separate queue
or something like that. Of course, a new notification that handles all
the possible/potential exceptions and contributes to the performance
of the upper layer is worth a try :) The latter is crucial.

We discussed with Magnus it would be good to have a dedicated xsk_queue
stat for that case, such as 'oversized_descs' which would be bumped by the
amount of descs produced to CQ.

That is surely necessary, which was actually put into my todo list as
well as skb drop reason in the xmit path. But it's a net-next/bpf-next
material.

quoted

Does it make sense?

Besides, even though we would stay with proposed changes, behavior between
modes should be aligned. Right now ZC seems to be broken in touched
regions here - when we hit the limit of frags via pool->xdp_zc_max_segs,
we break the loop and discard the packet, never post it to CQ and these
descs are lost from user space POV. Then we would continue on next call
and interpret the rest of too big packet as a separate one (clamped) and
therefore submit corrupted packet to HW.
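
The failure mode described above, and the draining idea behind patches 3 and 4, can be sketched as a small userspace simulation. This is an illustration under stated assumptions, not the kernel code: `DESC_CONT`, `struct desc` and `consume_packet` are hypothetical names, and the flat `cq[]` array stands in for the completion queue. A real implementation also has to handle a packet whose final fragment has not been posted yet, which this sketch ignores.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DESC_CONT 0x1u /* hypothetical flag: more fragments follow */

struct desc {
	uint64_t addr;
	uint32_t options;
};

/*
 * Consume one packet starting at ring[*pos]. If the packet has more than
 * max_segs fragments, keep consuming its continuation descriptors so the
 * next call starts at a real packet boundary. Every consumed address is
 * appended to cq[] so the buffers are returned to user space.
 *
 * Returns true if the packet fit within max_segs, false if it was dropped.
 */
static bool consume_packet(const struct desc *ring, size_t n, size_t *pos,
			   uint32_t max_segs, uint64_t *cq, size_t *cq_len)
{
	uint32_t segs = 0;
	bool more = true;

	while (more && *pos < n) {
		const struct desc *d = &ring[(*pos)++];

		cq[(*cq_len)++] = d->addr;
		segs++;
		more = (d->options & DESC_CONT) != 0;
	}
	return segs <= max_segs;
}
```

Stopping at `max_segs` without draining would leave `*pos` in the middle of the oversized packet, so the next call would treat its tail as a fresh packet, which is the corruption described above.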

Right, this is how the previous selftests changes pollute the
subsequent tests after that. I think the new version of the attachment
should pass all the tests since I put all the CONTD tests separately
into another two functions? It's pointless to test those in the zc
mode.

We need analogous fix on ZC, then no such quirks in tests should exist.

I don't regard it as a quirk honestly. I'm curious what is the
advantage of unifying it just for this case? ZC mode doesn't need to
test the overflowed skb that appears in the __xsk_generic_xmit path,
right?

The test is doing the same thing regardless of the underlying mode. Only the
range differs (MAX_SKB_FRAGS vs pool->xdp_zc_max_segs).

To wrap up, I see it like this, moving forward:
1. fix docs
2. add ss stat
3. wait for me with ZC fixes (I'm slow!)
4. inspect if tests will fly

Let me know your thoughts! Maybe Stan wants to chime in?

Roughly I agree with the points if you insist on working on ZC fixes.
I can wait for you :)

My plan is as follows:
1. submit only the 0001 patch in v5
2. wait for your selftests fixes and then submit that series
3. move on with batch xmit since I don't expect this series to be a
blocking point

Thanks,
Jason

Maciej

quoted

As to the series, if no objections or any suggestions jump into the
thread, I'll post the series within a week.

Thanks,
Jason

quoted

I'll be looking at ZC API but i do think we need a common approach,
mode-agnostic.

Thanks,
Maciej

quoted

Thanks,
Jason

quoted

Really, I don't think I have much time to spend on these tests, which
makes me feel extremely annoyed... It's not easy to analyze the code
without a reproducer. The good news is that now I highly suspect that
these CONT test cases pollute the whole cq, which affects other
tests. Before I give up on the 0003/0004 patches, I'd like to hear
some advice from you. Thank you.

My original intention was to push batch xmit forward but at that time
sashiko pointed out some unrelated bugs accidentally.

Thanks,
Jason

quoted

Thanks,
Jason

quoted

Thanks,
Jason

quoted


Jason Xing (5):
  xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
  xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
  xsk: drain continuation descs after overflow in xsk_build_skb()
  xsk: drain continuation descs on invalid descriptor in
    __xsk_generic_xmit()
  selftests/xsk: drain CQ to wait for TX completion

 include/net/xdp_sock.h                        |  1 +
 net/xdp/xsk.c                                 | 44 +++++++++++++----
 .../selftests/bpf/prog_tests/test_xsk.c       | 48 +++++++++++--------
 3 files changed, 63 insertions(+), 30 deletions(-)

--
2.43.7
