Re: [net-next v6 0/9] Add support for per-NAPI config via netlink
From: Joe Damato <hidden>
Date: 2024-12-18 17:09:03
Also in: linux-doc, linux-rdma, lkml
On Wed, Dec 18, 2024 at 11:22:33AM +0000, Alex Lazar wrote:
> Hi Joe and all,
>
> I am part of the NVIDIA Eth drivers team, and we are experiencing a problem,
> bisected to this change: commit 86e25f40aa1e ("net: napi: Add napi_config")
>
> The issue occurs when sending packets from one machine to another.
> On the receiver side, we have XSK (XDPsock) that receives the packet and sends it
> back to the sender.
> At some point, one packet (packet A) gets "stuck," and if we send a new packet
> (packet B), it "pushes" the previous one. Packet A is then processed by the NAPI
> poll, and packet B gets stuck, and so on.
>
> Your change involves moving napi_hash_del() and napi_hash_add() from
> netif_napi_del() and netif_napi_add_weight() to napi_enable() and napi_disable().
> If I move them back to netif_napi_del() and netif_napi_add_weight(),
> the issue is resolved (I moved the entire if/else block, not just the napi_hash_del/add).
>
> This issue occurs with both the new and old APIs (netif_napi_add/_config).
> Moving the napi_hash_add() and napi_hash_del() functions resolves it for both.
>
> I am debugging this, no breakthrough so far.
> I would appreciate if you could look into this.
> We can provide more details per request.

I appreciate your report, but there is not a lot in your message to
help debug the issue.
Can you please:
1.) Verify that the kernel tree you are testing on has commit
    cecc1555a8c2 ("net: Make napi_hash_lock irq safe") included? If it
    does not, can you pull in that commit and re-run your test and
    report back if that fixes your problem?

2.) If (1) does not fix your problem, can you please reply with at
    least the following information:
      - Specify what device this is happening on (in case I have access
        to one)
      - Which driver is affected
      - Which upstream kernel SHA you are building your test kernel from
      - The reproducer program(s) with clear instructions on how exactly
        to run it/them in order to reproduce the issue
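
For (1), one possible way to check whether a tree contains that commit
is git merge-base --is-ancestor. This is only a minimal sketch: it
assumes the test tree is a git checkout that shares upstream history,
and has_commit is a hypothetical helper name used for illustration.

```shell
# has_commit REPO_DIR COMMIT
# Exits 0 if COMMIT is an ancestor of HEAD in REPO_DIR, non-zero if it
# is not an ancestor or is unknown to that repository.
# NOTE: has_commit is a made-up helper name, not an existing tool.
has_commit() {
    git -C "$1" merge-base --is-ancestor "$2" HEAD
}

# Example, run from the top of the kernel tree under test:
#   if has_commit . cecc1555a8c2; then
#       echo "cecc1555a8c2 is included"
#   else
#       echo "cecc1555a8c2 is not an ancestor of HEAD"
#   fi
```

If the fix was cherry-picked or backported it will have a different
SHA, so the ancestry check fails even though the change is present. In
that case searching by subject may help, for example:
git log --oneline --grep='net: Make napi_hash_lock irq safe'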
Thanks,
Joe