Thread (29 messages), 5 authors, 2025-11-20

Re: [PATCH net-next v4 00/14] netkit: Support for io_uring zero-copy and AF_XDP

From: David Wei <hidden>
Date: 2025-11-05 00:43:55
Also in: bpf

On 2025-11-04 15:22, Stanislav Fomichev wrote:
> On 10/31, Daniel Borkmann wrote:
>> Containers use virtual netdevs to route traffic from a physical netdev
>> in the host namespace. They do not have access to the physical netdev
>> in the host and thus can't use memory providers or AF_XDP that require
>> reconfiguring/restarting queues in the physical netdev.
>>
>> This patchset adds the concept of queue peering to virtual netdevs that
>> allow containers to use memory providers and AF_XDP at native speed.
>> These mapped queues are bound to a real queue in a physical netdev and
>> act as a proxy.
>>
>> Memory providers and AF_XDP operations take an ifindex and queue id,
>> so containers would pass in an ifindex for a virtual netdev and a queue
>> id of a mapped queue, which then gets proxied to the underlying real
>> queue. Peered queues are created and bound to a real queue atomically
>> through a generic ynl netdev operation.
>>
>> We have implemented support for this concept in netkit and tested the
>> latter against Nvidia ConnectX-6 (mlx5) as well as Broadcom BCM957504
>> (bnxt_en) 100G NICs. For more details see the individual patches.
>>
>> v3->v4:
>>   - ndo_queue_create store dst queue via arg (Nikolay)
>>   - Small nits like a spelling issue + rev xmas (Nikolay)
>>   - admin-perm flag in bind-queue spec (Jakub)
>>   - Fix potential ABBA deadlock situation in bind (Jakub, Paolo, Stan)
>>   - Add a peer dev_tracker to not reuse the sysfs one (Jakub)
>>   - New patch (12/14) to handle the underlying device going away (Jakub)
>>   - Improve commit message on queue-get (Jakub)
>>   - Do not expose phys dev info from container on queue-get (Jakub)
>>   - Add netif_put_rx_queue_peer_locked to simplify code (Stan)
>>   - Rework xsk handling to simplify the code and drop a few patches
>>   - Rebase and retested everything with mlx5 + bnxt_en
>
> I mostly looked at patches 1-8 and they look good to me. Will it be
> possible to put your sample runs from 13 and 14 into a selftest form? Even
> if you require real hw, that should be doable, similar to
> tools/testing/selftests/drivers/net/hw/devmem.py, right?

Thanks for taking a look. For io_uring at least, it requires both a
routable VIP that can be assigned to the netkit in a netns and a BPF
program for skb forwarding. I could add a selftest, but it'll be hard to
generalise across all envs. I'm hoping to get self-contained QEMU VM
selftest support first. WDYT?
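
To make the (ifindex, queue id) addressing from the cover letter
concrete, here is a minimal sketch, in Python purely for illustration,
of the address an AF_XDP socket hands to bind(). The field layout
follows struct sockaddr_xdp from include/uapi/linux/if_xdp.h. The
helper name and the ifindex/queue id values are made up, and nothing
here is specific to this patchset or shows how netkit proxies the
queue.

```python
import struct

AF_XDP = 44  # address family number, as in include/linux/socket.h


def pack_sockaddr_xdp(ifindex, queue_id, flags=0, shared_umem_fd=0):
    """Pack the fields of struct sockaddr_xdp in native byte order.

    Layout: __u16 sxdp_family, __u16 sxdp_flags, __u32 sxdp_ifindex,
            __u32 sxdp_queue_id, __u32 sxdp_shared_umem_fd
    """
    return struct.pack("=HHIII", AF_XDP, flags, ifindex, queue_id,
                       shared_umem_fd)


# Hypothetical values: ifindex 7 standing in for a virtual netdev inside
# the container, queue id 2 standing in for a mapped (peered) queue.
addr = pack_sockaddr_xdp(ifindex=7, queue_id=2)
print(len(addr), struct.unpack("=HHIII", addr))
```

With queue peering as described, the container would fill in the
virtual netdev's ifindex and the mapped queue's id here, rather than
anything belonging to the physical device.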