Re: [PATCH net-next 2/2] net: devmem: use niov array for token management

From: Bobby Eshleman <hidden>
Date: 2025-09-08 16:16:13
Also in: lkml

On Fri, Sep 05, 2025 at 01:38:39PM -0700, Stanislav Fomichev wrote:

On 09/05, Mina Almasry wrote:

quoted

On Wed, Sep 3, 2025 at 5:15 PM Bobby Eshleman [off-list ref] wrote:

quoted

On Wed, Sep 03, 2025 at 01:20:57PM -0700, Mina Almasry wrote:

quoted

On Tue, Sep 2, 2025 at 2:36 PM Bobby Eshleman [off-list ref] wrote:

[...]

quoted

AFAIU, if you made sk_user_frags an array of (unref, binding) tuples
instead of just an array of urefs then you can remove the
single-binding restriction.

Although, I wonder what happens if the socket receives the netmem at
the same index on 2 different dmabufs. At that point I assume the
wrong uref gets incremented? :(

Right. We need some bits to differentiate bindings. Here are some ideas
I've had about this, I wonder what your thoughts are on them:

1) Encode a subset of bindings and wait for availability if the encoding
space becomes exhausted. For example, we could encode the binding in 5
bits for outstanding references across 32 bindings and 27 bits (512 GB)
of dmabuf. If recvmsg wants to return a reference to a 33rd binding, it
waits until the user returns enough tokens to release one of the binding
encoding bits (at which point it could be reused for the new reference).

This, I think, sounds reasonable. supporting up to 2^5 rx dmabuf
bindings at once and 2^27 max dmabuf size should be fine for us I
think. Although you have to be patient with me, I have to make sure
via tests and code inspection that these new limits will be OK. Also
please understand the risk that even if the changes don't break us,
they may break someone and have to be reverted anyway, although I
think the risk is small.

Another suggestion I got from the team is to use a bitmap instead of
an array of atomics. I initially thought this could work, but thinking
about it more, I think that would not work, no? Because it's not 100%
guaranteed that the socket will only get 1 ref on a net_iov. In the
case where the driver fragments the net_iov, multiple difference frags
could point to the same net_iov which means multiple refs. So it seems
we're stuck with an array of atomic_t.

Yes that is correct, multiple references can occur so the bitmap
will lose all > 1 references.

quoted

2) opt into an extended token (dmabuf_token_v2) via sockopts, and add
the binding ID or other needed information there.

Eh, I would say this is an overkill? Today the limit of dma-bufs
supported is 2^27 and I think the dmabuf size is technically 2^32 or
something, but I don't know that we need all this flexibility for
devmem tcp. I think adding a breakdown like above may be fine.

That's my inclination too.

quoted

One way or another the single-binding restriction needs to be removed
I think. It's regressing a UAPI that currently works.

Thinking about this more, if we can't figure out a different way and
have to have a strict 1 socket to 1 dma-buf mapping, that may be
acceptable...

...the best way to do it is actually to do this, I think, would be to
actually make sure the user can't break the mapping via `ethtool -N`.
I.e. when the user tells us to update or delete a flow steering rule
that belongs to a devmem socket, reject the request altogether. At
that point we could we can be sure that the mapping would not change
anyway. Although I don't know how feasible to implement this is.

I will look at this and see if it is possible, I do think that would be
nicer for the user.

quoted

AFAICT as well AF_XDP is in a similar boat to devmem in this regard.
The AF_XDP docs require flow steering to be configured for the data to
be available in the umem (and I assume, if flow steering is
reconfigured then the data disappears from the umem?). Stan do you
know how this works? If AF_XDP allows the user to break it by
reconfiguring flow steering it may also be reasonable to allow the
user to break a devmem socket as well (although maybe with
clarification in the docs.

For af_xdp, there is a check in xsk_rcv_check (that runs _after_ bpf
program that does the redirect exits) and it drops the packet if
the packet is redirected to the unxpected queue (not the one that
the socket was bound to). And the only way to observe it after the fact
is a drop counter on the nic (or, rather, depends on the nic wrt how it
accounts it). I remember Willem wanted to remove that restriction,
but looks like he never got to it?

tl;dr we don't error out explicitly when the user misconfigures the steering
after the fact (or initially) and drop the packet in the data path.

Sounds like a reasonable default if the explicit ethtool rejection isn't
feasible.

Thanks for the review and comments!

Best,
Bobby

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help