Re: [PATCH net-next 2/2] net: devmem: use niov array for token management
From: Bobby Eshleman <hidden>
Date: 2025-09-08 16:16:13
Also in:
lkml
On Fri, Sep 05, 2025 at 01:38:39PM -0700, Stanislav Fomichev wrote:
On 09/05, Mina Almasry wrote:quoted
On Wed, Sep 3, 2025 at 5:15 PM Bobby Eshleman [off-list ref] wrote:quoted
On Wed, Sep 03, 2025 at 01:20:57PM -0700, Mina Almasry wrote:quoted
On Tue, Sep 2, 2025 at 2:36 PM Bobby Eshleman [off-list ref] wrote:
[...]
quoted
quoted
quoted
AFAIU, if you made sk_user_frags an array of (unref, binding) tuples instead of just an array of urefs then you can remove the single-binding restriction. Although, I wonder what happens if the socket receives the netmem at the same index on 2 different dmabufs. At that point I assume the wrong uref gets incremented? :(Right. We need some bits to differentiate bindings. Here are some ideas I've had about this, I wonder what your thoughts are on them: 1) Encode a subset of bindings and wait for availability if the encoding space becomes exhausted. For example, we could encode the binding in 5 bits for outstanding references across 32 bindings and 27 bits (512 GB) of dmabuf. If recvmsg wants to return a reference to a 33rd binding, it waits until the user returns enough tokens to release one of the binding encoding bits (at which point it could be reused for the new reference).This, I think, sounds reasonable. supporting up to 2^5 rx dmabuf bindings at once and 2^27 max dmabuf size should be fine for us I think. Although you have to be patient with me, I have to make sure via tests and code inspection that these new limits will be OK. Also please understand the risk that even if the changes don't break us, they may break someone and have to be reverted anyway, although I think the risk is small. Another suggestion I got from the team is to use a bitmap instead of an array of atomics. I initially thought this could work, but thinking about it more, I think that would not work, no? Because it's not 100% guaranteed that the socket will only get 1 ref on a net_iov. In the case where the driver fragments the net_iov, multiple difference frags could point to the same net_iov which means multiple refs. So it seems we're stuck with an array of atomic_t.
Yes that is correct, multiple references can occur so the bitmap will lose all > 1 references.
quoted
quoted
2) opt into an extended token (dmabuf_token_v2) via sockopts, and add the binding ID or other needed information there.Eh, I would say this is an overkill? Today the limit of dma-bufs supported is 2^27 and I think the dmabuf size is technically 2^32 or something, but I don't know that we need all this flexibility for devmem tcp. I think adding a breakdown like above may be fine.
That's my inclination too.
quoted
quoted
quoted
One way or another the single-binding restriction needs to be removed I think. It's regressing a UAPI that currently works.Thinking about this more, if we can't figure out a different way and have to have a strict 1 socket to 1 dma-buf mapping, that may be acceptable... ...the best way to do it is actually to do this, I think, would be to actually make sure the user can't break the mapping via `ethtool -N`. I.e. when the user tells us to update or delete a flow steering rule that belongs to a devmem socket, reject the request altogether. At that point we could we can be sure that the mapping would not change anyway. Although I don't know how feasible to implement this is.
I will look at this and see if it is possible, I do think that would be nicer for the user.
quoted
AFAICT as well AF_XDP is in a similar boat to devmem in this regard. The AF_XDP docs require flow steering to be configured for the data to be available in the umem (and I assume, if flow steering is reconfigured then the data disappears from the umem?). Stan do you know how this works? If AF_XDP allows the user to break it by reconfiguring flow steering it may also be reasonable to allow the user to break a devmem socket as well (although maybe with clarification in the docs.For af_xdp, there is a check in xsk_rcv_check (that runs _after_ bpf program that does the redirect exits) and it drops the packet if the packet is redirected to the unxpected queue (not the one that the socket was bound to). And the only way to observe it after the fact is a drop counter on the nic (or, rather, depends on the nic wrt how it accounts it). I remember Willem wanted to remove that restriction, but looks like he never got to it? tl;dr we don't error out explicitly when the user misconfigures the steering after the fact (or initially) and drop the packet in the data path.
Sounds like a reasonable default if the explicit ethtool rejection isn't feasible. Thanks for the review and comments! Best, Bobby