Re: Redux: Backwards compatibility for XDP multi-buff
From: Toke Høiland-Jørgensen <hidden>
Date: 2021-09-23 18:45:35
Also in: bpf
Zvi Effron [off-list ref] writes:
On Wed, Sep 22, 2021 at 1:01 PM Toke Høiland-Jørgensen [off-list ref] wrote:
Jakub Kicinski [off-list ref] writes:
On Wed, 22 Sep 2021 00:20:19 +0200 Toke Høiland-Jørgensen wrote:
Neither of those are desirable outcomes, I think; and if we add a separate "XDP multi-buff" switch, we might as well make it system-wide?

If we have an internal flag 'this driver supports multi-buf xdp' cannot we make xdp_redirect to linearize in case the packet is being redirected to non multi-buf aware driver (potentially with corresponding non mb aware xdp progs attached) from mb aware driver?

Hmm, the assumption that XDP frames take up at most one page has been fundamental from the start of XDP. So what does linearise mean in this context? If we get a 9k packet, should we dynamically allocate a multi-page chunk of contiguous memory and copy the frame into that, or were you thinking something else?

My $.02 would be to not care about redirect at all. It's not like the user experience with redirect is anywhere close to amazing right now. Besides (with the exception of SW devices which will likely gain mb support quickly) mixed-HW setups are very rare. If the source of the redirect supports mb so will likely the target.

It's not about device support it's about XDP program support: If I run an MB-aware XDP program on a physical interface and redirect the (MB) frame into a container, and there's an XDP program running inside that container that isn't MB-aware, bugs will ensue. Doesn't matter if the veth driver itself supports MB...

We could leave that as a "don't do that, then" kind of thing, but that was what we were proposing (as the "do nothing" option) and got some pushback on, hence why we're having this conversation :)

-Toke

I hadn't even considered the case of redirecting to a veth pair on the same system. I'm assuming from your statement that the buffers are passed directly to the ingress inside the container and don't go through the sort of egress process they would if leaving the system? And I'm assuming that's as an optimization?
Yeah, if we redirect an XDP frame to a veth, the peer will get the same xdp_frame, without ever building an SKB.
I'm not sure that makes a difference, though. It's not about whether the driver's code is mb-capable, it's about whether the driver _as currently configured_ could generate multiple buffers. If it can, then only an mb-aware program should be able to be attached to it (and tail called from whatever's attached to it). If it can't, then there should be no way to have multiple buffers come to it. So in the situation you've described, either the veth driver should be in a state where it coalesces the multiple buffers into one (fragmenting the frame if necessary) or drops the frame, or the program attached inside the container would need to be mb-aware. I'm assuming that with the veth driver as written, this might mean that all programs attached to the veth driver would need to be mb-aware, which is obviously undesirable.
Hmm, I guess that as long as mb-frames only show up for large MTUs, the MTU of the veth device would be a limiting factor just like for physical devices, so we could just apply the same logic there. Not sure why I didn't consider that before :/
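To make the rule above concrete, here is a minimal Python model of it (a toy sketch, not kernel code). It assumes a 4096-byte page, 256 bytes of XDP headroom and a hypothetical 320 bytes of tailroom reserved at the end of the buffer. The real limits depend on the driver and architecture, and every function name below is invented for illustration.

```python
# Toy model of the attach rule discussed above; not kernel code.
# All constants and function names are assumptions made for illustration.

PAGE_SIZE = 4096      # assumed page size in bytes
XDP_HEADROOM = 256    # assumed headroom reserved in front of the packet
TAILROOM = 320        # hypothetical space reserved at the end of the buffer
ETH_HLEN = 14         # Ethernet header length


def max_single_buffer_mtu() -> int:
    """Largest MTU whose frames still fit in one page-sized buffer."""
    return PAGE_SIZE - XDP_HEADROOM - TAILROOM - ETH_HLEN


def may_generate_multibuf(mtu: int) -> bool:
    """A device, as currently configured, can only produce mb frames
    if its MTU is too large for a single buffer."""
    return mtu > max_single_buffer_mtu()


def can_attach(mtu: int, prog_is_mb_aware: bool) -> bool:
    """Refuse the attach only when a non-mb-aware program could
    end up seeing multi-buffer frames."""
    return prog_is_mb_aware or not may_generate_multibuf(mtu)
```

Under these assumptions, a veth with a 1500-byte MTU accepts any program, while one with a 9000-byte MTU accepts only mb-aware programs, which is the same logic that would apply to a physical device.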
All of which significantly adds to the complexity of supporting mb-aware programs, so maybe this could be developed later? Initially we could have a sysctl toggling the state: 0 for single-buffer only, 1 for multi-buffer allowed. Then later we _could_ add a state for dynamic control once all XDP-supporting drivers support the necessary dynamic functionality (if ever). At that point we'd have actual experience with the sysctl and could see how much of a burden having static control is. I may have been misinterpreting your use case though, and you were talking about the XDP program running on the egress side of the redirect? Is that the case you were talking about?
No, I was talking about exactly what you outlined above. Although longer term, I also think we can use XDP mb as a way to avoid having to linearise SKBs when running XDP on them in veth (and for generic XDP) :)

-Toke
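For reference, the two-state sysctl proposed in the quoted text could be modelled roughly as follows. This is a sketch of one possible reading, not an existing kernel interface: the enum, the function names and the exact policy (reject an oversized MTU only while XDP is attached and the switch is 0) are all assumptions.

```python
from enum import IntEnum


class XdpMbMode(IntEnum):
    """Hypothetical system-wide switch with the two proposed states."""
    SINGLE_BUFFER_ONLY = 0
    MULTI_BUFFER_ALLOWED = 1


def parse_mode(raw: str) -> XdpMbMode:
    """Parse a sysctl-style string; any value other than 0 or 1
    raises ValueError."""
    return XdpMbMode(int(raw.strip()))


def mtu_change_allowed(mode: XdpMbMode,
                       frame_fits_one_buffer: bool,
                       xdp_attached: bool) -> bool:
    """Assumed policy: an MTU that needs more than one buffer is only
    rejected when XDP is attached and the switch is in state 0."""
    if not xdp_attached or frame_fits_one_buffer:
        return True
    return mode is XdpMbMode.MULTI_BUFFER_ALLOWED
```

A later "dynamic control" state, as floated in the quoted text, would become a third enum member; it is left out here because its semantics are not specified in the thread.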