Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
From: Jesper Dangaard Brouer <hidden>
Date: 2017-04-26 09:12:05
On Tue, 25 Apr 2017 20:07:34 -0700 John Fastabend [off-list ref] wrote:
On 17-04-25 05:26 PM, Alexei Starovoitov wrote:quoted
On Tue, Apr 25, 2017 at 11:34:53AM +0200, Jesper Dangaard Brouer wrote:quoted
quoted
Note the very first bpf patchset years ago contained the port table abstraction. ovs has concept of vports as well. These two very different projects needed port table to provide a layer of indirection between ifindex==netdev and virtual port number. This is still the case and I'd like to see this port table to be implemented for both cls_bpf and xdp. In that sense xdp is not special.Glad to hear you want to see this implemented, I will start coding on this then. Good point with cls_bpf, I was planning to make this port table strongly connected to XDP, guess I should also think of cls_bpf.perfect. I think we should try to make all additions to bpf networking world to be usable for both tc and xdp, since both are actively used and it wouldn't be great to have cool feature for one, but not the other. I think port table is an excellent candidate that applies to both.+1 Jesper, I was working up the code for the redirect piece for ixgbe and virtio, please use this as a base for your virtual port number table. I'll push an update onto github tomorrow. I think the table should drop in fairly nicely.
Cool, I will do that. Then, I'll also have a redirect method to shape this around, and I would have to benchmark/test your ixgbe redirect. (John please let me know, what github tree we are talking about, and what branch)
One piece that isn't clear to me is how do you plan to instantiate and program this table. Is it a new static bpf map that is created any time we see the redirect command? I think this would be preferred.
(This is difficult to explain without us misunderstanding each-other) As Alexei also mentioned before, ifindex vs port makes no real difference seen from the bpf program side. It is userspace's responsibility to add ifindex/port's to the bpf-maps, according to how the bpf program "policy" want to "connect" these ports. The port-table system add one extra step, of also adding this port to the port-table (which lives inside the kernel). When loading the XDP program, we also need to pass along a port table "id" this XDP program is associated with (and if it doesn't exists you create it). And your userspace "control-plane" application also need to know this port table "id", when adding a new port. The concept of having multiple port tables is key. As this implies we can have several simultaneous "data-planes" that is *isolated* from each-other. Think about how network-namespaces/containers want isolation. A subtle thing I'm afraid to mention, is that oppose to the ifindex model, a port table with mapping to a net_device pointer, would allow (faster) delivery into the container's inner net_device, which sort of violates the isolation, but I would argue it is not a problem as this net_device pointer could only be added from a process within the namespace. I like this feature, but it could easily be disallowed via port insertion-time validation.
quoted
quoted
I'm not worried about the DROP case, I agree that is fine (as you also say). The problem is unintentionally sending a packet to a wrong ifindex. This is clearly an eBPF program error, BUT with XDP this becomes a very hard to debug program error. With TC-redirect/cls_bpf we can tcpdump the packets, with XDP there is no visibility into this happening (the NSA is going to love this "feature"). Maybe we could add yet-another tracepoint to allow debugging this. My proposal that we simply remove the possibility for such program errors, by as you say move the validation from run-time into static insertion-time, via a port table.I think lack of tcpdump-like debugging in xdp is a separate issue. As I was saying in the other thread we have trivial 'xdpdump' kern+user app that emits pcap file, but it's too specific to how we use tail_calls+prog_array in our xdp setup. I'm working on the program chaining that will be generic and allow us transparently add multiple xdp or tc progs to the same attachment point and will allow us to do 'xdpdump' at any point of this pipeline, so debugging of what happened to the packet will be easier and done in the same way for both tc and xdp. btw in our experience working with both tc and xdp the tc+bpf was actually harder to use and more bug prone.Nice, the tcpdump-like debugging looks interesting.
Yes, this xdpdump sound like a very useful tool. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer