Thread (34 messages), 7 authors, 2017-04-30

Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?

From: Jesper Dangaard Brouer <hidden>
Date: 2017-04-26 09:12:05

On Tue, 25 Apr 2017 20:07:34 -0700
John Fastabend [off-list ref] wrote:
On 17-04-25 05:26 PM, Alexei Starovoitov wrote:
On Tue, Apr 25, 2017 at 11:34:53AM +0200, Jesper Dangaard Brouer wrote:  
Note the very first bpf patchset years ago contained the port table
abstraction. ovs has concept of vports as well. These two very
different projects needed port table to provide a layer of
indirection between ifindex==netdev and virtual port number.
This is still the case and I'd like to see this port table to be
implemented for both cls_bpf and xdp. In that sense xdp is not
special.  
Glad to hear you want to see this implemented, I will start coding on
this then.  Good point with cls_bpf, I was planning to make this port
table strongly connected to XDP, guess I should also think of cls_bpf.  
perfect.
I think we should try to make all additions to bpf networking world
to be usable for both tc and xdp, since both are actively used and
it wouldn't be great to have a cool feature for one, but not the other.
I think port table is an excellent candidate that applies to both.  
+1

Jesper, I was working up the code for the redirect piece for ixgbe and
virtio, please use this as a base for your virtual port number table. I'll
push an update onto github tomorrow. I think the table should drop in fairly
nicely.
Cool, I will do that. Then, I'll also have a redirect method to shape
this around, and I would have to benchmark/test your ixgbe redirect.

(John please let me know, what github tree we are talking about, and
what branch)

One piece that isn't clear to me is how do you plan to instantiate and
program this table. Is it a new static bpf map that is created any
time we see the redirect command? I think this would be preferred.
(This is difficult to explain without us misunderstanding each-other)

As Alexei also mentioned before, ifindex vs port makes no real
difference seen from the bpf program side.  It is userspace's
responsibility to add ifindexes/ports to the bpf-maps, according to how
the bpf program "policy" wants to "connect" these ports.  The
port-table system adds one extra step: also adding this port to the
port-table (which lives inside the kernel).
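
To make the two steps concrete, here is a rough userspace model in plain
C.  It is only a sketch and not kernel code: struct port_table,
struct policy_map and all function names are hypothetical, and the
"unregistered port resolves to 0, meaning drop" behaviour is my
assumption about how a lookup miss would be handled.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PORTS 8              /* hypothetical table size */

/* Hypothetical kernel-side port table: virtual port -> ifindex. */
struct port_table {
    int ifindex[MAX_PORTS];      /* 0 means "slot empty" */
};

/* Hypothetical bpf-map holding the program's policy: ingress -> egress port. */
struct policy_map {
    int egress_port[MAX_PORTS];  /* -1 means "no rule" */
};

void tables_init(struct port_table *pt, struct policy_map *pm)
{
    assert(pt != NULL && pm != NULL);
    for (size_t i = 0; i < MAX_PORTS; i++) {
        pt->ifindex[i] = 0;
        pm->egress_port[i] = -1;
    }
}

static int port_valid(int port)
{
    return port >= 0 && port < MAX_PORTS;
}

/* The extra step: userspace registers the port in the port table. */
int port_table_add(struct port_table *pt, int port, int ifindex)
{
    if (!port_valid(port) || ifindex <= 0)
        return -1;
    pt->ifindex[port] = ifindex;
    return 0;
}

/* The usual step: userspace "connects" ports in the policy bpf-map. */
int policy_connect(struct policy_map *pm, int in_port, int out_port)
{
    if (!port_valid(in_port) || !port_valid(out_port))
        return -1;
    pm->egress_port[in_port] = out_port;
    return 0;
}

/* Data path: the program picks a port, the port table resolves it.
 * Returns the egress ifindex, or 0 (drop) when nothing is registered. */
int resolve_egress_ifindex(const struct port_table *pt,
                           const struct policy_map *pm, int in_port)
{
    if (!port_valid(in_port))
        return 0;
    int out = pm->egress_port[in_port];
    if (!port_valid(out))
        return 0;
    return pt->ifindex[out];
}
```

The point of the model is that the bpf program only ever deals in port
numbers; a policy entry pointing at a port nobody registered ends in a
drop rather than in a packet leaving on some unrelated ifindex.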

When loading the XDP program, we also need to pass along the port table
"id" this XDP program is associated with (and if it doesn't exist, you
create it).  And your userspace "control-plane" application also needs
to know this port table "id" when adding a new port.
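
A minimal sketch of that association, again as a plain C model with
hypothetical names (registry, xdp_prog_load, control_plane_add_port);
none of this is an existing kernel API.  I assume here that program
load creates a missing table, while the control-plane call fails on an
unknown id.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TABLES 4   /* hypothetical limits */
#define MAX_PORTS  8

struct port_table {
    int id;
    int in_use;
    int ifindex[MAX_PORTS];   /* 0 means "slot empty" */
};

/* Hypothetical registry of all port tables known to the kernel. */
static struct port_table registry[MAX_TABLES];

struct port_table *port_table_lookup(int id)
{
    for (size_t i = 0; i < MAX_TABLES; i++)
        if (registry[i].in_use && registry[i].id == id)
            return &registry[i];
    return NULL;
}

/* Find the table with this id, or create it in the first free slot. */
struct port_table *port_table_get_or_create(int id)
{
    struct port_table *pt = port_table_lookup(id);
    if (pt)
        return pt;
    for (size_t i = 0; i < MAX_TABLES; i++) {
        if (!registry[i].in_use) {
            pt = &registry[i];
            pt->in_use = 1;
            pt->id = id;
            for (size_t p = 0; p < MAX_PORTS; p++)
                pt->ifindex[p] = 0;
            return pt;
        }
    }
    return NULL;   /* registry full */
}

/* Stand-in for a loaded XDP program: it only remembers its table. */
struct xdp_prog_model {
    struct port_table *table;
};

/* Program load passes the table id; a missing table is created. */
int xdp_prog_load(struct xdp_prog_model *prog, int table_id)
{
    assert(prog != NULL);
    prog->table = port_table_get_or_create(table_id);
    return prog->table ? 0 : -1;
}

/* The control plane must name the same id when adding a port. */
int control_plane_add_port(int table_id, int port, int ifindex)
{
    struct port_table *pt = port_table_lookup(table_id);
    if (!pt || port < 0 || port >= MAX_PORTS || ifindex <= 0)
        return -1;
    pt->ifindex[port] = ifindex;
    return 0;
}
```

Two programs loaded with the same id end up sharing one table, and a
port added under one id is invisible to programs using another id.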

The concept of having multiple port tables is key, as this implies we
can have several simultaneous "data-planes" that are *isolated* from
each other.  Think about how network-namespaces/containers want
isolation. A subtle thing I'm afraid to mention is that, as opposed to
the ifindex model, a port table with a mapping to a net_device pointer
would allow (faster) delivery into the container's inner net_device,
which sort of violates the isolation, but I would argue it is not a
problem, as this net_device pointer could only be added from a process
within the namespace.  I like this feature, but it could easily be
disallowed via port insertion-time validation.
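
What I mean by insertion-time validation could look roughly like the
model below.  This is a sketch under assumptions: netdev_model stands in
for a net_device, namespaces are plain integers, and the
allow_cross_netns flag is a hypothetical per-table knob for switching
the "deliver into the container" feature on or off.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PORTS 8   /* hypothetical table size */

/* Stand-in for a net_device: only what the checks need. */
struct netdev_model {
    int ifindex;
    int netns;        /* namespace the device lives in */
};

struct port_table {
    int owner_netns;          /* namespace that created the table */
    bool allow_cross_netns;   /* hypothetical policy knob */
    const struct netdev_model *dev[MAX_PORTS];
};

void port_table_init(struct port_table *pt, int owner_netns,
                     bool allow_cross_netns)
{
    assert(pt != NULL);
    pt->owner_netns = owner_netns;
    pt->allow_cross_netns = allow_cross_netns;
    for (size_t i = 0; i < MAX_PORTS; i++)
        pt->dev[i] = NULL;
}

/* All validation happens here, once, when the port is inserted. */
int port_insert(struct port_table *pt, int port,
                const struct netdev_model *dev, int caller_netns)
{
    if (port < 0 || port >= MAX_PORTS || dev == NULL)
        return -1;
    /* Only a process inside the device's namespace may add it. */
    if (dev->netns != caller_netns)
        return -1;
    /* Optionally refuse devices from outside the table's namespace. */
    if (!pt->allow_cross_netns && dev->netns != pt->owner_netns)
        return -1;
    pt->dev[port] = dev;
    return 0;
}

/* Data path: a plain pointer fetch, no per-packet validation. */
const struct netdev_model *port_lookup(const struct port_table *pt, int port)
{
    if (port < 0 || port >= MAX_PORTS)
        return NULL;
    return pt->dev[port];
}
```

With the knob enabled, a process inside the container can add its inner
device and the host data-plane can deliver to it directly; with it
disabled, the same insert is simply rejected up front.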

I'm not worried about the DROP case, I agree that is fine (as you
also say).  The problem is unintentionally sending a packet to a
wrong ifindex.  This is clearly an eBPF program error, BUT with
XDP this becomes a very hard to debug program error.  With
TC-redirect/cls_bpf we can tcpdump the packets, with XDP there is
no visibility into this happening (the NSA is going to love this
"feature").  Maybe we could add yet-another tracepoint to allow
debugging this.  My proposal is that we simply remove the possibility
for such program errors, by as you say move the validation from
run-time into static insertion-time, via a port table.  
I think lack of tcpdump-like debugging in xdp is a separate issue.
As I was saying in the other thread we have trivial 'xdpdump'
kern+user app that emits pcap file, but it's too specific to how we
use tail_calls+prog_array in our xdp setup. I'm working on the
program chaining that will be generic and allow us to transparently
add multiple xdp or tc progs to the same attachment point and will
allow us to do 'xdpdump' at any point of this pipeline, so
debugging of what happened to the packet will be easier and done in
the same way for both tc and xdp.
btw in our experience working with both tc and xdp the tc+bpf was
actually harder to use and more bug prone.
  
Nice, the tcpdump-like debugging looks interesting.
Yes, this xdpdump sounds like a very useful tool.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer