Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction

From: Toke Høiland-Jørgensen <hidden>
Date: 2020-03-09 09:42:43
Also in: bpf

Daniel Borkmann [off-list ref] writes:

On 3/6/20 9:31 AM, Toke Høiland-Jørgensen wrote:

quoted

Daniel Borkmann [off-list ref] writes:

quoted

On 3/5/20 11:50 PM, Alexei Starovoitov wrote:

quoted

On Thu, Mar 05, 2020 at 11:34:18PM +0100, Daniel Borkmann wrote:

quoted

On 3/5/20 5:34 PM, Alexei Starovoitov wrote:

quoted

On Thu, Mar 05, 2020 at 11:37:11AM +0100, Toke Høiland-Jørgensen wrote:

quoted

Alexei Starovoitov [off-list ref] writes:

quoted

On Wed, Mar 04, 2020 at 08:47:44AM +0100, Toke Høiland-Jørgensen wrote:

[...]

quoted

Anyway, what I was trying to express:

quoted

Still that doesn't mean that pinned link is 'immutable'.

I don't mean 'immutable' in the sense that it cannot be removed ever.
Just that we may end up in a situation where an application can see a
netdev with an XDP program attached, has the right privileges to modify
it, but can't because it can't find the pinned bpf_link. Right? Or am I
misunderstanding your proposal?

Amending my example from before, this could happen by:

1. Someone attaches a program to eth0, and pins the bpf_link to
      /sys/fs/bpf/myprog

2. eth0 is moved to a different namespace which mounts a new sysfs at
      /sys

3. Inside that namespace, /sys/fs/bpf/myprog is no longer accessible, so
      xdp-loader can't get access to the original bpf_link; but the XDP
      program is still attached to eth0.

The key to decide is whether moving netdev across netns should be allowed
when xdp attached. I think it should be denied. Even when legacy xdp
program is attached, since it will confuse user space managing part.

There are perfectly valid use cases where this is done already today (minus
bpf_link), for example, consider an orchestrator that is setting up the BPF
program on the device, moving to the newly created application pod during
the CNI call in k8s, such that the new pod does not have the /sys/fs/bpf/
mount instance and if unprivileged cannot remove the BPF prog from the dev
either. We do something like this in case of ipvlan, meaning, we attach a
rootlet prog that calls into single slot of a tail call map, move it to the
application pod, and only out of Cilium's own pod and it's pod-local bpf fs
instance we manage the pinned tail call map to update the main programs in
that single slot w/o having to switch any netns later on.

Right. You mentioned this use case before, but I managed to forget about it.
Totally makes sense for prog to stay attached to netdev when it's moved.
I think pod manager would also prefer that pod is not able to replace
xdp prog from inside the container. It sounds to me that steps 1,2,3 above
is exactly the desired behavior. Otherwise what stops some application
that started in a pod to override it?

Generally, yes, and it shouldn't need to care nor see what is happening in
/sys/fs/bpf/ from the orchestrator at least (or could potentially have its
own private mount under /sys/fs/bpf/ or elsewhere). Ideally, the behavior
should be that orchestrator does all the setup out of its own namespace,
then moves everything over to the newly created target namespace and e.g.
only if the pod has the capable(cap_sys_admin) permissions, it could mess
around with anything attached there, or via similar model as done in [0]
when there is a master device.

Yup, I can see how this can be a reasonable use case where you *would*
want the locking. However, my concern is that there should be a way for
an admin to recover from this (say, if it happens by mistake, or a
misbehaving application). Otherwise, I fear we'll end up with support
cases where the only answer is "try rebooting", which is obviously not
ideal.

I'm not quite sure I follow the concern, if you're an admin and have
the right permissions, then you should be able to introspect and
change settings like with anything else there is today.

Well, that's what I want to make sure of :)

However, I don't think such introspection is possible today? Or at least
there's no API exposed to do this, you'll have to go write drgn scripts
or something. But I expect an admin will want a command like 'xdp unload
eth0 --yes-i-really-really-mean-this', which would override any locking
done by bpf_link. So how to implement that? It's not enough to learn
'this bpf_link is pinned at links/id-123 on bpffs', you'll also need to
learn on *which* bpffs, and where to find that mountpoint. So how do you
do that?

Whereas an API that says 'return the bpf_link currently attached to
ifindex X' would sidestep this issue; but then, that is basically the
netlink API we have already, except it doesn't have the bpf_link
abstraction... So why do we need bpf_link?

quoted

Last time I looked, there is a down/up cycle on the dev upon netns
migration and it flushes e.g. attached qdiscs afaik, so there are
limitations that still need to be addressed. Not sure atm if same is
happening to XDP right now.

XDP programs will stay attached. devmaps (for redirect) have a notifier
that will remove devices when they move out of a namespace. Not sure if
there are any other issues with netns moves somewhere.

quoted

In this regards veth devices are a bit nicer to work with since
everything can be attached on hostns ingress w/o needing to worry on
the peer dev in the pod's netns.

Presumably the XDP EGRESS hook that David Ahern is working on will make
this doable for XDP on veth as well?

I'm not sure I see a use-case for XDP egress for Cilium yet, but maybe
I'm still lacking a clear picture on why one should use it. We
currently use various layers where we orchestrate our BPF programs
from the agent. XDP/rx on the phys nic on the one end, BPF sock progs
attached to cgroups on the other end of the spectrum. The processing
in between on virtual devices is mainly tc/BPF-based since everything
is skb based anyway and more flexible also with interaction with the
rest of the stack. There is also not this pain of having to linearize
all the skbs, but at least there's a path to tackle that.

I agree that there's not really any reason to use XDP on veth as long as
you'll end up with an skb eventually anyway. The only reason to do
something different I can think of is if you have an application inside
a container using AF_XDP, and you want to carry XDP frames all the way
through to that without ever building an skb. Not really sure this is a
good deployment model, but I kinda suspect that the NFV guys will want
to do this eventually...

-Toke

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help