Re: RFC: very rough draft of a bpf permission model
From: Alexei Starovoitov <hidden>
Date: 2019-08-26 22:36:04
Also in:
bpf, linux-api, netdev
On Fri, Aug 23, 2019 at 04:09:11PM -0700, Andy Lutomirski wrote:
On Thu, Aug 22, 2019 at 4:26 PM Alexei Starovoitov [off-list ref] wrote:
You're proposing all of the above in addition to CAP_BPF, right? Otherwise I don't see how it addresses the use cases I kept explaining for the last few weeks.
None of my proposal is intended to exclude changes like CAP_BPF to make privileged bpf() operations need less privilege. But I think it's very hard to evaluate CAP_BPF without both a full description of exactly what CAP_BPF would do and what at least one full example of a user would look like.
The example in the previous email and the systemd example were not "full"?
I also think that users who want CAP_BPF should look at manipulating their effective capability set instead. A daemon that wants to use bpf() but otherwise minimize the chance of accidentally causing a problem can use capset() to clear its effective and inheritable masks. Then, each time it wants to call bpf(), it could re-add CAP_SYS_ADMIN or CAP_NET_ADMIN to its effective set, call bpf(), and then clear its effective set again. This works in current kernels and is generally good practice.
Such logic means that CAP_NET_ADMIN is not necessary either. The process could re-add CAP_SYS_ADMIN when it needs to reconfigure network and then drop it.
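For readers who want to see the capset() pattern described above in concrete form, here is a minimal sketch, not code from this thread. It assumes the raw capget/capset syscalls with the v3 header from <linux/capability.h>; the helper names (drop_effective_and_inheritable, set_effective, with_cap, example_op, effective_word) are hypothetical, and example_op is only a stand-in for the real bpf() call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/capability.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fetch the capability sets of the calling thread (pid 0). */
static int caps_get(struct __user_cap_header_struct *hdr,
                    struct __user_cap_data_struct *data)
{
    hdr->version = _LINUX_CAPABILITY_VERSION_3;
    hdr->pid = 0;
    return syscall(SYS_capget, hdr, data) == 0 ? 0 : -1;
}

/* Clear the effective and inheritable sets. The permitted set is left
 * alone so that a capability can be raised again later. */
static int drop_effective_and_inheritable(void)
{
    struct __user_cap_header_struct hdr;
    struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];

    if (caps_get(&hdr, data) != 0)
        return -1;
    for (int i = 0; i < _LINUX_CAPABILITY_U32S_3; i++) {
        data[i].effective = 0;
        data[i].inheritable = 0;
    }
    return syscall(SYS_capset, &hdr, data) == 0 ? 0 : -1;
}

/* Raise (on != 0) or lower (on == 0) one capability in the effective set.
 * Raising fails unless the capability is in the permitted set. */
static int set_effective(int cap, int on)
{
    struct __user_cap_header_struct hdr;
    struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];

    assert(cap >= 0 && cap <= CAP_LAST_CAP);
    if (caps_get(&hdr, data) != 0)
        return -1;
    if (on)
        data[CAP_TO_INDEX(cap)].effective |= CAP_TO_MASK(cap);
    else
        data[CAP_TO_INDEX(cap)].effective &= ~CAP_TO_MASK(cap);
    return syscall(SYS_capset, &hdr, data) == 0 ? 0 : -1;
}

/* Hypothetical stand-in for the privileged operation, e.g. a bpf() call. */
static int example_op(void *arg)
{
    (void)arg;
    return 0;
}

/* Raise cap, run fn, then lower cap again. Returns -1 if the capability
 * could not be raised or lowered, otherwise fn's return value. */
static int with_cap(int cap, int (*fn)(void *), void *arg)
{
    if (set_effective(cap, 1) != 0)
        return -1;
    int ret = fn(arg);
    if (set_effective(cap, 0) != 0)
        return -1;
    return ret;
}

/* Return one 32-bit word of the effective set, or ~0u on error. */
static unsigned int effective_word(int idx)
{
    struct __user_cap_header_struct hdr;
    struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];

    if (caps_get(&hdr, data) != 0)
        return ~0u;
    return data[idx].effective;
}
```

A daemon would call drop_effective_and_inheritable() once at startup and then wrap each privileged call as with_cap(CAP_SYS_ADMIN, example_op, NULL). Note that capset() affects only the calling thread, so a multithreaded daemon would need to do this per thread.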
Aside from this, and depending on exactly what CAP_BPF would be, I have some further concerns. Looking at your example in this email:
Here is another example of use case that CAP_BPF is solving: The daemon X is started by pid=1 and currently runs as root. It loads a bunch of tracing progs and attaches them to kprobes and tracepoints. It also loads cgroup-bpf progs and attaches them to cgroups. All progs are collecting data about the system and logging it for further analysis.
This needs more than just bpf(). Creating a perf kprobe event requires CAP_SYS_ADMIN, and without a perf kprobe event, you can't attach a bpf program.
That is already solved by sysctl_perf_event_paranoid. CAP_BPF is about the BPF part only.
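As a small illustration of the knob mentioned here, the current setting can be read from procfs. This is only a sketch: the path /proc/sys/kernel/perf_event_paranoid is the usual location of this sysctl, the helper names read_int_file and perf_event_paranoid_level are hypothetical, and the code says nothing about which perf operations a given level permits.

```c
#include <stdio.h>

/* Read a single decimal integer from a file.
 * Returns 0 on success, -1 if the file is missing or malformed. */
static int read_int_file(const char *path, int *out)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = (fscanf(f, "%d", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Read the current perf_event_paranoid level (assumed procfs path). */
static int perf_event_paranoid_level(int *level)
{
    return read_int_file("/proc/sys/kernel/perf_event_paranoid", level);
}
```

A daemon could call perf_event_paranoid_level() at startup and log the value, so that a failure to create a perf event can be correlated with the system's setting.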
And the privilege to attach bpf programs to cgroups without any DAC or MAC checks (which is what the current API does) is an extremely broad privilege that is not that much weaker than CAP_SYS_ADMIN or CAP_NET_ADMIN. Also:
I don't think there is a hierarchy of CAP_SYS_ADMIN vs CAP_NET_ADMIN vs CAP_BPF. CAP_BPF and CAP_NET_ADMIN carve different areas of CAP_SYS_ADMIN. Just like all other caps.
This tracing bpf is looking into kernel memory and using bpf_probe_read. Clearly it's not _secure_. But it's _safe_. The system is not going to crash because of BPF, but it can easily crash because of simple coding bugs in the user space bits of that daemon.
The BPF verifier and interpreter, taken in isolation, may be extremely safe, but attaching BPF programs to various hooks can easily take down the system, deliberately or by accident. A handler, especially if it can access user memory or otherwise fault, will explode if attached to an inappropriate kprobe, hw_breakpoint, or function entry trace event.
absolutely not true.
(I and the other maintainers consider this to be a bug if it happens, and we'll fix it, but these bugs definitely exist.) A cgroup-bpf hook that blocks all network traffic will effectively kill a machine, especially if it's a server.
this permission is granted by CAP_NET_ADMIN. Nothing changes here.
A bpf program that runs excessively slowly attached to a high-frequency hook will kill the system, too.
not true either.
(I bet a buggy bpf program that calls bpf_probe_read() on an unmapped address repeatedly could be made extremely slow. Page faults take thousands to tens of thousands of cycles.)
kprobe probing and faulting on non-existent address will do the same 'damage'. So it's not bpf related. Also it won't make the system "extremely slow". Nothing to do with CAP_BPF.
A bpf firewall rule that's wrong can cut a machine off from the network -- I've killed machines using iptables more than once, and bpf isn't magically safer.
this is CAP_NET_ADMIN permission. It's a different capability.
I'm wondering if something like CAP_TRACING would make sense. CAP_TRACING would allow operations that can reveal kernel memory and other secret kernel state but that do not, by design, allow modifying system behavior. So, for example, CAP_TRACING would allow privileged perf_event_open() operations and privileged bpf verifier usage. But it would not allow cgroup-bpf unless further restrictions were added, and it would not allow the *_BY_ID operations, as those can modify other users' bpf programs' behavior.
Makes little sense to me. I can imagine CAP_TRACING controlling kprobe/uprobe creation and probe_read() both from the bpf side and from vanilla kprobe. That would be a much nicer interface to use than the existing sysctl_perf_event_paranoid, but that is orthogonal to CAP_BPF, which is strictly about BPF.
Something finer-grained can mitigate some of this. CAP_BPF as I think you're imagining it will not.
I'm afraid this discussion goes nowhere. We'll post CAP_BPF patches soon so we can discuss code.