Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Song Liu <hidden>
Date: 2019-08-05 07:37:18
Also in:
bpf, linux-api, netdev
Hi Andy,
On Aug 4, 2019, at 10:47 PM, Andy Lutomirski [off-list ref] wrote: On Sun, Aug 4, 2019 at 5:08 PM Andy Lutomirski [off-list ref] wrote:quoted
On Sun, Aug 4, 2019 at 3:16 PM Andy Lutomirski [off-list ref] wrote:quoted
On Fri, Aug 2, 2019 at 12:22 AM Song Liu [off-list ref] wrote:quoted
Hi Andy,quoted
I actually agree CAP_BPF_ADMIN makes sense. The hard part is to makequoted
existing tools (setcap, getcap, etc.) and libraries aware of the new CAP.It's been done before -- it's not that hard. IMO the main tricky bit would be try be somewhat careful about defining exactly what CAP_BPF_ADMIN does.Agreed. I think defining CAP_BPF_ADMIN could be a good topic for the Plumbers conference. OTOH, I don't think we have to wait for CAP_BPF_ADMIN to allow daemons like systemd to do sys_bpf() without root.I don't understand the use case here. Are you talking about systemd --user? As far as I know, a user is expected to be able to fully control their systemd --user process, so giving it unrestricted bpf access is very close to giving it superuser access, and this doesn't sound like a good idea. I think that, if systemd --user needs bpf(), it either needs real unprivileged bpf() or it needs a privileged helper (SUID or a daemon) to intermediate this access.quoted
quoted
quoted
quoted
I don't see why you need to invent a whole new mechanism for this. The entire cgroup ecosystem outside bpf() does just fine using the write permission on files in cgroupfs to control access. Why can't bpf() do the same thing?It is easier to use write permission for BPF_PROG_ATTACH. But it is not easy to do the same for other bpf commands: BPF_PROG_LOAD and BPF_MAP_*. A lot of these commands don't have target concept. Maybe we should have target concept for all these commands. But that is a much bigger project. OTOH, "all or nothing" model allows all these commands at once.For BPF_PROG_LOAD, I admit I've never understood why permission is required at all. I think that CAP_SYS_ADMIN or similar should be needed to get is_priv in the verifier, but I think that should mainly be useful for tracing, and that requires lots of privilege anyway. BPF_MAP_* is probably the trickiest part. One solution would be some kind of bpffs, but I'm sure other solutions are possible.Improving permission management of cgroup_bpf is another good topic to discuss. However, it is also an overkill for current use case.I looked at the code some more, and I don't think this is so hard after all. As I understand it, all of the map..by_id stuff is, to some extent, deprecated in favor of persistent maps. As I see it, the map..by_id calls should require privilege forever, although I can imagine ways to scope that privilege to a namespace if the maps themselves were to be scoped to a namespace. Instead, unprivileged tools would use the persistent map interface roughly like this: $ bpftool map create /sys/fs/bpf/my_dir/filename type hash key 8 value 8 entries 64 name mapname This would require that the caller have either CAP_DAC_OVERRIDE or that the caller have permission to create files in /sys/fs/bpf/my_dir (using the same rules as for any filesystem), and the resulting map would end up owned by the creating user and have mode 0600 (or maybe 0666, or maybe a new bpf_attr parameter) modified by umask. Then all the various capable() checks that are currently involved in accessing a persistent map would instead check FMODE_READ or FMODE_WRITE on the map file as appropriate. Half of this stuff already works. I just set my system up like this: $ ls -l /sys/fs/bpf total 0 drwxr-xr-x. 3 luto luto 0 Aug 4 15:10 luto $ mkdir /sys/fs/bpf/luto/test $ ls -l /sys/fs/bpf/luto total 0 drwxrwxr-x. 2 luto luto 0 Aug 4 15:10 test I bet that making the bpf() syscalls work appropriately in this context without privilege would only be a couple of hours of work. The hard work, creating bpffs and making it function, is already done :) P.S. The docs for bpftool create are less than fantastic. The complete lack of any error message at all when the syscall returns -EACCES is also not fantastic.This isn't remotely finished, but I spent a bit of time fiddling with this: https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=bpf/perms What do you think? (It's obviously not done. It doesn't compile, and I haven't gotten to the permissions needed to do map operations. I also haven't touched the capable() checks.)I updated the branch. It compiles, and basic map functionality works!
Thanks a lot for trying this out. This is a very interesting direction that we will explore.
# mount -t bpf bpf /sys/fs/bpf # cd /sys/fs/bpf # mkdir luto # chown luto: luto # setpriv --euid=1000 --ruid=1000 bash $ pwd /sys/fs/bpf bash-5.0$ ls -l total 0 drwxr-xr-x 2 luto luto 0 Aug 4 22:41 luto bash-5.0$ bpftool map create /sys/fs/bpf/luto/filename type hash key 8 value 8 entries 64 name mapname bash-5.0$ bpftool map dump pinned /sys/fs/bpf/luto/filename Found 0 elements # chown root: /sys/fs/bpf/luto/filename $ bpftool map dump pinned /sys/fs/bpf/luto/filename Error: bpf obj get (/sys/fs/bpf/luto): Permission denied So I think it's possible to get a respectable subset of bpf() functionality working without privilege in short order :)
I think we have two key questions to answer: 1. What subset of bpf() functionality will the users need? 2. Who are the users? Different answers to these two questions lead to different directions. In our use case, the answers are 1) almost all bpf() functionality 2) highly trusted users (sudoers) So our initial approach of /dev/bpf allows all bpf() functionality in one bit in task_struct. (Yes, we can just sudo. But, we would rather not use sudo when possible.) "cgroup management" use case may have answers like: 1) cgroup_bpf only 2) users in their own containers For this case, getting cgroup_bpf related features (cgroup_bpf progs; some map types, etc.) work with unprivileged users would be the right direction. "USDT tracing" use case may have answers like: 1) uprobe, stockmap, histogram, etc. 2) unprivileged user, w/ or w/o containers For this case, the first step is likely hacking sys_perf_event_open(). I guess we will need more discussions to decide how to make bpf() work better for all these (and more) use cases. Thanks, Song