Re: [PATCH bpf-next v3 06/11] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS

From: Martin Lau <hidden>
Date: 2020-01-08 18:42:20
Also in: bpf

On Wed, Jan 08, 2020 at 05:53:50PM +0100, Daniel Borkmann wrote:

On Wed, Jan 08, 2020 at 01:52:26AM +0000, Martin Lau wrote:

quoted

On Wed, Jan 08, 2020 at 01:21:39AM +0100, Daniel Borkmann wrote:

quoted

On 12/31/19 7:20 AM, Martin KaFai Lau wrote:

quoted

The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
is a kernel struct with its func ptr implemented in bpf prog.
This new map is the interface to register/unregister/introspect
a bpf implemented kernel struct.

The kernel struct is actually embedded inside another new struct
(or called the "value" struct in the code).  For example,
"struct tcp_congestion_ops" is embbeded in:
struct bpf_struct_ops_tcp_congestion_ops {
	refcount_t refcnt;
	enum bpf_struct_ops_state state;
	struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
}
The map value is "struct bpf_struct_ops_tcp_congestion_ops".
The "bpftool map dump" will then be able to show the
state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g.
number of tcp_sock in the tcp_congestion_ops case).  This "value" struct
is created automatically by a macro.  Having a separate "value" struct
will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
"void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
initialization works before registering the struct_ops to the kernel
subsystem).  The libbpf will take care of finding and populating the
"struct bpf_struct_ops_XYZ" from "struct XYZ".

Register a struct_ops to a kernel subsystem:
1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
    set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
    running kernel.
    Instead of reusing the attr->btf_value_type_id,
    btf_vmlinux_value_type_id s added such that attr->btf_fd can still be
    used as the "user" btf which could store other useful sysadmin/debug
    info that may be introduced in the furture,
    e.g. creation-date/compiler-details/map-creator...etc.
3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
    in the running kernel btf.  Populate the value of this object.
    The function ptr should be populated with the prog fds.
4. Call BPF_MAP_UPDATE with the object created in (3) as
    the map value.  The key is always "0".

During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
the specific struct_ops to do some final checks in "st_ops->init_member()"
(e.g. ensure all mandatory func ptrs are implemented).
If everything looks good, it will register this kernel struct
to the kernel subsystem.  The map will not allow further update
from this point.

Btw, did you have any thoughts on whether it would have made sense to add
a new core construct for BPF aside from progs or maps, e.g. BPF modules
which then resemble a collection of progs/ops (given this would not be limited
to tcp congestion control only). Given the possibilities, having a bit of second
thoughts on abusing BPF map interface this way which is not overly pretty. It's
not a map anymore at this point anyway, we're just reusing the syscall interface
since it's convenient though cannot be linked to any prog is just a single slot
etc, but technically some sort of BPF module registration would be nicer. Also in
terms of 'bpftool modules' then listing all such currently loaded modules which
need to be cleaned up this way through explicit removal (similar to insmod/
lsmod/rmmod); at least feels more natural conceptually than BPF maps and the way
you refcount them, and would perhaps also be a fit for BPF lib helpers for dynamic
linking to load that way. So essentially similar but more lightweight infrastructure
as with kernel modules. Thoughts?

Inventing a new bpf obj type (vs adding new map type like in this patch) was
one considered (and briefly-tried) option.

Once BTF was introduced to bpf map,  I see bpf map as an introspectible
bpf obj that can store any blob described by BTF.  I don't think
creating a new bpf obj type worth it while both of them are basically
storing a value described by BTF.

I did try to create register/unregister interface and new bpf-cmd.
At the end, it ends up very similar to update_elem() which is basically
updating a blob of a struct described by BTF.  Hence, I tossed that and
came back to the current approach.

Put aside the new bpf obj type needs kernel support like another idr,
likely pin-able, fd, get_info...etc,  I suspect most users have already
been used to do 'bpftool map dump' to introspect bpf obj that is storing
a 'struct'.

The map type is enough to distinguish the map usage instead of creating
another bpf obj type.  The 'bpftool modules' will work on the struct_ops
map only.

Right, but under long-term I'd expect more users of this interface and given
we abuse the map only to keep other entities (here: bpf tcp congctl module)
'alive', but cannot do anything else with this map (as in: usage in the BPF
program),

For now, yes.  In the future, a bpf_prog may want to switch to another
bpf-tcp-cc (could be by looking it up from map-in-map also).  I do not
mean there is an immediate usecase but it is good to keep this
flexibility.

it feels that this begs for a better interface. Given we need an
explicit delete operation of the map slot in order to eventually unregister
the congctl module once no application is using it anymore, how are users
supposed to operate this considering the loader performs either only a load
or crashes before the map delete happens? If you had 'bpftool modules' like
cmdline interface with similar insmod/lsmod/rmmod type operation as we have
for kernel modules, it's pretty obvious and intuitive. Here, you'd need a
'bpftool map dump' to get to the concrete ops map, and then perform an
explicit delete operation for releasing the ops refcount and thus to unload
the set of progs. Such extension for bpftool should be done regardless, even

A new bpftool command to operate on struct_ops map alone is in the pipeline.

The first thing though is to improve bpftool to recognize
btf_vmlinux_value_type_id which could be useful in the future
maps that also store a kernel's struct.

Regarding 'bpftool map show' first to figure out which
'struct_ops' map to delete,  the same is also true for lsmod/rmmod.
I also usually do lsmod to figure out which one I am looking at first
before issuing rmmod.  I suspect even the same lookup and then
delete/rmmod operation will still have to be done for the future
'bpftool modules (or struct_ops)' command.

if we end up to keep abusing the map interface for this, but API wise feels
way cleaner to have a dedicated register/unregister interface.

Other than the BPF syscall command name difference, lets explore how
would register/unregister be different from update/delete.

The first attempt I did on BPF_STRUCT_OPS_REGISTER is to do update alone
which ends-up very close to BPF_MAP_UPDATE_ELEM.

The second attempt I did on register is to do map-create and map-update
together and then return a fd.  However, I still don't see enough
benefit that deserves a separate BPF command to just combine these
two.  The global .rodata map is also map-create, map-update,
and map-freeze which technically it can do all of them under
a new command.

If update-vs-register looks the same, it then logically follows
update-then-delete.  For BPF_STRUCT_OPS_UNREGISTER, I also do not
see any difference except the key can be avoided.  Also, the
struct_ops map has a btf_key_type_id 0 which is "void" and
it is a clean interface to tell the map is key-less.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help