Thread (45 messages) 45 messages, 8 authors, 2024-10-08

Re: [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT

From: Lorenzo Bianconi <lorenzo@kernel.org>
Date: 2024-09-27 15:06:58
Also in: bpf, intel-wired-lan

On Sep 27, Arthur Fabre wrote:
On Fri Sep 27, 2024 at 12:24 PM CEST, Toke Høiland-Jørgensen wrote:
quoted
"Arthur Fabre" [off-list ref] writes:
quoted
quoted
quoted
quoted
The nice thing about an API like this, though, is that it's extensible,
and the kernel itself can be just another consumer of it for the
metadata fields Lorenzo is adding in this series. I.e., we could just
pre-define some IDs for metadata vlan, timestamp etc, and use the same
functions as above from within the kernel to set and get those values;
using the registry, there could even be an option to turn those off if
an application wants more space for its own usage. Or, alternatively, we
could keep the kernel-internal IDs hardcoded and always allocated, and
just use the getter/setter functions as the BPF API for accessing them.
That's exactly what I'm thinking of too, a simple API like:

get(u8 key, u8 len, void *val);
set(u8 key, u8 len, void *val);

With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata.

If a NIC doesn't support a certain well-known metadata, the key
wouldn't be set, and get() would return ENOENT.

I think this also lets us avoid having to "register" keys or bits of
metadata with the kernel.
We'd reserve some number of keys for hardware metadata.
Right, but how do you allocate space/offset for each key without an
explicit allocation step? You'd basically have to encode the list of IDs
in the metadata area itself, which implies a TLV format that you have to
walk on every access? The registry idea in my example above was
basically to avoid that...
I've been playing around with having a small fixed header at the front
of the metadata itself, that lets you access values without walking them
all.

Still WIP, and maybe this is too restrictive, but it lets you encode 64
2, 4, or 8 byte KVs with a single 16 byte header:
Ohh, that's clever, I like it! :)

It's also extensible in the sense that the internal representation can
change without impacting the API, so if we end up needing more bits we
can always add those.

Maybe it would be a good idea to make the 'key' parameter a larger
integer type (function parameters are always 64-bit anyway, so might as
well go all the way up to u64)? That way we can use higher values for
the kernel-reserved types instead of reserving part of the already-small
key space for applications (assuming the kernel-internal data is stored
somewhere else, like in a static struct as in Lorenzo's patch)?
Good idea! That makes it more extensible too if we ever support more
keys or bigger lengths.

I'm not sure where the kernel-reserved types should live. Putting them
in here uses up some the of KV IDs, but it uses less head room space than 
always reserving a static struct for them.
Maybe it doesn't matter much, as long as we use the same API to access
them, we could internally switch between one and the other.
quoted
[...]
quoted
quoted
quoted
The remaining keys would be up to users. They'd have to allocate keys
to services, and configure services to use those keys.
This is similar to the way listening on a certain port works: only one
service can use port 80 or 443, and that can typically beconfigured in
a service's config file.
Right, well, port numbers *do* actually have an out of band service
registry (IANA), which I thought was what we wanted to avoid? ;)
Depends how you think about it ;)

I think we should avoid a global registry. But having a registry per
deployment / server doesn't seem awful. Services that want to use a
field would have a config file setting to set which index it corresponds
to.
Admins would just have to pick a free one on their system, and set it in
the config file of the service.

This is similar to what we do for non-IANA registered ports internally.
For example each service needs a port on an internal interface to expose
metrics, and we just track which ports are taken in our config
management.
Right, this would work, but it assumes that applications are
well-behaved and do this correctly. Which they probably do in a
centrally-managed system like yours, but for random applications shipped
by distros, I'm not sure if it's going to work.

In fact, it's more or less the situation we have with skb->mark today,
isn't it? I.e., applications can co-exist as long as they don't use the
same bits, so they have to coordinate on which bits to use. Sure, with
this scheme there will be more total bits to use, which can lessen the
pressure somewhat, but the basic problem remains. In other words, I
worry that in practice we will end up with another github repository
serving as a registry for metadata keys...
That's true. If applications hardcode keys, we'll be back to having
conflicts. If someone creates a registry on github I'll be very sad.

(Maybe we can make the verifier enforce that the key passed to get() and
set() isn't a constant? - only joking)

Applications don't tend to do this for ports though, I think most can be
configured to listen on any port. Is that just because it's been
instilled as "good practice" over time? Could we try to do the same with
some very stern documentation and examples?

Thinking about it more, my only relectance for a registration API is how
to communicate the ID back to other consumers (our discussion below).
quoted
quoted
Dynamically registering fields means you have to share the returned ID
with whoever is interested, which sounds tricky.
If an XDP program sets a field like packet_id, every tracing
program that looks at it, and userspace service, would need to know what
the ID of that field is.
Is there a way to easily share that ID with all of them?
Right, so I'll admit this was one of the handwavy bits of my original
proposal, but I don't think it's unsolvable. You could do something like
(once, on application initialisation):

__u64 my_key = bpf_register_metadata_field(my_size); /* maybe add a name for introspection? */
bpf_map_update(&shared_application_config, &my_key_index, &my_key);

and then just get the key out of that map from all programs that want to
use it?
Passing it out of band works (whether it's via a pinned map like you
described, or through other means like a unix socket or some other
API), it's just more complicated.

Every consumer also needs to know about that API. That won't work with
standard tools. For example if we set a PACKET_ID KV, maybe we could
give it to pwru so it could track packets using it?
Without registering keys, we could pass it as a cli flag. With
registration, we'd have to have some helper to get the KV ID.

It also introduces ordering dependencies between the services on
startup, eg packet tracing hooks could only be attached once our XDP
service has registered a PACKET_ID KV, and they could query it's API.
quoted
We could combine such a registration API with your header format, so
that the registration just becomes a way of allocating one of the keys
from 0-63 (and the registry just becomes a global copy of the header).
This would basically amount to moving the "service config file" into the
kernel, since that seems to be the only common denominator we can rely
on between BPF applications (as all attempts to write a common daemon
for BPF management have shown).
That sounds reasonable. And I guess we'd have set() check the global
registry to enforce that the key has been registered beforehand?
quoted
-Toke
Thanks for all the feedback!
I like this 'fast' KV approach but I guess we should really evaluate its
impact on performances (especially for xdp) since, based on the kfunc calls
order in the ebpf program, we can have one or multiple memmove/memcpy for
each packet, right?

Moreover, I still think the metadata area in the xdp_frame/xdp_buff is not
so suitable for nic hw metadata since:
- it grows backward 
- it is probably in a different cacheline with respect to xdp_frame
- nic hw metadata will not start at fixed and immutable address, but it depends
  on the running ebpf program

What about having something like:
- fixed hw nic metadata: just after xdp_frame struct (or if you want at the end
  of the metadata area :)). Here he can reuse the same KV approach if it is fast
- user defined metadata: in the metadata area of the xdp_frame/xdp_buff

Regards,
Lorenzo

Attachments

Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help