Re: [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT

From: Arthur Fabre <hidden>
Date: 2024-09-26 11:31:47
Also in: bpf, intel-wired-lan

On Sun, Sep 22, 2024 at 1:12 PM Toke Høiland-Jørgensen [off-list ref] wrote:

FYI, we also had a discussion related to this at LPC on Friday, in this
session: https://lpc.events/event/18/contributions/1935/

The context here was that Arthur and Jakub want to also support extended
rich metadata all the way through the SKB path, and are looking at the
same area used for XDP metadata to store it. So there's a need to manage
both the kernel's own usage of that area, and userspace/BPF usage of it.

I'll try to summarise some of the points of that discussion (all
interpretations are my own, of course):

- We want something that can be carried with a frame all the way from
  the XDP layer, through all SKB layers and to userspace (to replace the
  use of skb->mark for this purpose).

- We want different applications running on the system (of which the
  kernel itself if one, cf this discussion) to be able to share this
  field, without having to have an out of band registry (like a Github
  repository where applications can agree on which bits to use). Which
  probably means that the kernel needs to be in the loop somehow to
  explicitly allocate space in the metadata area and track offsets.

- Having an explicit API to access this from userspace, without having
  to go through BPF (i.e., a socket- or CMSG-based API) would be useful.

Thanks for looping us in, and the great summary Toke!

The TLV format was one of the suggestions in Arthur and Jakub's talk,
but AFAICT, there was not a lot of enthusiasm about this in the room
(myself included), because of the parsing overhead and complexity. I
believe the alternative that was seen as most favourable was a map
lookup-style API, where applications can request a metadata area of
arbitrary size and get an ID assigned that they can then use to set/get
values in the data path.

So, sketching this out, this could be realised by something like:

/* could be called from BPF, or through netlink or sysfs; may fail, if
 * there is no more space
 */
int metadata_id = register_packet_metadata_field(sizeof(struct my_meta));

The ID is just an opaque identifier that can then be passed to
getter/setter functions (for both SKB and XDP), like:

ret = bpf_set_packet_metadata_field(pkt, metadata_id,
                                    &my_meta_value, sizeof(my_meta_value))

ret = bpf_get_packet_metadata_field(pkt, metadata_id,
                                    &my_meta_value, sizeof(my_meta_value))


On the kernel side, the implementation would track registered fields in
a global structure somewhere, say:

struct pkt_metadata_entry {
  int id;
  u8 sz;
  u8 offset;
  u8 bit;
};

struct pkt_metadata_registry { /* allocated as a system-wide global */
  u8 num_entries;
  u8 total_size;
  struct pkt_metadata_entry entries[MAX_ENTRIES];
};

struct xdp_rx_meta { /* at then end of xdp_frame */
  u8 sz; /* set to pkt_metadata_registry->total_size on alloc */
  u8 fields_set; /* bitmap of fields that have been set, see below */
  u8 data[];
};

int register_packet_metadata_field(u8 size) {
  struct pkt_metadata_registry *reg = get_global_registry();
  struct pkt_metadata_entry *entry;

  if (size + reg->total_size > MAX_METADATA_SIZE)
    return -ENOSPC;

  entry = &reg->entries[reg->num_entries++];
  entry->id = assign_id();
  entry->sz = size;
  entry->offset = reg->total_size;
  entry->bit = reg->num_entries - 1;
  reg->total_size += size;

  return entry->id;
}

int bpf_set_packet_metadata_field(struct xdp_frame *frm, int id, void
                                  *value, size_t sz)
{
  struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);

  if (!entry)
    return -ENOENT;

  if (entry->sz != sz)
    return -EINVAL; /* user error */

  if (frm->rx_meta.sz < entry->offset + sz)
    return -EFAULT; /* entry allocated after xdp_frame was initialised */

  memcpy(&frm->rx_meta.data + entry->offset, value, sz);
  frm->rx_meta.fields_set |= BIT(entry->bit);

  return 0;
}

int bpf_get_packet_metadata_field(struct xdp_frame *frm, int id, void
                                  *value, size_t sz)
{
  struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);

  if (!entry)
    return -ENOENT;

  if (entry->sz != sz)
    return -EINVAL;

if (frm->rx_meta.sz < entry->offset + sz)
    return -EFAULT; /* entry allocated after xdp_frame was initialised */

  if (!(frm->rx_meta.fields_set & BIT(entry->bit)))
    return -ENOENT;

  memcpy(value, &frm->rx_meta.data + entry->offset, sz);

  return 0;
}

I'm hinting at some complications here (with the EFAULT return) that
needs to be resolved: there is no guarantee that a given packet will be
in sync with the current status of the registered metadata, so we need
explicit checks for this. If metadata entries are de-registered again
this also means dealing with holes and/or reshuffling the metadata
layout to reuse the released space (incidentally, this is the one place
where a TLV format would have advantages).

The nice thing about an API like this, though, is that it's extensible,
and the kernel itself can be just another consumer of it for the
metadata fields Lorenzo is adding in this series. I.e., we could just
pre-define some IDs for metadata vlan, timestamp etc, and use the same
functions as above from within the kernel to set and get those values;
using the registry, there could even be an option to turn those off if
an application wants more space for its own usage. Or, alternatively, we
could keep the kernel-internal IDs hardcoded and always allocated, and
just use the getter/setter functions as the BPF API for accessing them.

That's exactly what I'm thinking of too, a simple API like:

get(u8 key, u8 len, void *val);
set(u8 key, u8 len, void *val);

With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata.

If a NIC doesn't support a certain well-known metadata, the key
wouldn't be set, and get() would return ENOENT.

I think this also lets us avoid having to "register" keys or bits of
metadata with the kernel.
We'd reserve some number of keys for hardware metadata.

The remaining keys would be up to users. They'd have to allocate keys
to services, and configure services to use those keys.
This is similar to the way listening on a certain port works: only one
service can use port 80 or 443, and that can typically beconfigured in
a service's config file.

This side-steps the whole question of how to change the registered
metadata for in-flight packets, and how to deal with different NICs
with different hardware metadata.

I think I've figured out a suitable encoding format, hopefully we'll have an
RFC soon!

-Toke

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help