Thread: 39 messages, 6 authors, 2025-08-13

Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets

From: Lorenzo Bianconi <lorenzo@kernel.org>
Date: 2025-07-09 09:31:27
Also in: bpf

On Jul 07, Stanislav Fomichev wrote:
On 07/03, Jesper Dangaard Brouer wrote:

On 02/07/2025 18.05, Stanislav Fomichev wrote:
On 07/02, Jesper Dangaard Brouer wrote:
This patch series introduces a mechanism for an XDP program to store RX
metadata hints - specifically rx_hash, rx_vlan_tag, and rx_timestamp -
into the xdp_frame. These stored hints are then used to populate the
corresponding fields in the SKB that is created from the xdp_frame
following an XDP_REDIRECT.

The chosen RX metadata hints intentionally map to the existing NIC
hardware metadata that can be read via kfuncs [1]. While this design
allows a BPF program to read and propagate existing hardware hints, our
primary motivation is to enable setting custom values. This is important
for use cases where the hardware-provided information is insufficient or
needs to be calculated based on packet contents unavailable to the
hardware.

The primary motivation for this feature is to enable scalable load
balancing of encapsulated tunnel traffic at the XDP layer. When tunnelled
packets (e.g., IPsec, GRE) are redirected via cpumap or to a veth device,
the networking stack later calculates a software hash based on the outer
headers. For a single tunnel, these outer headers are often identical,
causing all packets to be assigned the same hash. This collapses all
traffic onto a single RX queue, creating a performance bottleneck and
defeating receive-side scaling (RSS).
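
To illustrate the collapse described above, here is a minimal userspace
sketch in C. It is not kernel code: the struct layout, the toy_* names, the
FNV-1a hash, and the modulo queue selection are all assumptions made for
illustration, standing in for the real flow dissector and queue selection
logic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flow key; field names are illustrative only. */
struct flow_key {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t sport;
	uint16_t dport;
	uint8_t  proto;
};

/* Hypothetical tunnelled packet: outer headers plus the inner flow. */
struct tunnel_pkt {
	struct flow_key outer;
	struct flow_key inner;
};

/* Feed the low nbytes of val into an FNV-1a state, low byte first. */
uint32_t fnv1a_step(uint32_t h, uint32_t val, unsigned int nbytes)
{
	for (unsigned int i = 0; i < nbytes; i++) {
		h ^= (val >> (8 * i)) & 0xff;
		h *= 16777619u;		/* 32-bit FNV prime */
	}
	return h;
}

/* Stand-in for a software flow hash (NOT the kernel's algorithm). */
uint32_t toy_flow_hash(const struct flow_key *k)
{
	uint32_t h = 2166136261u;	/* 32-bit FNV offset basis */

	h = fnv1a_step(h, k->saddr, 4);
	h = fnv1a_step(h, k->daddr, 4);
	h = fnv1a_step(h, k->sport, 2);
	h = fnv1a_step(h, k->dport, 2);
	h = fnv1a_step(h, k->proto, 1);
	return h;
}

/* Stand-in for queue selection: map a hash onto nr_queues queues. */
unsigned int toy_pick_queue(uint32_t hash, unsigned int nr_queues)
{
	assert(nr_queues > 0);
	return hash % nr_queues;
}
```

Two packets from the same tunnel carry identical outer keys, so
toy_flow_hash() on the outer headers yields the same value and
toy_pick_queue() returns the same queue for both, whatever their inner
flows are. Hashing the inner keys instead gives the distinct inner flows a
chance to spread across queues.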

Our immediate use case involves load balancing IPsec traffic. For such
tunnelled traffic, any hardware-provided RX hash is calculated on the
outer headers and is therefore incorrect for distributing inner flows.
There is no reason to read the existing value, as it must be recalculated.
In our XDP program, we perform a partial decryption to access the inner
headers and calculate a new load-balancing hash, which provides better
flow distribution. However, without this patch set, there is no way to
persist this new hash for the network stack to use post-redirect.

This series solves the problem by introducing new BPF kfuncs that allow an
XDP program to write e.g. the hash value into the xdp_frame. The
__xdp_build_skb_from_frame() function is modified to use this stored value
to set skb->hash on the newly created SKB. As a result, the veth driver's
queue selection logic uses the BPF-supplied hash, achieving proper
traffic distribution across multiple CPU cores. This also ensures that
consumers, like the GRO engine, can operate effectively.
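
The store-then-consume flow described above can be sketched in plain
userspace C. This is only a model under stated assumptions: toy_xdp_frame,
toy_skb, toy_store_rx_hash() and toy_build_skb_from_frame() are
hypothetical names, and they are not the kfuncs or struct layouts added by
the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the hint storage in struct xdp_frame. */
struct toy_xdp_frame {
	uint32_t rx_hash;
	bool     rx_hash_valid;	/* set once the XDP program stored a hash */
};

/* Hypothetical stand-in for the relevant part of struct sk_buff. */
struct toy_skb {
	uint32_t hash;
	bool     hash_set;
};

/* Models the XDP program storing a hash it computed itself. */
void toy_store_rx_hash(struct toy_xdp_frame *frame, uint32_t hash)
{
	assert(frame != NULL);
	frame->rx_hash = hash;
	frame->rx_hash_valid = true;
}

/* Models SKB creation after redirect: consume the stored hash, if any. */
void toy_build_skb_from_frame(const struct toy_xdp_frame *frame,
			      struct toy_skb *skb)
{
	assert(frame != NULL && skb != NULL);
	memset(skb, 0, sizeof(*skb));
	if (frame->rx_hash_valid) {
		skb->hash = frame->rx_hash;
		skb->hash_set = true;
	}
}
```

When no hash was stored, the model leaves the SKB hash unset, which
corresponds to the stack later computing a software hash on its own. When
a hash was stored, the SKB carries the BPF-supplied value and later
consumers can use it directly.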

We considered XDP traits as an alternative to adding static members to
struct xdp_frame. Given the immediate need for this functionality and the
current development status of traits, we believe this approach is a
pragmatic solution. We are open to migrating to a traits-based
implementation if and when they become a generally accepted mechanism for
such extensions.

[1] https://docs.kernel.org/networking/xdp-rx-metadata.html
---
V1: https://lore.kernel.org/all/174897271826.1677018.9096866882347745168.stgit@firesoul/
No change log?
We have fixed the selftest as requested by Alexei.
And we have updated the cover letter and documentation as you, Stanislav, requested.
Btw, any feedback on the following from v1?
- https://lore.kernel.org/netdev/aFHUd98juIU4Rr9J@mini-arch/
Addressed by the updated cover letter and documentation. I hope this helps
reviewers understand the use case, as the discussion turned into "how do we
transfer all HW metadata", which is NOT what we want (and a waste of
precious cycles).

For our use case, it doesn't make sense to "transfer all HW metadata".
In fact, we don't even want to read the hardware RX-hash, because we already
know it is wrong (for tunnels); we just want to override the RX-hash used at
SKB creation.  We do want to give BPF programmers the flexibility to call
these kfuncs individually (when relevant).
- https://lore.kernel.org/netdev/20250616145523.63bd2577@kernel.org/
I feel pressured into critiquing Jakub's suggestion; I hope this is not too
harsh.  First of all, it is not relevant to this patchset's use case, as it
focuses on all HW metadata.
[..]
Second, I disagree with the idea/mental model of storing in a
"driver-specific format". The current implementation of driver-specific
kfunc helpers that "get the metadata" is already doing a conversion to a
common format, because the BPF programmer naturally needs this to be the
same across drivers.  Thus, it doesn't make sense to store it back in a
"driver-specific format", as that just complicates things.  My mental model
is thus that, after the driver-specific "get" operation, the result is in a
common format, which is simply defined by the struct type of the kfunc and
is known to both the kernel and the BPF program.
Having get/set model seems a bit more generic, no? Potentially giving us the
ability to "correct" HW metadata for the non-redirected cases as well.
Plus we don't hard-code the (internal) layout. Solving only xdp_redirect
seems a bit too narrow, idk..
I can't see what the non-redirected use-case could be. Can you please provide
more details?
Moreover, can it be solved without storing the rx_hash (or the other
hw-metadata) in a non-driver specific format?
Storing the hw-metadata in some hw-specific format in the xdp_frame will not
allow us to consume it directly when building the skb, and we will need to
decode it again. What is the upside/use case of this approach (not
considering the orthogonality with the get method)?

Regards,
Lorenzo
