[PATCH v1 bpf-next 0/2] bpf: bpf_redirect_peer egress redirection
From: Jordan Rife <hidden>
Date: 2026-06-13 18:34:34
We have several use cases where a pod injects traffic into the datapath
of another so that the traffic appears to have originated from the
target pod. One such use case is a synthetic flow generator which injects
synthetic traffic into a pod's datapath to enable dynamic probing and
debugging. Another is a transparent proxy where connections originating
from one pod are redirected towards another which proxies that
connection. The new connection is bound to the IP of the original pod
using IP_TRANSPARENT and its traffic is injected into that pod's
datapath and handled as if it had originated there. This can be used for
mTLS, etc.
We use bpf_redirect(BPF_F_INGRESS) to direct traffic leaving the proxy,
flow generator, etc. towards the target pod, ensuring that eBPF programs
that are meant to intercept traffic leaving that pod are executed.
However, this doesn't work with netkit.
With netkit, an ingress redirection from proxy to workload skips eBPF
programs that are meant to intercept traffic leaving the pod, since they
reside on the netkit peer device. One workaround is to attach the
same program to both the netkit peer device and the TCX ingress hook for
the netkit pair's primary interface, but
a) This seems hacky and we need to be careful not to run the same
program twice for the same skb in cases where we want to pass that
traffic to the host stack.
b) We're trying to keep the proxy redirection / traffic injection
systems as modular and as separate as possible from Cilium, the
system that manages netkit setup and core eBPF programming.
It would be handy if instead we could redirect traffic directly from
one netkit peer device to another. This patch proposes an extension
to bpf_redirect_peer to allow us to do just that.
With this patch, the BPF_F_INGRESS flag tells bpf_redirect_peer to emit
the skb in the egress direction of the target interface's peer device.
While the main use case is netkit, I suppose you could also use this
mode with veth if, e.g., there were some eBPF programs attached to that
side of the veth pair that needed to intercept traffic.
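
As a rough illustration of the flag semantics described above, here is a
small user-space model of how the helper's flags argument could map to a
delivery direction. It is a sketch, not the code from this series: the
enum, struct and function names are hypothetical, the constants are
mirrored from the UAPI headers so it builds as ordinary C, and it
assumes that any flag bit other than BPF_F_INGRESS is still rejected,
as all non-zero flags are without this change.

```c
#include <stdint.h>

/* Constants mirrored from the kernel UAPI headers (linux/bpf.h and
 * linux/pkt_cls.h) so the sketch builds as plain user-space C. */
#define BPF_F_INGRESS   (1ULL << 0)
#define TC_ACT_SHOT     2
#define TC_ACT_REDIRECT 7

/* Hypothetical names, for illustration only. */
enum peer_direction {
	PEER_DIR_NONE,    /* redirect rejected */
	PEER_DIR_INGRESS, /* existing mode: skb enters the peer's ingress path */
	PEER_DIR_EGRESS,  /* proposed mode: skb is emitted in the peer's egress direction */
};

struct peer_verdict {
	int action;              /* TC_ACT_* value returned to the tc hook */
	enum peer_direction dir; /* where the skb would be delivered */
};

/* Model of the flag handling only; no skb or device lookup is involved. */
static struct peer_verdict model_redirect_peer(uint64_t flags)
{
	struct peer_verdict v = { TC_ACT_SHOT, PEER_DIR_NONE };

	/* Assumption: any bit other than BPF_F_INGRESS is rejected. */
	if (flags & ~BPF_F_INGRESS)
		return v;

	v.action = TC_ACT_REDIRECT;
	v.dir = (flags & BPF_F_INGRESS) ? PEER_DIR_EGRESS : PEER_DIR_INGRESS;
	return v;
}
```

In this model, model_redirect_peer(0) keeps the existing
ingress-to-ingress behavior, while model_redirect_peer(BPF_F_INGRESS)
selects the new mode in which programs attached to the peer device see
the skb as traffic leaving the pod.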
+---------------------------------------------------------------------+
| +-------------------------+ 6. bpf_redirect_neigh(eth0) |
| | pod (10.244.0.10) | ------------------------ |
| | | | | |
| | +--------+ | | +---------+ | |
| | 1. packet -->| | | | | | | |
| | leaves ^ | netkit |<===========|======| netkit | | |
| | | | peer |=======(eBPF)=====>| primary | | |
| | | | | | | | | | |
| | | +--------+ | | +---------+ | |
| | | | | 2. bpf_redirect v |
| +-----------|-------------+ |___________________ +-------|
| | | | eth0 |
| | 5. bpf_redirect_peer(BPF_F_INGRESS) | +-------|
| |________________________ | |
| +-------------------------+ | | |
| | proxy (10.244.0.11) | | | |
| | IP_TRANSPARENT | | | |
| | +--------+ | | +---------+ | |
| | 3. packet <--| | | | | |<-- |
| | enters | netkit |<===========|======| netkit | |
| | [proxy] | peer |=======(eBPF)=====>| primary | |
| | 4. packet -->| | | | | |
| | leaves +--------+ | +---------+ |
| | sip=10.244.0.10 | |
| +-------------------------+ |
+---------------------------------------------------------------------+
Using the proxy use case as an example, in step 5 we would redirect
traffic leaving the proxy towards the pod's peer device using
bpf_redirect_peer(BPF_F_INGRESS).
As a bonus, since the skb doesn't have to go through the backlog queue,
it can take full advantage of netkit's performance benefits. I set up a
test where outgoing iperf3 traffic is injected into the datapath of
another pod using either bpf_redirect_peer(BPF_F_INGRESS) or
bpf_redirect(BPF_F_INGRESS). I used Cilium's eBPF host routing mode
which skips the host stack and uses BPF redirect helpers to do all the
routing.
(net.ipv4.tcp_congestion_control=cubic,mtu=1500,100GiB link,Cilium
eBPF host routing mode)
BASELINE [bpf_redirect(BPF_F_INGRESS)]
1. [iperf pod] ==bpf_redirect([pod b], BPF_F_INGRESS)==> [pod b]
2. [pod b] ==bpf_redirect_neigh([eth0])==> eth0
3. eth0 ==over network==> [host b]
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 231 GBytes 33.0 Gbits/sec 12060 sender
[ 5] 0.00-60.00 sec 230 GBytes 33.0 Gbits/sec receiver
TEST [bpf_redirect_peer(BPF_F_INGRESS)]
1. [iperf pod] ==bpf_redirect_peer([pod b], BPF_F_INGRESS)==> [pod b]
2. [pod b] ==bpf_redirect_neigh([eth0])==> eth0
3. eth0 ==over network==> [host b]
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 272 GBytes 38.9 Gbits/sec 0 sender
[ 5] 0.00-60.00 sec 272 GBytes 38.9 Gbits/sec receiver
In this test, using bpf_redirect_peer(BPF_F_INGRESS) for the hop from
[iperf pod] to [pod b] led to ~18% more throughput compared to
bpf_redirect(BPF_F_INGRESS).
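
For reference, the ~18% figure follows from the two sender bitrates
reported above (33.0 and 38.9 Gbits/sec). A minimal sketch of the
arithmetic, with a hypothetical helper name:

```c
/* Relative gain of `test` over `baseline`, in percent.
 * The helper name is illustrative only. */
static double relative_gain_pct(double baseline, double test)
{
	return (test - baseline) / baseline * 100.0;
}
```

Calling relative_gain_pct(33.0, 38.9) evaluates (38.9 - 33.0) / 33.0,
which is just under 18%.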
Note: I wasn't sure about the flag name. I can see where BPF_F_INGRESS
might be confusing, since technically it's an egress redirection
from the perspective of the peer device's namespace. But I didn't
want to add a BPF_F_EGRESS flag just for this and convinced myself
it makes sense, because from the perspective of the caller the skb
will be flowing towards the current namespace.
Jordan Rife (2):
bpf: Support BPF_F_INGRESS with bpf_redirect_peer
selftests/bpf: Add tests for bpf_redirect_peer with BPF_F_INGRESS
include/uapi/linux/bpf.h | 16 +++--
net/core/filter.c | 14 ++--
tools/include/uapi/linux/bpf.h | 16 +++--
.../selftests/bpf/prog_tests/tc_redirect.c | 68 +++++++++++++++++++
.../selftests/bpf/progs/test_tc_peer.c | 22 ++++++
5 files changed, 116 insertions(+), 20 deletions(-)
--
2.43.0