Re: [RFC net-next 08/15] ipxlat: add translation engine and dispatch core

From: Ralf Lici <hidden>
Date: 2026-06-13 13:24:48
Also in: lkml

On Wed, 10 Jun 2026 13:14:47 +0200, Toke Høiland-Jørgensen [off-list ref] wrote:

Ralf Lici [off-list ref] writes:

quoted

Hi Toke,

On Thu, 04 Jun 2026 20:23:51 +0200, Toke Høiland-Jørgensen [off-list ref] wrote:

quoted

Ralf Lici [off-list ref] writes:

quoted

This commit introduces the core start_xmit processing flow: validate,
select action, translate, and forward. It centralizes action resolution
in the dispatch layer and keeps per-direction translation logic separate
from device glue. The result is a single data-path entry point with
explicit control over drop/forward/emit behavior.

Signed-off-by: Ralf Lici <redacted>

This is very cool! Going quickly through the series, this seems like
thorough work that will be cool to have available in the kernel, so
thanks for doing this! I'll be quite happy to retire my barebones
BPF-based implementation once this lands :)

Thanks, glad to hear this looks useful. I have not had much time to work
on ipxlat lately, but I hope to respin the RFC soon.

quoted

One comment on the device model below (which is also why I chose this
patch to reply to):

quoted

+static void ipxlat_forward_pkt(struct ipxlat_priv *ipxlat, struct sk_buff *skb)
+{
+	const unsigned int len = skb->len;
+	int err;
+
+	/* reinject as a fresh packet with scrubbed metadata */
+	skb_set_queue_mapping(skb, 0);
+	skb_scrub_packet(skb, false);
+
+	err = gro_cells_receive(&ipxlat->gro_cells, skb);

So given that you're not resetting skb->dev here, IIUC, this means that
the translated packet will magically re-appear as if it arrived on the
interface it first came in on, right?

That seems... a bit too magical? Sending a packet to one device making
it suddenly reappear on a different, unrelated, device seems like it
will just create confusion. It's like the ipxlat device can't really
decide if it's a device or a tunnel? :)

That's not quite what happens in the routed xmit path. There the stack
sets skb->dev to the selected output device before handing the skb to
the device. For IPv4 and IPv6 this happens in ip_output/ip6_output,
where the output device is taken from the skb dst. So when the route
selects the ipxlat device, the skb reaches ndo_start_xmit with skb->dev
already pointing at the ipxlat device, not at the original ingress
device.

The internal 4-to-6 pre-fragmentation path should preserve the same
property as well: ip_do_fragment copies the skb metadata to the
generated fragments, including skb->dev, and the temporary dst used for
that path also points at the ipxlat device. The fragment callback then
feeds those fragments back into the same ipxlat processing path.

That said, I agree that relying on this implicitly is not great.
gro_cells_receive uses skb->dev directly, and the intended receive-side
re-injection model should be obvious at the call site. I will set
skb->dev = ipxlat->dev explicitly before gro_cells_receive in the next
version.

Right, sounds good. I'm also wondering if you actually need the gro_cells
infrastructure at all? IIUC, the purpose of that is to allow tunnels to
create GRO superframes of packets after they are decapsulated (and thus
their l4 commonality becomes apparent). But you're not decapsulating
anything, you're just translating between protocols the kernel already
understands. So presumably any opportunity to coalesce GRO packets would
already have happened pre-translation? So any reason why you can't just
do what loopback.c does, and do a straight __netif_rx() call in the
transmit function?

No, I think you're right that gro_cells is not justified here; I was
probably biased by my work on tunnel interfaces. Unlike a tunnel decap
path, ipxlat translation does not reveal a new same-family L4 flow, so
I don't see a translation-specific GRO opportunity there, and a
loopback-style receive handoff would be the simpler version of that
design.

That said, after thinking more about the rest of your feedback, I think
the right fix is probably not just replacing gro_cells with __netif_rx.
The deeper issue is the netdevice/RX-reinjection model itself.

quoted

I think a better model is to treat the device as basically a loopback
device that translates packets before looping them back (so when they
come back they appear to be coming from that device).

Any reason why that wouldn't work?

That's indeed the intended model for the ipxlat netdevice: route packets
to it, translate them, then loop them back into the stack as packets
received from that same device. That seemed like the simplest model and
the one that exposes the translation point most clearly.

Right. I think this could be made a bit more explicit in the
documentation as well, since it's a bit of an unusual model.

And, well, taking a step back: is it really the right model? Regular NAT
lives in netfilter, why can't this be a netfilter module as well? Seems
to me you could have something like:

table ip xlat4 {
	chain postrouting {
		type nat hook postrouting priority srcnat; policy accept;
		ip daddr 0.0.0.0/0 oifname "eth0" xlat to 64:ff9b::/96
	}
}
table ip6 xlat6 {
	chain prerouting {
		type nat hook prerouting priority dstnat; policy accept;
		ip6 saddr 64:ff9b::/96 iifname "eth0" xlat from 64:ff9b::/96
	}
}

and that would provide the functionality without having to implement a
new interface type and the associated multiple traversals through the
stack? Did you consider this as an alternative to the new device type?

We did consider netfilter, and your example is syntactically attractive,
but I am no longer convinced it is the cleanest model for SIIT.

An nft expression cannot simply rewrite ETH_P_IP <-> ETH_P_IPV6 and
return ACCEPT as if this were normal NAT because the current hook
invocation, dst, and conntrack-related state were established for the
packet as it entered that hook. A cross-family translator would need to
consume the skb, clear or rebuild route and ct metadata as appropriate,
do an other-family route lookup, and resume at a well-defined point in
that family. That seems possible, but it would be a new stateless
cross-family action, not just a new mode of the existing nft nat
expression (which is built around nf_nat_setup_info and assumes the
packet's L3 family does not change AFAICT).

My second concern is that the SIIT boundary would be a property of rule
and hook placement. That gives flexibility, but it also means the
translation point has to be constrained and documented very carefully to
avoid ambiguous TTL/Hop Limit, PMTU/ICMP, and hook-order behavior. For
this use case I would rather have the route that matches the translation
prefix also be the object that says: leave this family here and continue
in the other one.

After looking at the available kernel mechanisms again, I think the
better model is probably LWT: routes carry an ipxlat encap referencing a
named translator domain configured over netlink. That should represent
the stateless, prefix-based and symmetric nature of ipxlat.
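
To make "stateless, prefix-based and symmetric" concrete, here is a
minimal userspace sketch of the address mapping for a /96 prefix such as
64:ff9b::/96, following the RFC 6052 layout where the IPv4 address sits
in the last 32 bits. The helper names xlat_4to6 and xlat_6to4 are
hypothetical and not taken from the ipxlat patches; this only
illustrates the mapping, not the packet translation itself.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Hypothetical helpers for illustration only; not the ipxlat API. */

/* 4 -> 6: copy the 96 prefix bits, then the IPv4 address into bits 96..127. */
void xlat_4to6(const struct in6_addr *prefix96, const struct in_addr *v4,
	       struct in6_addr *out)
{
	memcpy(out->s6_addr, prefix96->s6_addr, 12);
	/* s_addr is already in network byte order, so a plain copy is enough */
	memcpy(out->s6_addr + 12, &v4->s_addr, 4);
}

/*
 * 6 -> 4: the inverse mapping. Returns 0 and fills *v4 when addr lies
 * inside prefix96, or -1 when it does not (nothing to translate).
 */
int xlat_6to4(const struct in6_addr *prefix96, const struct in6_addr *addr,
	      struct in_addr *v4)
{
	if (memcmp(addr->s6_addr, prefix96->s6_addr, 12) != 0)
		return -1;
	memcpy(&v4->s_addr, addr->s6_addr + 12, 4);
	return 0;
}
```

With prefix 64:ff9b::, 192.0.2.33 maps to 64:ff9b::c000:221 and back. No
per-flow state is involved: both directions are pure functions of the
address and the configured prefix.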

Very roughly, userspace could look like:

    ip xlat add siit0 prefix6 64:ff9b::/96
    ip route add ... encap ipxlat id siit0
    ip -6 route add ... encap ipxlat id siit0

There are some useful precedents for this: ILA is stateless address
translation as LWT, seg6_local already has cross-family LWT actions, and
ioam6 has a similar split between separately configured objects and
route attachments.

The invariant I would like v2 to follow is that the original-family
route lookup selects translation as its terminal route action. The
translated skb then gets a fresh lookup in the other family. From that
point on, TTL/Hop Limit where applicable, PMTU, ICMP errors, and
netfilter visibility belong to the translated family.
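
As a toy userspace model of that invariant (not kernel code; every name
below is hypothetical, and the "drop on a second xlat route" policy is
an assumption of this sketch rather than something settled in this
thread), the control flow could look like:

```c
#include <stdbool.h>
#include <stddef.h>

enum toy_family { TOY_IPV4, TOY_IPV6 };
enum toy_action { TOY_FORWARD, TOY_XLAT };

struct toy_route {
	enum toy_action action;
	const char *oif;		/* output interface name, NULL for xlat */
};

struct toy_pkt {
	enum toy_family family;
	const struct toy_route *dst;	/* cached route, loosely like the skb dst */
	bool has_ct;			/* conntrack state from the original family */
	bool translated;
};

typedef const struct toy_route *(*toy_lookup_fn)(enum toy_family family);

/* Returns the route the packet finally leaves on, or NULL if it is dropped. */
const struct toy_route *toy_dispatch(struct toy_pkt *pkt, toy_lookup_fn lookup)
{
	/* 1. original-family lookup */
	pkt->dst = lookup(pkt->family);
	if (!pkt->dst || pkt->dst->action != TOY_XLAT)
		return pkt->dst;

	/* 2. translation is the terminal action: leave this family here */
	pkt->family = (pkt->family == TOY_IPV4) ? TOY_IPV6 : TOY_IPV4;
	pkt->dst = NULL;		/* old route is meaningless now */
	pkt->has_ct = false;		/* so is original-family ct state */
	pkt->translated = true;

	/* 3. fresh lookup in the other family */
	pkt->dst = lookup(pkt->family);

	/* Assumption of this sketch: never translate twice, drop instead. */
	if (pkt->dst && pkt->dst->action == TOY_XLAT)
		pkt->dst = NULL;
	return pkt->dst;
}

/* Example table: all IPv4 goes to the translator, all IPv6 out of eth0. */
static const struct toy_route toy_xlat_route = { TOY_XLAT, NULL };
static const struct toy_route toy_eth0_route = { TOY_FORWARD, "eth0" };

const struct toy_route *toy_example_lookup(enum toy_family family)
{
	return family == TOY_IPV4 ? &toy_xlat_route : &toy_eth0_route;
}
```

An IPv4 packet dispatched with toy_example_lookup comes out as an IPv6
packet routed via eth0 with its route and ct state rebuilt from scratch,
while an IPv6 packet is forwarded untouched.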

So I think your question addresses the core design issue in this RFC. My
current preference is to rework the next version around an LWT/domain
model instead of the virtual netdevice model, unless prototyping shows a
fundamental problem with that approach.

Does that model make sense to you?

Thanks for pushing on this.

-- 
Ralf Lici
Mandelbit Srl
