Re: [PATCH bpf-next v2 0/8] Support defragmenting IPv(4|6) packets in BPF

From: Daniel Xu <hidden>
Date: 2023-02-27 22:04:15
Also in: bpf, linux-doc, linux-kselftest, lkml

Hi Ed,

Thanks for giving this a look.

On Mon, Feb 27, 2023 at 08:38:41PM +0000, Edward Cree wrote:

On 27/02/2023 19:51, Daniel Xu wrote:

However, when policy is enforced through BPF, the prog is run before the
kernel reassembles fragmented packets. This leaves BPF developers in an
awkward place: implement reassembly (possibly poorly) or use a stateless
method as described above.

Just out of curiosity - what stops BPF progs using the middle ground of
 stateful validation?  I'm thinking of something like:
First-frag: run the usual checks on L4 headers etc, if we PASS then save
 IPID and maybe expected next frag-offset into a map.  But don't try to
 stash the packet contents anywhere for later reassembly, just PASS it.
Subsequent frags: look up the IPID in the map.  If we find it, validate
 and update the frag-offset in the map; if this is the last fragment then
 delete the map entry.  If the frag-offset was bogus or the IPID wasn't
 found in the map, DROP; otherwise PASS.
(If re-ordering is prevalent then use something more sophisticated than
 just expected next frag-offset, but the principle is the same. And of
 course you might want to put in timers for expiry etc.)
So this avoids the need to stash the packet data and modify/consume SKBs,
 because you're not actually doing reassembly; the down-side is that the
 BPF program can't so easily make decisions about the application-layer
 contents of the fragmented datagram, but for the common case (we just
 care about the 5-tuple) it's simple enough.
But I haven't actually tried it, so maybe there's some obvious reason why
 it can't work this way.
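
For concreteness, the scheme described above can be modelled in userspace
roughly as follows. This is only a sketch under stated assumptions, not
code from the patch series: the names Frag, FragTracker and l4_policy are
made up for illustration, the map is keyed on (src, dst, proto, IPID)
rather than IPID alone, and expiry timers and re-ordering are left out.

```python
from dataclasses import dataclass

PASS, DROP = "PASS", "DROP"


@dataclass(frozen=True)
class Frag:
    """Hypothetical view of one IP fragment (not a kernel structure)."""
    src: str
    dst: str
    proto: int
    ipid: int
    offset: int       # byte offset of this fragment's payload
    length: int       # payload bytes carried by this fragment
    more_frags: bool  # state of the IP "more fragments" flag


class FragTracker:
    """Tracks only the next expected offset; never stores packet data."""

    def __init__(self, l4_policy):
        # l4_policy is an assumed callback: Frag -> bool (True means allow).
        self.l4_policy = l4_policy
        self.expected = {}  # flow key -> next expected byte offset

    def handle(self, frag):
        key = (frag.src, frag.dst, frag.proto, frag.ipid)
        if frag.offset == 0:
            # First fragment (or an unfragmented packet): run the L4 checks.
            if not self.l4_policy(frag):
                return DROP
            if frag.more_frags:
                self.expected[key] = frag.length
            return PASS
        # Subsequent fragment: must match the state left by earlier ones.
        want = self.expected.get(key)
        if want is None or frag.offset != want:
            return DROP  # unknown IPID or bogus offset; entry is left as-is
        if frag.more_frags:
            self.expected[key] = want + frag.length
        else:
            del self.expected[key]  # last fragment: forget the datagram
        return PASS
```

Note that in this model any fragment arriving out of order is dropped,
which is the limitation the re-ordering remark above refers to.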

I don't believe full L4 headers are required in the first fragment.
Sufficiently sneaky attackers can, I think, send a byte at a time to
subvert your proposed algorithm. Storing skb data seems inevitable here.
Someone can correct me if I'm wrong here.
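
To put rough numbers on this: as far as I know, IPv4 fragment offsets are
counted in 8-byte units, so a non-final fragment can carry as little as
8 bytes of payload, while a TCP header without options is 20 bytes and
its flags live in byte 13. The sketch below only illustrates that
arithmetic; the helper names are made up for this example.

```python
# Header sizes in bytes for the two protocols considered here.
TCP_MIN_HEADER = 20   # TCP header without options
UDP_HEADER = 8
TCP_FLAGS_OFFSET = 13  # zero-based byte index of the TCP flags field

IPPROTO_TCP = 6
IPPROTO_UDP = 17


def l4_header_complete(proto, first_frag_payload_len):
    """True if the first fragment holds the whole (minimal) L4 header."""
    need = {IPPROTO_TCP: TCP_MIN_HEADER, IPPROTO_UDP: UDP_HEADER}.get(proto)
    if need is None:
        raise ValueError("protocol not covered by this sketch")
    return first_frag_payload_len >= need


def tcp_flags_visible(first_frag_payload_len):
    """True if the TCP flags byte falls inside the first fragment."""
    return first_frag_payload_len > TCP_FLAGS_OFFSET
```

With an 8-byte first fragment, l4_header_complete(IPPROTO_TCP, 8) is
False and tcp_flags_visible(8) is False: the ports are there, but the
flags are not, so a check that only looks at the first fragment cannot
see them.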

Reordering like you mentioned is another attack vector. Perhaps there
are more sophisticated semi-stateful algorithms that can solve the
problem, but it leads me to my next point.

A semi-stateful method like the one you are proposing concerns me from a
reliability and correctness standpoint. Such a method can suffer from
impedance mismatches with the rest of the system. For example, whatever
map sizes you choose should probably be aligned with the sysfs conntrack
values; otherwise you may get some very interesting and unexpected packet
drops. I think Cilium had a talk a while ago about debugging a related
conntrack issue. Furthermore, the debugging and troubleshooting
facilities will be different (counters, logs, etc.).

Unless someone has had lots of experience writing an IP stack from
the ground up, I suspect there are quite a few more unknown unknowns
here. What I find valuable about this patch series is that we can
leverage the well-understood and battle-hardened kernel facilities. That
lets us avoid all the correctness and security issues that the kernel has
spent 20+ years fixing, and makes it trivial for the next person that
comes along to do the right thing.

Hopefully this all makes sense.

Thanks,
Daniel
