Re: [PATCH bpf-next 0/7] bpf: Propagate cn to TCP
From: Alexei Starovoitov <hidden>
Date: 2019-03-24 16:19:24
On Sat, Mar 23, 2019 at 10:36:24PM -0700, Eric Dumazet wrote:
On 03/23/2019 08:41 AM, Alexei Starovoitov wrote:quoted
On Sat, Mar 23, 2019 at 02:12:39AM -0700, Eric Dumazet wrote:quoted
On 03/23/2019 01:05 AM, brakmo wrote:quoted
This patchset adds support for propagating congestion notifications (cn) to TCP from cgroup inet skb egress BPF programs. Current cgroup skb BPF programs cannot trigger TCP congestion window reductions, even when they drop a packet. This patch-set adds support for cgroup skb BPF programs to send congestion notifications in the return value when the packets are TCP packets. Rather than the current 1 for keeping the packet and 0 for dropping it, they can now return: NET_XMIT_SUCCESS (0) - continue with packet output NET_XMIT_DROP (1) - drop packet and do cn NET_XMIT_CN (2) - continue with packet output and do cn -EPERM - drop packetI believe I already mentioned this model is broken, if you have any virtual device before the cgroup BPF program. Please think about offloading the pacing/throttling in the NIC, there is no way we will report back to tcp stack instant notifications.I don't think 'offload to google proprietary nic' is a suggestion that folks can practically follow. Very few NICs can offload pacing to hw and there are plenty of limitations. This patch set represents a pure sw solution that works and scales to millions of flows.quoted
This patch series is going way too far for my taste.I would really appreciate if you can do a technical review of the patches. Our previous approach didn't quite work due to complexity around locked/non-locked socket. This is a cleaner approach. Either we go with this one or will add a bpf hook into __tcp_transmit_skb. This approach is better since it works for other protocols and can be used by qdiscs w/o any bpf.quoted
This idea is not new, you were at Google when it was experimented by Nandita and others, and we know it is not worth the pain.google networking needs are different from the rest of the world.This has nothing to do with Google against Facebook really, it is a bit sad you react like this Alexei. We just upstreamed bpf_skb_ecn_set_ce(), so I doubt you already have numbers to show that this strategy is not enough. All recent discussions about ECN (TCP Prague and SCE) do not _require_ instant feedback to the sender. Please show us experimental results before we have to carry these huge hacks.
I have a feeling we're looking at different patches.
All of your comments seems to be talking about something else.
I have a hard time connecting them to this patch set.
Have you read the cover letter and patches of _this_ set?
The results are in the cover letter and there is no change in behavior in networking stack.
Patches 1, 2, 3 are simple refactoring of bpf-cgroup return codes.
Patch 4 is the only one that touches net/ipv4/ip_output.c only to pass
the return code to tcp/udp layer.
The concept works for both despite of what you're claiming being tcp only.
Cover letter also explains why bpf_skb_ecn_set_ce is not enough.
Please realize that existing qdiscs already doing this.
The patchset allows bpf-cgroup to do the same.
If you were arguing about sysctl knob in patch 5 that I would understand.
Instead of a knob we can hard code the value for now. But please explain
what issues you see with that knob.
Also to be fair I applied your
commit f11216b24219 ("bpf: add skb->tstamp r/w access from tc clsact and cg skb progs")
without demanding 'experimental results' because the feature makes sense.
Yet, you folks didn't produce a _single_ example or test result since then.
This is not cool.