Re: Is bug 200755 in anyone's queue??
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: 2019-09-03 17:56:20
Possibly related (same subject, not in this thread)
- 2019-08-23 · RE: Is bug 200755 in anyone's queue?? · Steve Zabele <hidden>
On Fri, Aug 30, 2019 at 4:30 PM Willem de Bruijn [off-list ref] wrote:
On Fri, Aug 30, 2019 at 4:54 AM Eric Dumazet [off-list ref] wrote:quoted
On 8/29/19 9:26 PM, Willem de Bruijn wrote:quoted
SO_REUSEPORT was not intended to be used in this way. Opening multiple connected sockets with the same local port. But since the interface allowed connect after joining a group, and that is being used, I guess that point is moot. Still, I'm a bit surprised that it ever worked as described. Also note that the default distribution algorithm is not round robin assignment, but hash based. So multiple consecutive datagrams arriving at the same socket is not unexpected. I suspect that this quick hack might "work". It seemed to on the supplied .c file: score = compute_score(sk, net, saddr, sport, daddr, hnum, dif, sdif); if (score > badness) { - if (sk->sk_reuseport) { + if (sk->sk_reuseport && !sk->sk_state != TCP_ESTABLISHED) {This won't work for a mix of connected and connectionless sockets, of course (even ignoring the typo), as it only skips reuseport on the connected sockets.quoted
quoted
But a more robust approach, that also works on existing kernels, is to swap the default distribution algorithm with a custom BPF based one ( SO_ATTACH_REUSEPORT_EBPF).Yes, I suspect that reuseport could still be used by to load-balance incoming packets targetting the same 4-tuple. So all sockets would have the same score, and we would select the first socket in the list (if not applying reuseport hashing)Can you elaborate a bit? One option I see is to record in struct sock_reuseport if any port in the group is connected and, if so, don't return immediately on the first reuseport_select_sock hit, but continue the search for a higher scoring connected socket. Or do return immediately, but do this refined search in reuseport_select_sock itself, as it has a reference to all sockets in the group in sock_reuseport->socks[]. Instead of the straightforward hash.
That won't work, as reuseport_select_sock does not have access to
protocol specific data, notably inet_dport.
Unfortunately, what I've come up with so far is not concise and slows
down existing reuseport lookup in a busy port table slot. Note that it
is needed for both ipv4 and ipv6.
Do not break out of the port table slot early, but continue to search
for a higher scored match even after matching a reuseport:
"
@@ -413,28 +413,39 @@ static struct sock *udp4_lib_lookup2(struct net *net,
struct udp_hslot *hslot2,
struct sk_buff *skb)
{
+ struct sock *reuseport_result = NULL;
struct sock *sk, *result;
+ int reuseport_score = 0;
int score, badness;
u32 hash = 0;
result = NULL;
badness = 0;
udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
score = compute_score(sk, net, saddr, sport,
daddr, hnum, dif, sdif);
if (score > badness) {
- if (sk->sk_reuseport) {
+ if (sk->sk_reuseport &&
+ sk->sk_state != TCP_ESTABLISHED &&
+ !reuseport_result) {
hash = udp_ehashfn(net, daddr, hnum,
saddr, sport);
- result = reuseport_select_sock(sk, hash, skb,
+ reuseport_result =
reuseport_select_sock(sk, hash, skb,
sizeof(struct udphdr));
- if (result)
- return result;
+ if (reuseport_result)
+ reuseport_score = score;
+ continue;
}
badness = score;
result = sk;
}
}
+
+ if (badness < reuseport_score)
+ result = reuseport_result;
+
return result;
"
To break out after the first reuseport hit when it is safe, i.e., when
it holds no connected sockets, requires adding this state to struct
reuseport_sock at __ip4_datagram_connect. And modify
reuseport_select_sock to read this. At least, I have not found a more
elegant solution.
Steve, Re: your point on a scalable QUIC server. That is an interesting case certainly. Opening a connected socket per flow adds both memory and port table pressure. I once looked into an SO_TXONLY udp socket option that does not hash connected sockets into the port table. In effect receiving on a small set of listening sockets (e.g., one per cpu) and sending over separate tx-only sockets. That still introduces unnecessary memory allocation. OTOH it amortizes some operations, such as route lookup. Anyway, that does not fix the immediate issue you reported when using SO_REUSEPORT as described.
As for the BPF program: good point on accessing the udp port when skb->data is already beyond the header. Programs of type sk_filter can use bpf_skb_load_bytes(_relative). Which I think will work, but have not tested. As of kernel 4.19 programs of type BPF_PROG_TYPE_SK_REUSEPORT can be attached (with CAP_SYS_ADMIN). See tools/testing/selftests/bpf/progs/test_select_reuseport_kern.c for an example that parses udp headers with bpf_skb_load_bytes.