Thread (17 messages) 17 messages, 5 authors, 2019-09-10

Re: Is bug 200755 in anyone's queue??

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: 2019-09-03 17:56:20

Possibly related (same subject, not in this thread)

On Fri, Aug 30, 2019 at 4:30 PM Willem de Bruijn
[off-list ref] wrote:
On Fri, Aug 30, 2019 at 4:54 AM Eric Dumazet [off-list ref] wrote:
quoted


On 8/29/19 9:26 PM, Willem de Bruijn wrote:
quoted
SO_REUSEPORT was not intended to be used in this way. Opening
multiple connected sockets with the same local port.

But since the interface allowed connect after joining a group, and
that is being used, I guess that point is moot. Still, I'm a bit
surprised that it ever worked as described.

Also note that the default distribution algorithm is not round robin
assignment, but hash based. So multiple consecutive datagrams arriving
at the same socket is not unexpected.

I suspect that this quick hack might "work". It seemed to on the
supplied .c file:

                  score = compute_score(sk, net, saddr, sport,
                                        daddr, hnum, dif, sdif);
                  if (score > badness) {
  -                       if (sk->sk_reuseport) {
  +                       if (sk->sk_reuseport && !sk->sk_state !=
TCP_ESTABLISHED) {
This won't work for a mix of connected and connectionless sockets, of
course (even ignoring the typo), as it only skips reuseport on the
connected sockets.
quoted
quoted
But a more robust approach, that also works on existing kernels, is to
swap the default distribution algorithm with a custom BPF based one (
SO_ATTACH_REUSEPORT_EBPF).
Yes, I suspect that reuseport could still be used by to load-balance incoming packets
targetting the same 4-tuple.

So all sockets would have the same score, and we would select the first socket in
the list (if not applying reuseport hashing)
Can you elaborate a bit?

One option I see is to record in struct sock_reuseport if any port in
the group is connected and, if so, don't return immediately on the
first reuseport_select_sock hit, but continue the search for a higher
scoring connected socket.

Or do return immediately, but do this refined search in
reuseport_select_sock itself, as it has a reference to all sockets in the
group in sock_reuseport->socks[]. Instead of the straightforward hash.
That won't work, as reuseport_select_sock does not have access to
protocol specific data, notably inet_dport.

Unfortunately, what I've come up with so far is not concise and slows
down existing reuseport lookup in a busy port table slot. Note that it
is needed for both ipv4 and ipv6.

Do not break out of the port table slot early, but continue to search
for a higher scored match even after matching a reuseport:

"
   @@ -413,28 +413,39 @@ static struct sock *udp4_lib_lookup2(struct net *net,
                                     struct udp_hslot *hslot2,
                                     struct sk_buff *skb)
 {
+       struct sock *reuseport_result = NULL;
        struct sock *sk, *result;
+       int reuseport_score = 0;
        int score, badness;
        u32 hash = 0;

        result = NULL;
        badness = 0;
        udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
                score = compute_score(sk, net, saddr, sport,
                                      daddr, hnum, dif, sdif);
                if (score > badness) {
-                       if (sk->sk_reuseport) {
+                       if (sk->sk_reuseport &&
+                           sk->sk_state != TCP_ESTABLISHED &&
+                           !reuseport_result) {
                                hash = udp_ehashfn(net, daddr, hnum,
                                                   saddr, sport);
-                               result = reuseport_select_sock(sk, hash, skb,
+                               reuseport_result =
reuseport_select_sock(sk, hash, skb,
                                                        sizeof(struct udphdr));
-                               if (result)
-                                       return result;
+                               if (reuseport_result)
+                                       reuseport_score = score;
+                               continue;
                        }
                        badness = score;
                        result = sk;
                }
        }
+
+       if (badness < reuseport_score)
+               result = reuseport_result;
+
        return result;
"

To break out after the first reuseport hit when it is safe, i.e., when
it holds no connected sockets, requires adding this state to struct
reuseport_sock at __ip4_datagram_connect. And modify
reuseport_select_sock to read this. At least, I have not found a more
elegant solution.
Steve, Re: your point on a scalable QUIC server. That is an
interesting case certainly. Opening a connected socket per flow adds
both memory and port table pressure. I once looked into an SO_TXONLY
udp socket option that does not hash connected sockets into the port
table. In effect receiving on a small set of listening sockets (e.g.,
one per cpu) and sending over separate tx-only sockets. That still
introduces unnecessary memory allocation. OTOH it amortizes some
operations, such as route lookup.

Anyway, that does not fix the immediate issue you reported when using
SO_REUSEPORT as described.
As for the BPF program: good point on accessing the udp port when
skb->data is already beyond the header.

Programs of type sk_filter can use bpf_skb_load_bytes(_relative).
Which I think will work, but have not tested.

As of kernel 4.19 programs of type BPF_PROG_TYPE_SK_REUSEPORT can be
attached (with CAP_SYS_ADMIN). See
tools/testing/selftests/bpf/progs/test_select_reuseport_kern.c for an
example that parses udp headers with bpf_skb_load_bytes.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help