Re: [PATCH v2 net-next 0/3] ipv4: Hash-based multipath routing

From: Tom Herbert <hidden>
Date: 2015-08-30 22:29:06
Also in: linux-api


On Sun, Aug 30, 2015 at 2:28 PM, Peter Nørlund [off-list ref] wrote:

On Sat, 29 Aug 2015 13:59:08 -0700
Tom Herbert [off-list ref] wrote:

quoted

On Sat, Aug 29, 2015 at 1:46 PM, David Miller [off-list ref]
wrote:

quoted

From: Peter Nørlund <pch-chEQUL3jiZBWk0Htik3J/w@public.gmane.org>
Date: Sat, 29 Aug 2015 22:31:15 +0200

quoted

On Sat, 29 Aug 2015 13:14:29 -0700 (PDT)
David Miller [off-list ref] wrote:

quoted

From: pch-chEQUL3jiZBWk0Htik3J/w@public.gmane.org
Date: Fri, 28 Aug 2015 22:00:47 +0200

quoted

When the routing cache was removed in 3.6, the IPv4 multipath
algorithm changed from more or less being destination-based into
being quasi-random per-packet scheduling. This increases the
risk of out-of-order packets and makes it impossible to use
multipath together with anycast services.

Don't even try to be fancy.

Simply kill the round-robin stuff off completely, and make hash
based routing the one and only mode, no special configuration
stuff necessary.

I like the sound of that! Just to be clear - are you telling me to
stick with L3 and skip the L4 part?

For now it seems best to just do L3 and make ipv4 and ipv6 behave
the same.

This might be simpler if we just go directly to L4, which should give
better load balancing and is what most switches are doing anyway. The
hash comes from:

1) If a lookup includes an skb, we just need to call skb_get_hash.
2) If we have a socket and sk->sk_txhash is nonzero then use that.
3) Else compute a hash from flowi. We don't have the exact functions
for this, but they can be easily derived from __skb_get_hash_flowi4
and __skb_get_hash_flowi6 (i.e. create general get_hash_flowi4 and
get_hash_flowi6 and then call these from skb functions and multipath
lookup).
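
The selection order in the three steps above can be sketched as a small
userspace model. This is only an illustration under stated assumptions:
the model_* structs and functions are hypothetical stand-ins for the skb,
the socket and flowi4, and the mixer is a placeholder rather than the
hash the kernel actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel objects named in the list. */
struct model_skb  { uint32_t hash; };       /* what skb_get_hash() would return */
struct model_sock { uint32_t sk_txhash; };  /* models sk->sk_txhash */
struct model_flowi4 {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	uint8_t  proto;
};

/* Placeholder 32-bit mixer; an assumption, not the kernel's hash. */
static uint32_t model_mix32(uint32_t h, uint32_t v)
{
	h ^= v;
	h *= 0x9e3779b1u;
	h ^= h >> 15;
	return h;
}

/* Step 3: derive a hash from the flow key alone. */
static uint32_t model_get_hash_flowi4(const struct model_flowi4 *fl4)
{
	uint32_t h = 0;

	h = model_mix32(h, fl4->saddr);
	h = model_mix32(h, fl4->daddr);
	h = model_mix32(h, ((uint32_t)fl4->sport << 16) | fl4->dport);
	h = model_mix32(h, fl4->proto);
	return h;
}

/* Apply the three sources in order: skb, then socket, then flow key. */
static uint32_t model_multipath_hash(const struct model_skb *skb,
				     const struct model_sock *sk,
				     const struct model_flowi4 *fl4)
{
	assert(fl4 != NULL);

	if (skb)                      /* 1) lookup includes an skb */
		return skb->hash;
	if (sk && sk->sk_txhash)      /* 2) socket with a nonzero txhash */
		return sk->sk_txhash;
	return model_get_hash_flowi4(fl4);  /* 3) fall back to the flow key */
}
```

A caller would then reduce the returned value to a next-hop index, for
example by taking it modulo the number of paths.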

It would definitely be simpler, and it would be nice to just fetch the
hash directly from the NIC - and for link aggregation it would probably
be fine. But with L4, we always need to consider fragmented packets,
which might cause some packets of a flow to be routed differently - and
with ECMP, the ramifications of suddenly choosing another path for a
flow are worse than for link aggregation. The latency through the
different paths may differ enough to cause out-of-order packets and bad
TCP performance as a consequence. Both Cisco and Juniper routers
default to L3 for ECMP - exactly for that reason, I believe. RFC 2991
also points out that ports probably shouldn't be used as part of the
flow key with ECMP.

That's more reason why we need vendors to use the IPv6 flow label instead
of ports to do ECMP :-). In any case, if we're fragmenting TCP packets
then we're already in a bad place performance-wise; we really don't
need to optimize for that case. Still, it would be nice if all fragments
of a packet followed the same path, but that would require devices not
to do an L4 hash over ports when MF is set. I don't know if anyone does
that (I have been meaning to add that to the stack).

With anycast it is even worse. Depending on how anycast is used,
changing path may destroy a TCP connection. And without special
treatment of ICMP, ICMP packets may hit another anycast node, causing
PMTU to fail. Cloudflare recognized this and solved it by letting a
user space daemon (pmtud) route ICMP packets through all paths, ensuring
that the anycast node receives the ICMP. But a more efficient solution
is to handle the issue within the kernel.

It might be possible to do L4 on flows using PMTU, and while it
is possible to extract addresses and ports from the ICMP payload, you
can't rely on the DF-bit in the ICMP payload, since it comes from the
opposite flow (flow A->B uses PMTU while B->A doesn't). I guess you
can technically reduce the number of possible paths to two, though.

I obviously prefer the default to be L3 with ICMP handling, since I
specifically plan to use ECMP together with anycast (albeit anycasted
load balancers which synchronize state, although with a delay), but I
also recognize that anycast is a special case. The question is: is it so
much of a special case that it belongs outside the vanilla kernel?

OTOH, if the hash always depends on fixed fields of a connection
(L3 or L4) then the path can never change during the lifetime of a
connection. That is a bad thing if we want to try a different path
when a connection is failing (via ipv4_negative_advice). This is why
there is value in using sk->sk_txhash as a route selector.

It is stunning that anycast works at all given its dependency on the
network path being stable, but I suppose it is functionality we'll need
to preserve.

Tom

Regards,
 Peter Nørlund


