Re: RFC - VXLAN port range facility | netdev

From: Stephen Hemminger <stephen@networkplumber.org>
Date: 2013-05-31 17:22:37

On Fri, 31 May 2013 13:08:56 -0400
David Stevens [off-list ref] wrote:

Stephen Hemminger [off-list ref] wrote on 05/31/2013 12:13:38 PM:

RFC text:
 Outer UDP Header:  This is the outer UDP header with a source
        port provided by the VTEP and the destination port being a well
        known UDP port to be obtained by IANA assignment. It is recommended
        that the source port be a hash of the inner Ethernet frame's headers
        to obtain a level of entropy for ECMP/load balancing of the VM to VM
        traffic across the VXLAN overlay.
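
For illustration only, the draft's recommendation can be sketched in userspace as below. The name `pick_src_port` is hypothetical, CRC32 is a stand-in hash chosen for this sketch (not what any VXLAN implementation is confirmed to use), and the default bounds are illustrative values for an ephemeral-style range.

```python
import zlib

def pick_src_port(inner_frame: bytes, low: int = 32768, high: int = 61000) -> int:
    """Map a hash of the inner Ethernet header onto the range [low, high]."""
    if low > high:
        raise ValueError("low must not exceed high")
    # Inner Ethernet header: dst MAC (6) + src MAC (6) + EtherType (2).
    eth_header = inner_frame[:14]
    # 32-bit stand-in hash; an assumption of this sketch.
    h = zlib.crc32(eth_header)
    span = high - low + 1
    # Scale the 32-bit hash into [0, span - 1] without a modulo.
    return low + ((h * span) >> 32)
```

The same inner header always maps to the same port, so one flow stays on one ECMP path, and a range of a single port degenerates to that port.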


You can restrict to a smaller range if that is a requirement of your
infrastructure.

        I'm suggesting the smaller range, because the fix for the part
that is broken would become a resource issue for the current, larger
default range.
        [and a "recommended" in a draft doesn't trump 35 years of UDP
                usage, even if it did say not to bind the ports...]

Normal UDP applications assign their source port from the ephemeral port range,
so that is what VXLAN does.

        Normal UDP applications bind to the source port. If they are
unbound, they bind just for the send and then unbind after. They
cannot use a port already bound _because_the_bind_prohibits_it.
        That is, in fact, the entire issue I'm raising. (!) If I have
a UDP application that binds to port 35000, no other UDP application
will ever use that port until I release it, and any ICMP errors delivered
to my socket are triggered by my application.
        That became no longer true with the addition of VXLAN port ranges,
because VXLAN does not use UDP bind, or any of the UDP code, to enforce
this. It simply generates a random number in the range, which _can_be_
35000 or any other bound port, and then sends its own, constructed UDP
header using that port.
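
The exclusivity that bind provides can be seen from userspace with a minimal sketch like the following. The helper names are hypothetical, and it assumes plain UDP sockets with no SO_REUSEADDR or SO_REUSEPORT set.

```python
import errno
import socket

def bind_udp(port: int, host: str = "0.0.0.0"):
    """Return a UDP socket bound to (host, port), or None if the port is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None
        raise
    return sock

def second_bind_is_refused() -> bool:
    """Bind once, then check that a second bind to the same port fails."""
    app = bind_udp(0)  # port 0: let the kernel pick a free port
    try:
        port = app.getsockname()[1]
        other = bind_udp(port)
        if other is not None:
            other.close()
            return False
        return True
    finally:
        app.close()
```

A sender that builds its own UDP header with a random source port never goes through this check, which is the gap described above.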

        The proper way to fix this would be to actually bind to a port in
the range, and retry another port if the binding fails, until the binding
succeeds. But as VXLAN picks a randomized source port _for_each_packet_,
I'm not suggesting we do that.
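
As a rough userspace sketch of that bind-and-retry approach (the name `bind_in_range` and the attempt limit are assumptions made for illustration):

```python
import errno
import random
import socket

def bind_in_range(low: int, high: int, attempts: int = 16, host: str = "0.0.0.0"):
    """Try random ports in [low, high] until one binds; return None on failure."""
    for _ in range(attempts):
        port = random.randint(low, high)  # inclusive on both ends
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            sock.bind((host, port))
            return sock  # the port is now ours until the socket is closed
        except OSError as exc:
            sock.close()
            if exc.errno != errno.EADDRINUSE:
                raise  # anything other than "port taken" is a real error
    return None
```

Doing this once per socket is cheap; doing it once per packet is the cost being weighed here.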
        I'm suggesting, instead, that we bind on all the source ports we
will use at start-up, which then reserves those ports for VXLAN and
prevents anyone else from binding on them.
        That solves the issue of binding and unbinding on each packet,
but I am not then suggesting that VXLAN should bind on 30,000 ports on
start-up. That would be silly, especially on a system whose primary
function is not VXLAN.
        So, the logical next question is: does VXLAN really need a range
of 30,000 ports as the "normal" circumstance? I think the answer to that
is definitely "no." In fact, just one port would work fine a lot of the
time, and when multiple ports are needed, the capability is still there.
That suggests changing the *default* range (I suggest to 1 port).
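
The pre-binding idea can be sketched in userspace as follows. `reserve_ports` and `release_ports` are hypothetical names, and skipping ports that are already taken is a choice made for this sketch, not something proposed in the thread.

```python
import errno
import socket

def reserve_ports(low: int, high: int, host: str = "0.0.0.0") -> dict:
    """Bind every free UDP port in [low, high] and hold the sockets open."""
    reserved = {}
    for port in range(low, high + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            sock.bind((host, port))
        except OSError as exc:
            sock.close()
            if exc.errno == errno.EADDRINUSE:
                continue  # someone else owns this port; leave it alone
            raise
        reserved[port] = sock
    return reserved

def release_ports(reserved: dict) -> None:
    """Close all held sockets, giving the ports back."""
    for sock in reserved.values():
        sock.close()
    reserved.clear()
```

Each reserved port costs one open socket for as long as it is held, which is why the size of the default range matters.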

The range could be smaller, yes, but that means you are restricting
hashing.

        My conclusions from that reasoning:

1) VXLAN use of UDP source ports is broken; it cannot use ports that are
        already bound, and right now it does
2) while a bind/unbind would work, doing that on every packet is slow

The problem is that bind/unbind is a flow state operation, and
keeping flow state wouldn't scale.

so,

3) the default port range should be much smaller and VXLAN should bind
        in advance to the set of ports it wants to use.

Probably should not overlap the ephemeral port range for applications.

Now, maybe it wouldn't kill performance, and so doing a bind/unbind per
packet is still an option, but that would definitely hurt performance
for people who don't actually care about port entropy.

What about a peek operation that just avoids existing ports?
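
One possible reading of such a peek, sketched in userspace: start from the hashed port and step past any port known to be in use. The name `pick_free_port` is hypothetical, and how the `in_use` set would be obtained is left open here.

```python
def pick_free_port(flow_hash: int, low: int, high: int, in_use: set):
    """Return the first port at or after the hashed one that is not in use.

    Wraps around within [low, high]; returns None if every port is taken.
    """
    span = high - low + 1
    start = flow_hash % span
    for step in range(span):
        port = low + (start + step) % span
        if port not in in_use:
            return port
    return None
```

A peek like this only avoids ports that are bound at the moment of the check; it reserves nothing, so an application could still bind the chosen port afterwards.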

Whether solved by a bind/unbind, pre-binding to a smaller default port
range, or a switch between the two, I think VXLAN *must* follow the
rules in its use of UDP and ensure that it doesn't send using source
ports in use by something else. It can't just generate a random one
and use it without checking it, as it does now.

                                                                +-DLS
