Re: RFC - VXLAN port range facility
From: Stephen Hemminger <stephen@networkplumber.org>
Date: 2013-05-31 17:22:37
On Fri, 31 May 2013 13:08:56 -0400 David Stevens [off-list ref] wrote:

> Stephen Hemminger [off-list ref] wrote on 05/31/2013 12:13:38 PM:
>
> > RFC text:
> >   Outer UDP Header: This is the outer UDP header with a source port
> >   provided by the VTEP and the destination port being a well known
> >   UDP port to be obtained by IANA assignment. It is recommended
> >   that the source port be a hash of the inner Ethernet frame's
> >   headers to obtain a level of entropy for ECMP/load balancing of
> >   the VM to VM traffic across the VXLAN overlay.
> >
> > You can restrict to a smaller range if that is a requirement of your
> > infrastructure.
>
> I'm suggesting the smaller range, because the fix for the part that is
> broken would become a resource issue for the current, larger default
> range. [and a "recommended" in a draft doesn't trump 35 years of UDP
> usage, even if it did say not to bind the ports...]
>
> > Normal UDP applications assign their source port from the ephemeral
> > port range, so that is what VXLAN does.
>
> Normal UDP applications bind to the source port. If they are unbound,
> they bind just for the send and then unbind after. They cannot use a
> port already bound _because_the_bind_prohibits_it. That is, in fact,
> the entire issue I'm raising. (!)
>
> If I have a UDP application that binds to port 35000, no other UDP
> application will ever use that port until I release it, and any ICMP
> errors delivered to my socket are triggered by my application. That
> became no longer true with the addition of VXLAN port ranges, because
> VXLAN does not use UDP bind, or any of the UDP code, to enforce this.
> It simply generates a random number in the range, which _can_be_ 35000
> or any other bound port, and then sends its own, constructed UDP header
> using that port.
>
> The proper way to fix this would be to actually bind to a port in the
> range, and retry another port if the binding fails, until the binding
> succeeds. But as VXLAN picks a randomized source port
> _for_each_packet_, I'm not suggesting we do that. I'm suggesting,
> instead, that we bind on all the source ports we will use at start-up,
> which then reserves those ports for VXLAN and prevents anyone else from
> binding on them. That solves the issue of binding and unbinding on each
> packet, but I am not then suggesting that VXLAN should bind on 30,000
> ports on start-up. That would be silly, especially on a system whose
> primary function is not VXLAN.
>
> So, the logical next question is: does VXLAN really need a range of
> 30,000 ports as the "normal" circumstance? I think the answer to that
> is definitely "no." In fact, just one port would work fine a lot of the
> time, and when multiple ports are needed, the capability is still
> there. That suggests changing the *default* range (I suggest to 1
> port).

The range could be smaller yes, but that means you are restricting hashing.
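
To make the hashing trade-off concrete, here is a rough user-space sketch in Python of mapping a hash of the inner Ethernet header onto a configured source port range. The names `src_port_for_flow` and `distinct_ports`, the use of `zlib.crc32` as the hash, and the example port numbers are all assumptions for illustration; this is not the driver code.

```python
import zlib


def src_port_for_flow(inner_eth_header: bytes, port_min: int, port_max: int) -> int:
    """Map a flow's inner Ethernet header onto [port_min, port_max].

    crc32 is only a stand-in hash chosen for this sketch.
    """
    if not (0 < port_min <= port_max <= 65535):
        raise ValueError("invalid port range")
    span = port_max - port_min + 1
    flow_hash = zlib.crc32(inner_eth_header) & 0xFFFFFFFF
    # Scale the 32-bit hash into the range instead of taking a modulus.
    return port_min + ((flow_hash * span) >> 32)


def distinct_ports(flows, port_min: int, port_max: int) -> set:
    """Return the set of source ports that a list of flows would use."""
    return {src_port_for_flow(flow, port_min, port_max) for flow in flows}


if __name__ == "__main__":
    # 1000 synthetic 12-byte headers (destination MAC + source MAC).
    flows = [bytes([0, 0, 0, 0, 0, i % 256, 0, 0, 0, 0, 1, i // 256])
             for i in range(1000)]
    print(len(distinct_ports(flows, 32768, 61000)))  # wide range: many ports
    print(len(distinct_ports(flows, 35000, 35000)))  # one-port range: one port
```

With a one-port range every flow gets the same source port, so anything doing ECMP on the outer UDP header sees a single flow.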

> My conclusions from that reasoning:
> 1) VXLAN use of UDP source ports is broken; it cannot use ports that are
> already bound, and right now it does
> 2) while a bind/unbind would work, doing that on every packet is slow

The problem is the bind/unbind is a flow state operation, and keeping
flow state wouldn't scale.

> so,
> 3) the default port range should be much smaller and VXLAN should bind
> in advance to the set of ports it wants to use.

Probably should not overlap ephemeral port range for applications.

> Now, maybe it wouldn't kill performance, and so doing a bind/unbind per
> packet is still an option, but that would definitely hurt performance
> for people who don't actually care about port entropy.

What about a peek operation that just avoids existing ports?
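
As a rough sketch of what such a peek could look like, the Python below starts at the hashed port and probes forward until it finds one that is not in use. The function name `pick_port_avoiding` and the `in_use` set are hypothetical; the set stands in for whatever lookup the UDP layer would actually provide.

```python
def pick_port_avoiding(flow_hash: int, port_min: int, port_max: int, in_use: set) -> int:
    """Start from the hashed port and probe forward, skipping ports in `in_use`.

    `in_use` is a placeholder for a real "is this port bound?" lookup.
    """
    span = port_max - port_min + 1
    start = flow_hash % span
    for offset in range(span):
        # Wrap around to port_min after passing port_max.
        candidate = port_min + (start + offset) % span
        if candidate not in in_use:
            return candidate
    raise RuntimeError("every port in the range is in use")


if __name__ == "__main__":
    # Port 35000 is taken by some application, so the flow moves to 35001.
    print(pick_port_avoiding(0, 35000, 35003, {35000}))
```

A check like this keeps no per-flow state, but it also reserves nothing, so a socket could presumably still bind the chosen port after the peek.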

> Whether solved by a bind/unbind, pre-binding to a smaller default port
> range, or a switch between the two, I think VXLAN *must* follow the
> rules in its use of UDP and ensure that it doesn't send using source
> ports in use by something else. It can't just generate a random one
> and use it without checking it, as it does now.
>
> +-DLS
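
For reference, the pre-binding idea can be sketched in user space as below: bind one UDP socket per port in a small range at start-up and hold them open, skipping any port that is already bound. The helper names `reserve_udp_ports` and `release`, the loopback address, and the example range are assumptions for illustration only; a kernel implementation would go through the UDP layer rather than the socket API.

```python
import socket


def reserve_udp_ports(port_min: int, port_max: int, host: str = "127.0.0.1") -> dict:
    """Bind one UDP socket per port in [port_min, port_max].

    Ports that are already bound by someone else are skipped.
    Returns a dict mapping each reserved port to its open socket.
    """
    reserved = {}
    for port in range(port_min, port_max + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            sock.bind((host, port))
        except OSError:
            # Bind failed (for example, port already in use): leave it alone.
            sock.close()
            continue
        reserved[port] = sock
    return reserved


def release(reserved: dict) -> None:
    """Close every reserved socket, giving the ports back."""
    for sock in reserved.values():
        sock.close()


if __name__ == "__main__":
    ports = reserve_udp_ports(35000, 35003)  # hypothetical 4-port range
    print(sorted(ports))                     # the ports actually reserved
    release(ports)
```

While the sockets stay open, a second bind to any reserved port fails, which is the property being asked for; the cost is one socket per port, which is why the size of the default range matters.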