Re: macvlan devices and vlan interaction

From: Alexander Duyck <hidden>
Date: 2018-01-30 20:49:24

On Tue, Jan 30, 2018 at 12:29 PM, Shannon Nelson
[off-list ref] wrote:

On 1/29/2018 3:01 PM, Keller, Jacob E wrote:

Hi,

I'm currently investigating how macvlan devices behave in regards to vlan
support, and found some interesting behavior that I am not sure how best to
correct, or what the right path forward is.

If I create a macvlan device:

ip link add link ens0 name macvlan0 type macvlan

and then add a VLAN to it:

ip link add link macvlan0 name vlan10 type vlan id 10

This works to pass VLAN 10 traffic over the macvlan device. This seems
like expected behavior.

However, if I then also add vlan 10 to the lowerdev:

ip link add link ens0 name lowervlan10 type vlan id 10

Then traffic stops flowing to the VLAN on the macvlan device.

This happens, as far as I can tell, because the VLAN traffic is filtered
first and then forwarded to the VLAN device on the lowerdev, which doesn't
know that the macvlan device exists.

It seems, essentially, that vlan stacked on top of a macvlan shouldn't
work. Because the vlan code basically expects each vlan to apply to every
MAC address, and the macvlan device works by putting its MAC address into
the unicast address list, there's no way for a device driver to know when or
how to apply the vlan.

This gets a bit more confusing when we add in the l2 fwd hardware offload.

Currently, at least for the Intel network parts, this isn't supported,
because of a bug in which the device drivers don't apply the VLANs to the
macvlan accelerated addresses. If we fix this, at least for fm10k, the
behavior is slightly better, because of how the hardware filtering at the
MAC address happens first, and we direct the traffic to the proper device
regardless of VLAN.

A further peculiarity, beyond the one with VLANs on both the macvlan and
the lowerdev, is that when a macvlan device adds a VLAN, the lowerdev gets
an indication to add the VLAN via its .ndo_vlan_rx_add_vid(), which doesn't
distinguish which addresses the VLAN might apply to. Depending on hardware
design, it thus simply enables the VLAN for all its unicast and multicast
addresses. Some hardware could theoretically support MAC+VLAN pairs, where
it could distinguish that a VLAN should only be added for some subset of
addresses. Other hardware might not be so lucky.

Unfortunately, this has the weird consequence that if we have the
following stack of devices:

vlan10@macvlan0
macvlan0@ens0
ens0

Then ens0 will receive VLAN10 traffic on every address. So VLAN 10 traffic
destined to the MAC of the lowerdev will be received, instead of dropped.
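
The leak described above can be illustrated with a toy model. This is a
sketch in Python for illustration only, not kernel or driver code; the class
names, method names, and MAC placeholder strings are all hypothetical. It
contrasts a filter whose MAC and VLAN tables are checked independently (what
a per-device VLAN add with no address attached amounts to) with a filter
keyed on MAC+VLAN pairs.

```python
# Toy model only: not kernel code. All names here are hypothetical.

class IndependentFilter:
    """MAC table and VLAN table are checked separately."""

    def __init__(self):
        self.macs = set()
        self.vids = set()

    def add_mac(self, mac):
        self.macs.add(mac)

    def add_vid(self, vid):
        # Like a per-device VLAN add: no MAC address comes with the request.
        self.vids.add(vid)

    def accepts(self, dst_mac, vid):
        # vid is None for untagged frames, which pass the VLAN check here.
        return dst_mac in self.macs and (vid is None or vid in self.vids)


class PairFilter:
    """Each VLAN is enabled only for specific MAC addresses."""

    def __init__(self):
        self.pairs = set()

    def add_pair(self, mac, vid):
        self.pairs.add((mac, vid))

    def accepts(self, dst_mac, vid):
        return (dst_mac, vid) in self.pairs


LOWER_MAC = "lower-mac"      # stands in for the ens0 address
MACVLAN_MAC = "macvlan-mac"  # stands in for the macvlan0 address

independent = IndependentFilter()
independent.add_mac(LOWER_MAC)
independent.add_mac(MACVLAN_MAC)
independent.add_vid(10)  # requested on behalf of vlan10@macvlan0

paired = PairFilter()
paired.add_pair(LOWER_MAC, None)    # untagged traffic for ens0
paired.add_pair(MACVLAN_MAC, None)  # untagged traffic for macvlan0
paired.add_pair(MACVLAN_MAC, 10)    # VLAN 10 only for macvlan0

leak_independent = independent.accepts(LOWER_MAC, 10)  # True: leaked
leak_paired = paired.accepts(LOWER_MAC, 10)            # False: dropped
```

In this model the independent filter accepts VLAN 10 frames addressed to the
lowerdev MAC, while the pair filter drops them and still accepts VLAN 10
frames addressed to the macvlan MAC.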

If we add VLAN 10 to the lowerdev so we have both the above stack and also

lowervlan10@ens0
ens0 (mac gg:hh:ii:jj:kk)

then all vlan 10 traffic will be received on the lowerdev VLAN 10, without
any being forwarded to the VLAN10 attached to the macvlan.

However, if we add two macvlans, and each add the vlan10, so we have the
following:

avlan10@macvlan0
macvlan0@ens0
ens0

bvlan10@macvlan1
macvlan1@ens0
ens0

In this case, it does appear that traffic is sorted out correctly. It
seems that only if the lowerdev gets the VLAN does it end up breaking. If I
remove bvlan10 from macvlan1, the traffic associated with vlan10 is still
received by macvlan1, even though in principle it should no longer be.

What is the correct behavior here? Should this just be "administrators
should know better"? I don't think that's a great argument, and either way
we're still essentially leaking VLANs across the macvlan interfaces, which I
don't think is ideal.

I see two possible solutions:

1) modify macvlan driver so that it is marked as VLAN_CHALLENGED, and thus
indicate it cannot handle VLAN traffic on top of it.
   a. In order to get the VLANs associated, administrator could instead
add the VLAN first, and then add the macvlan on top. This I think is a
better configuration.
   b. that doesn't work in the offload case, unless/until we fix the VLAN
interface to forward the l2_dfwd_add_station() along with a vid.
   c. this could appear as loss of functionality, since in some cases
these VLAN on top of macvlan work today (with the interesting caveats listed
above).

2) modify how VLANs interact with MAC addresses, so that the lowerdev can
explicitly be aware of which VLANs are tied to which address groups, in
order to allow for the explicit configuration of which MAC+VLAN pairs are
actually allowed.
   a. this is a much more invasive change to driver interface, and more
difficult to get right
   b. possibly other configurations of stacked devices might have a
similar problem, so we could solve more here? Or create more problems. I'm
not really certain.


I think the correct solution is (1) but I wasn't sure what others thought,
and whether anyone else has encountered the problems I mention and outline
above. I cc'd Alex who I discussed with offline when I first heard of and
began investigating this, in case he has anything further to add.

Regards,
Jake

Hi Jake,

The current behavior seems logical to me, but I suppose Alex might argue
differently.  The macvlan was put onto the default lowerdev assuming the
lowerdev will hand it all the default traffic, and then the macvlan splits
out its own vlan traffic.  As soon as the lowerdev assumption changes, it is
going to change what gets pushed up to the macvlan dev. If the lowerdev is
separating the vlan traffic out of the "default" flow headed to the macvlan,
then the initial assumption has changed and the vlan traffic has been
vectored off before it can be delivered up the stack to the macvlan.

It depends on what your goal is. In my mind making macvlan VLAN
challenged is the easier solution since you just have to add some
pass-thru ops to the VLAN drivers and you can guarantee that you are
passing a MAC-VLAN pair for each address on the interface for the call.
The alternative gets to be a bit more complex since it requires
multiple rules, one for non-tagged and one per VLAN for tagged
traffic.

There's an argument that the lowerdev shouldn't know anything about the
upperdev's routing, just deliver to the upperdev and let the upperdev worry
about it.  But perhaps this becomes a question of precedence: does the
lowerdev split traffic first by mac address or by vlan tag?

That is where things get messy. We found it splits by VLAN tag if the
VLAN is present on the lowerdev, or it splits by MAC if it is not.
That is why as Jake pointed out adding the VLAN to the lower dev
causes issues.
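
As a rough illustration of that ordering, here is a toy Python model. It is
not the kernel receive path; the function name, its parameters, and the MAC
placeholder string are hypothetical, and it only models traffic addressed to
a macvlan.

```python
# Toy model only: not the kernel receive path. All names are hypothetical.

def deliver(frame_vid, dst_mac, lower_vlans, macvlans):
    """Return the name of the device that ends up with the frame, or None.

    frame_vid:   VLAN id of the frame, or None if untagged
    lower_vlans: dict of vid -> VLAN device stacked directly on the lowerdev
    macvlans:    dict of mac -> (macvlan name, dict of vid -> VLAN device
                 stacked on that macvlan)
    """
    # Step 1: a tagged frame whose VLAN exists on the lowerdev is claimed
    # there, before any MAC-based hand-off to a macvlan is attempted.
    if frame_vid is not None and frame_vid in lower_vlans:
        return lower_vlans[frame_vid]

    # Step 2: otherwise split by destination MAC.
    if dst_mac in macvlans:
        name, upper_vlans = macvlans[dst_mac]
        if frame_vid is None:
            return name
        # The macvlan sorts out its own VLANs; unknown ones are dropped.
        return upper_vlans.get(frame_vid)

    return None  # traffic for the lowerdev itself is not modeled


macvlans = {"macvlan-mac": ("macvlan0", {10: "vlan10"})}

# VLAN 10 exists only on the macvlan: the frame reaches vlan10.
before = deliver(10, "macvlan-mac", {}, macvlans)

# VLAN 10 is also added to the lowerdev: the lowerdev VLAN takes the frame.
after = deliver(10, "macvlan-mac", {10: "lowervlan10"}, macvlans)
```

Here `before` is "vlan10" and `after` is "lowervlan10", which matches the
symptom Jake reported once lowervlan10 is created.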

I don't like your option 1: as you point out, it breaks current
functionality, likely depended upon in some containers that are using
macvlans to manage their traffic.  We don't know what's going on inside that
container and I don't think we want to break its ability to split its own
vlans.

Maybe we should look at an option 1.5. Mark the lowerdev as VLAN
challenged if any macvlan is operating with any VLANs enabled on it
since we can only really allow VLAN filtering to occur at one level
reliably. Either that or maybe we look at making VLANs and rx_handler
setups mutually exclusive.

Like I said, I think the current behavior is mostly correct, but a version
of option 2 might be good to help support offload of the mac+vlan pair into
a macvlan channel.

The only issue is I am not completely sure how option 2 solves the
original issue. Yes it makes the filtering more explicit, but the
network stack is still filtering VLANs before we get to the rx_handler
calls, or is this a fix that works for the offloaded approach only and
doesn't address the issues in the non-offloaded case? It's also
possible I might have missed something.

- Alex
