Re: [PATCH v3 1/2] net: core: Notify on changes to dev->promiscuity.

From: Allan W. Nielsen <hidden>
Date: 2019-09-03 08:14:16
Also in: lkml

Hi Ido,

The 09/03/2019 09:13, Ido Schimmel wrote:

On Mon, Sep 02, 2019 at 07:42:31PM +0200, Allan W. Nielsen wrote:

quoted

I have been reading through this thread several times and I still do not get it.

I kept thinking about this and I want to make sure that I correctly
understand the end result.

So do I, and regardless of this being merged or not, I'm really happy that we
have the discussion as there are clearly a need for synchronizing the meaning of
promiscuity.

One of the reasons why we started working on this patch was the comments we got
when requesting comments on the "[PATCH] net: bridge: Allow bridge to join
multicast groups" where we want to be able to setup mac-entries for L2 multicast
MAC addresses. Here it is important for us to be able to control which ports a
static configured multicast address needs to be forwarded to (including the
CPU). When working on this patch it was pointed out that the interfaces are in
promisc mode anyway (if member of a bridge, and if flooding and learning are
enabled), and it therefore this optimization should not matter anyway.

I found that this was a fair comment, and it has been on the list of things we
wanted to "fix" for a long time.

The optimization does matter to us, as we have until now worked around the issue
by not implementing promisc mode.

When debugging, it is very useful for us to be able to see all the traffic RXed
on the interface, and to us it did make a lot of sense to "just" make promisc
mode work like we was used to.

But, this is for debugging, it is not the most important patch we have on the
backlog, and I would therefore prefer finding a solution which we are all happy
with.

With these patches applied I assume I will see the following traffic
when running tcpdump on one of the netdevs exposed by the ocelot driver:

- Ingress: All
- Egress: Only locally generated traffic and traffic forwarded by the
  kernel from interfaces not belonging to the ocelot driver

The above means I will not see any offloaded traffic transmitted by the
port. Is that correct?

Correct - but maybe we should change this.

In Ocelot and in LANxxxx (the part we are working on now), we can come pretty
close. We can get the offloaded TX traffic to the CPU, but it will not be
re-written (it will look like the ingress frame, which is not always the same as
the egress frame, vlan tags an others may be re-written).

In some of our chips we can actually do this (not Ocelot, and not the LANxxxx
part we are working on now) after the frame as been re-written.

I see that the driver is setting 'offload_fwd_mark' for any traffic trapped
from bridged ports, which means the bridge will drop it before it traverses
the packet taps on egress.

Correct.

Large parts of the discussion revolve around the fact that switch ports
are not any different than other ports. Dave wrote "Please stop
portraying switches as special in this regard" and Andrew wrote "[The
user] just wants tcpdump to work like on their desktop."

And we are trying to get as close to this as practical possible, knowing that it
may not be exactly the same.

But if anything, this discussion proves that switch ports are special in
this regard and that tcpdump will not work like on the desktop.

I think it can come really close. Some drivers may be able to fix the TX issue
you point out, others may not.

Beside the fact that I don't agree (but gave up) with the new
interpretation of promisc mode, I wonder if we're not asking for trouble
with this patchset. Users will see all offloaded traffic on ingress, but
none of it on egress. This is in contrast to the sever/desktop, where
Linux is much more dominant in comparison to switches (let alone hw
accelerated ones) and where all the traffic is visible through tcpdump.
I can already see myself having to explain this over and over again to
confused users.

Now, I understand that showing egress traffic is inherently difficult.
It means one of two things:

1. We allow packets to be forwarded by both the software and the
hardware
2. We trap all ingressing traffic from all the ports

If the HW cannot copy the egress traffic to the CPU (which our HW cannot), then
you need to do both. All ingress traffic needs to go to the CPU, you need to
make all the forwarding decisions in the CPU, to figure out what traffic happens
to go to the port you want to monitor.

I really doubt this will work in real life. Too much traffic, and HW may make
different forwarding decision that the SW (tc rules in HW but not in SW), which
means that it will not be good for debugging anyway.

Both options can have devastating effects on the network and therefore
should not be triggered by a supposedly innocent invocation of tcpdump.

Agree.

I again wonder if it would not be wiser to solve this by introducing two
new flags to tcpdump for ingress/egress (similar to -Q in/out) capturing
of offloaded traffic. The capturing of egress offloaded traffic can be
documented with the appropriate warnings.

Not sure I agree, but I will try to spend some more time considering it.

In the mean while, what TC action was it that Jiri suggestion we should use? The
trap action is no good, and it prevents the forwarding in silicon, and I'm not
aware of a "COPY-TO-CPU" action.

Anyway, I don't want to hold you up, I merely want to make sure that the
above (assuming it's correct) is considered before the patches are
applied.

Sounds good, and thanks for all the time spend on reviewing and asking the
critical questions.

/Allan

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help