Re: [PATCH v3] net/mlx4: fix dev rmv not detected after port stop

From: Adrien Mazarguil <hidden>
Date: 2018-02-02 19:53:21

Hi Matan,

On Wed, Jan 31, 2018 at 05:07:56PM +0000, Matan Azrad wrote:

Hi Adrien

From: Adrien Mazarguil , Sent: Wednesday, January 31, 2018 4:32 PM

quoted

Hi Matan,

On Wed, Jan 31, 2018 at 01:44:41PM +0000, Matan Azrad wrote:

quoted

Hi Adrien

<snip>

quoted

I don't know what any application does but for me it is a mistake to
stop all event processes in dev_stop(), Maybe for other application

maintainers too.

Just like you, I don't know either what all the applications ever written for
DPDK expect out of dev_stop(). What's for sure is that currently, LSC/RMV
don't occcur afterward, the same way these events do not occur before
dev_start().

Why not? RMV event can occur any time after probe.

LSC as well (keep in mind this patch modifies the behavior for both
events). RMV events may also occur before application has a chance to
register a handler for it, in which case this approach fails to solve the
problem it's supposed to solve. Mitigate all you want, the application still
can't rely on that event only.

quoted

Any application possibly relying on this fact will break. In such a
situation, a conservative approach is better.

If an application should fail to get event in stopped state it may fail in the previous code too:
The interrupt run from host thread so the next race may occur:
dev_start() : master thread.
Context switch.
RMV interrupt started to run callbacks: host thread.
Context switch.
dev_stop(): master thread.
Start reconfiguration: master thread. 
Context switch.
Callback running.

So, the only thing which can disable callback running after dev_stop() is callback unregistration before it.

After dev_stop() returns, new events cannot be triggered by the PMD which is
what matters. Obviously a callback that already started to run before that
will eventually have to complete. What's your point?

There's a race only if an application performs multiple simultaneous control
operations on the underlying device, but this has always been unsafe (not
only during RMV) because there are no locks, it's documented as such.

quoted

Setting up RMV/LSC callbacks is not the only configuration an
application usually performs before calling dev_start(). Think about
setting up flow rules, MAC addresses, VLANs, and so on, this on
multiple ports before starting them up all at once. Previously it
could be done in an unspecified order, now they have to take special care

for RMV/LSC.

quoted

Or maybe there callbacks code are already safe for it.
Or they manages the unregister\register calls in the right places.

That's my point, these "maybes" don't argue in favor of changing things.

What I'm saying is that callbacks should be safe or not registered in the right time.

I understand that, though it's not a valid counter argument :)

quoted

Many devops are only safe when called while a device is stopped.
It's even documented in rte_ethdev.h.

And?

...And applications therefore often do all their configuration in an unspecified
order while a port is stopped as a measure of safety. No extra care is taken
for RMV/LSC. This uncertainty can be addressed by not modifying the current
behavior.

Or they expect to get interrupt and the corner case will come later if we will not change it.

Look, we're throwing opposite use cases at each other in order to make a
point, and I don't see an end to this since we're both stubborn. Let's thus
assume applications use a bit of both.

Now we're left with a problem, before this patch neither use cases were
broken. Now it's applied, mine is broken so let's agree something needs to
be done. Either all affected applications need to be updated, or we can
simply revert this and properly fix fail-safe instead.

<snip>

quoted

So, at least for RMV event, we need the notification also in stopped state.

You sent the rte_eth_dev_is_removed() series. You're aware that PMDs
implementing this call benefit from an automatic is_removed() check on all
remaining devops whenever some error occur.
In short, an application will get notified simply by getting dev_start() (or any
other callback) return -EIO and not being able to use the device.

 
Yes, but between dev_stop to dev_start may not be any ethdev API calling.

So what, if an application is not using the device, why does it need to know
it's been removed? If it's that important, why can't it run its own periodic
rte_eth_dev_is_removed() probe?

quoted

PMDs that do not implement is_removed() (or in addition to it) could also
artificially trigger a RMV event after dev_start() is called. As long as the PMD
remains quiet while the device is stopped, it's fine.

How can the PMD do it after dev_start()? Initiate alarm in dev start function to do it later And entering into race again?

What race? All the PMD needs to to is trigger an event by registering one
with immediate effect, it won't make any difference to the application. If
it performs racy control operations from the handler, it's never been a
PMD's problem.

Anyway I just provided this idea as a counter argument, it's not really
needed; relying on rte_eth_dev_is_removed() is the safest approach in my
opinion.

I think it is not worth the effort and this patch is doing the right thing(also some other PMDs) .

Well, it's probably too late to revert this patch for 18.02 since one would
also have to come up with the proper fix for fail-safe, however that doesn't
mean breaking things and expecting applications to deal with it because it's
never been properly documented is OK either.

I'm post-NACKing this patch, it will have to be reverted and a fix submitted
for the upcoming 18.02 stable branch. In the meantime, it's not a fix for
mlx4 and as such must not be backported.

-- 
Adrien Mazarguil
6WIND

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help