Re: [dpdk-stable] [PATCH v4] net/bonding: per-slave intermediate rx ring

From: Chas Williams <3chas3@gmail.com>
Date: 2018-08-23 15:51:06

On Thu, Aug 23, 2018 at 3:28 AM Matan Azrad [off-list ref] wrote:

Hi

From: Eric Kinzie

quoted

On Wed Aug 22 11:42:37 +0000 2018, Matan Azrad wrote:

quoted

Hi Luca

From: Luca Boccassi

quoted

On Wed, 2018-08-22 at 07:09 +0000, Matan Azrad wrote:

quoted

Hi Chas

From: Chas Williams

quoted

On Tue, Aug 21, 2018 at 11:43 AM Matan Azrad
<mailto:matan@mellanox .com> wrote:
Hi Chas

From: Chas Williams

quoted

On Tue, Aug 21, 2018 at 6:56 AM Matan Azrad
<mailto:matan@mellano x.com>
wrote:
Hi

From: Chas Williams

quoted

This will need to be implemented for some of the other RX
burst methods at some point for other modes to see this
performance improvement (with the exception of

active-backup).

quoted

Yes, I think it should be done at least to
bond_ethdev_rx_burst_8023ad_fast_queue (should be easy) for

now.

quoted

There is some duplicated code between the various RX paths.
I would like to eliminate that as much as possible, so I was
going to give that some thought first.

There is no reason to stay this function as is while its twin is
changed.

Unfortunately, this is all the patch I have at this time.

quoted

On Thu, Aug 16, 2018 at 9:32 AM Luca Boccassi
<mailto:bluca@deb ian.org> wrote:

quoted

During bond 802.3ad receive, a burst of packets is fetched
from each slave into a local array and appended to
per-slave ring buffer.
Packets are taken from the head of the ring buffer and
returned to the caller.  The number of mbufs provided to
each slave is sufficient to meet the requirements of the
ixgbe vector receive.

Luca,

Can you explain these requirements of ixgbe?

The ixgbe (and some other Intel PMDs) have vectorized RX
routines that are more efficient (if not faster) taking
advantage of some advanced CPU instructions.  I think you need
to be receiving at least 32 packets or more.

So, why to do it in bond which is a generic driver for all the
vendors PMDs, If for ixgbe and other Intel nics it is better you
can force those PMDs to receive always 32 packets and to manage
a ring by themselves.

The drawback of the ring is some additional latency on the
receive path.
In testing, the additional latency hasn't been an issue for

bonding.

quoted

When bonding does processing slower it may be a bottleneck for the
packet processing for some application.

quoted

The bonding PMD has a fair bit of overhead associated with the
RX and TX path calculations.  Most applications can just arrange
to call the RX path with a sufficiently large receive.  Bonding
can't do this.

I didn't talk on application I talked on the slave PMDs, The slave
PMD can manage a ring by itself if it helps for its own

performance.

quoted

The bonding should not be oriented to specific PMDs.

The issue though is that the performance problem is not with the
individual PMDs - it's with bonding. There were no reports regarding
the individual PMDs.
This comes from reports from customers from real world production
deployments - the issue of bonding being too slow was raised multiple

times.

quoted

This patch addresses those issues, again in production deployments,
where it's been used for years, to users and customers satisfaction.

From Chas I understood that using burst of 32 helps for some slave PMDs

performance which makes sense.

quoted

I can't understand how the extra copy phases improves the bonding

itself

quoted

performance:

quoted

You added 2 copy phases in the bonding RX function:
1. Get packets from the slave to a local array.
2. Copy packet pointers from a local array to the ring array.
3. Copy packet pointers from the ring array to the application array.

Each packet arriving to the application must pass the above 3

phases(in a

quoted

specific call or in previous calls).

quoted

Without this patch we have only -
Get packets from the slave to the application array.

Can you explain how the extra copies improves the bonding performance?

Looks like it improves the slaves PMDs and because of that the bonding

PMD performance becomes better.

I'm not sure that adding more buffer management to the vector PMDs will
improve the drivers' performance; it's just that calling the rx function

in such

quoted

a way that it returns no data wastes time.

Sorry, I don't fully understand what you said here, please rephrase.

quoted

 The bonding driver is already an exercise in buffer management so

adding this layer of indirection here makes

quoted

sense in my opinion, as does hiding the details of the consituent

interfaces where possible.

Can you explain how this new buffer management with the extra pointer
copies improves the bonding itself performance?
Looks really strange to me.

Because rings are generally quite efficient.

Bonding is in a middle ground between application and PMD.  What bonding
is doing, may not improve all applications.  If using a ring to buffer
the vectorized receive routines, improves your particular application,
that's great.  However, I don't think I can say that it would help all
applications.  As you point out, there is overhead associated with
a ring.

Bonding's receive burst isn't especially efficient (in mode 4).  Bonding
benefits from being able to read as much as possible (within limits of
course, large reads would blow out caches) from each slave.  It can't
return all that data though because applications tend to use the
burst size that would be efficient for a typical PMD.  An alternative
might be to ask bonding applications to simply issue larger reads for
certain modes.  That's probably not as easy as it sounds given the
way that the burst length effects multiplexing.

Another solution might be just alternatively poll the individual
slaves on each rx burst.  But that means you need to poll at a
faster rate.  Depending on your application, you might not be
able to do that.  We can avoid this scheduling overhead by just
doing the extra reads in bonding and buffering in a ring.

Since bonding is going to be buffering everything in a ring, it makes
sense to just read as much as is as efficiently possible.  For the
Intel adapters, this means using a read big enough trigger the vectorized
versions.

quoted

So I'd like to share this improvement rather than keeping it private
- because I'm nice that way :-P

quoted

Did you check for other vendor PMDs? It may hurt performance
there..

I don't know, but I suspect probably not.  For the most part
you are typically reading almost up to the vector requirement.
But if one slave has just a single packet, then you can't
vectorize on the next slave.

I don't think that the ring overhead is better for PMDs which
are not using the vectorized instructions.

The non-vectorized PMDs are usually quite slow.  The additional
overhead doesn't make a difference in their performance.

We should not do things worse than they are.

There were no reports that this made things worse. The feedback from
production was that it made things better.

Yes, It may be good for specific slaves drivers but hurt another
slaves drivers, So maybe it should stay private to specific costumers

using

quoted

specific nics.

quoted

Again, I can understand how this patch improves performance of some
PMDs therefore I think the bonding is not the place to add it but maybe

some PMDs.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help