Thread (266 messages) 266 messages, 14 authors, 2021-11-05

Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue

From: Xueming(Steven) Li <hidden>
Date: 2021-09-29 14:54:34

On Wed, 2021-09-29 at 12:35 +0000, Ananyev, Konstantin wrote:
quoted
-----Original Message-----
From: Xueming(Steven) Li <redacted>
Sent: Wednesday, September 29, 2021 1:09 PM
To: jerinjacobk@gmail.com; Ananyev, Konstantin <redacted>
Cc: NBU-Contact-Thomas Monjalon <redacted>; andrew.rybchenko@oktetlabs.ru; dev@dpdk.org; Yigit, Ferruh
[off-list ref]
Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue> 
On Wed, 2021-09-29 at 09:52 +0000, Ananyev, Konstantin wrote:
quoted
quoted
-----Original Message-----
From: Xueming(Steven) Li <redacted>
Sent: Wednesday, September 29, 2021 10:13 AM
To: jerinjacobk@gmail.com; Ananyev, Konstantin <redacted>
Cc: NBU-Contact-Thomas Monjalon <redacted>; andrew.rybchenko@oktetlabs.ru; dev@dpdk.org; Yigit, Ferruh
[off-list ref]
Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue

On Wed, 2021-09-29 at 00:26 +0000, Ananyev, Konstantin wrote:
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
In current DPDK framework, each RX queue
is
pre-loaded with mbufs
for incoming packets. When number of
representors scale out in a
switch domain, the memory consumption
became
significant. Most
important, polling all ports leads to
high
cache miss, high
latency and low throughput.

This patch introduces shared RX queue.
Ports
with same
configuration in a switch domain could
share
RX queue set by specifying sharing group.
Polling any queue using same shared RX
queue
receives packets from
all member ports. Source port is
identified
by mbuf->port.

Port queue number in a shared group
should be
identical. Queue
index is
1:1 mapped in shared group.

Share RX queue is supposed to be polled
on
same thread.

Multiple groups is supported by group ID.
Is this offload specific to the
representor? If
so can this name be changed specifically to
representor?
Yes, PF and representor in switch domain
could
take advantage.
quoted
If it is for a generic case, how the flow
ordering will be maintained?
Not quite sure that I understood your
question.
The control path of is
almost same as before, PF and representor
port
still needed, rte flows not impacted.
Queues still needed for each member port,
descriptors(mbuf) will be
supplied from shared Rx queue in my PMD
implementation.
My question was if create a generic
RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
ethdev receive queues land into
the same
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
receive queue, In that case, how the flow order
is
maintained for respective receive queues.
I guess the question is testpmd forward stream?
The
forwarding logic has to be changed slightly in
case
of shared rxq.
basically for each packet in rx_burst result,
lookup
source stream according to mbuf->port, forwarding
to
target fs.
Packets from same source port could be grouped as
a
small burst to process, this will accelerates the
performance if traffic
come from
quoted
quoted
quoted
quoted
quoted
quoted
quoted
limited ports. I'll introduce some common api to
do
shard rxq forwarding, call it with packets
handling
callback, so it suites for
all forwarding engine. Will sent patches soon.
All ports will put the packets in to the same queue
(share queue), right? Does
this means only single core will poll only, what
will
happen if there are
multiple cores polling, won't it cause problem?

And if this requires specific changes in the
application, I am not sure about
the solution, can't this work in a transparent way
to
the application?
Discussed with Jerin, new API introduced in v3 2/8
that
aggregate ports
in same group into one new port. Users could schedule
polling on the
aggregated port instead of all member ports.
The v3 still has testpmd changes in fastpath. Right?
IMO,
For this
feature, we should not change fastpath of testpmd
application. Instead, testpmd can use aggregated ports
probably as
separate fwd_engine to show how to use this feature.
Good point to discuss :) There are two strategies to
polling
a shared
Rxq:
1. polling each member port
   All forwarding engines can be reused to work as
before.
   My testpmd patches are efforts towards this direction.
   Does your PMD support this?
Not unfortunately. More than that, every application needs
to
change
to support this model.
Both strategies need user application to resolve port ID from
mbuf and
process accordingly.
This one doesn't demand aggregated port, no polling schedule
change.
I was thinking, mbuf will be updated from driver/aggregator
port as
when it
comes to application.
quoted
quoted
quoted
2. polling aggregated port
   Besides forwarding engine, need more work to to demo
it.
   This is an optional API, not supported by my PMD yet.
We are thinking of implementing this PMD when it comes to
it,
ie.
without application change in fastpath
logic.
Fastpath have to resolve port ID anyway and forwarding
according
to
logic. Forwarding engine need to adapt to support shard Rxq.
Fortunately, in testpmd, this can be done with an abstract
API.

Let's defer part 2 until some PMD really support it and
tested,
how do
you think?
We are not planning to use this feature so either way it is OK
to
me.
I leave to ethdev maintainers decide between 1 vs 2.

I do have a strong opinion not changing the testpmd basic
forward
engines
for this feature.I would like to keep it simple as fastpath
optimized and would
like to add a separate Forwarding engine as means to verify
this
feature.
+1 to that.
I don't think it a 'common' feature.
So separate FWD mode seems like a best choice to me.
-1 :)
There was some internal requirement from test team, they need to
verify
all features like packet content, rss, vlan, checksum, rte_flow...
to
be working based on shared rx queue.
Then I suppose you'll need to write really comprehensive fwd-engine
to satisfy your test team :)
Speaking seriously, I still don't understand why do you need all
available fwd-engines to verify this feature.
From what I understand, main purpose of your changes to test-pmd:
allow to fwd packet though different fwd_stream (TX through different
HW queue).
In theory, if implemented in generic and extendable way - that
might be a useful add-on to tespmd fwd functionality.
But current implementation looks very case specific.
And as I don't think it is a common case, I don't see much point to
pollute
basic fwd cases with it.

BTW, as a side note, the code below looks bogus to me:
+void
+forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+		   struct rte_mbuf **pkts_burst, packet_fwd_cb
fwd)
+{
+	uint16_t i, nb_fs_rx = 1, port;
+
+	/* Locate real source fs according to mbuf->port. */
+	for (i = 0; i < nb_rx; ++i) {
+		rte_prefetch0(pkts_burst[i + 1]);

you access pkt_burst[] beyond array boundaries,
also you ask cpu to prefetch some unknown and possibly invalid
address.
Sorry I forgot this topic. It's too late to prefetch current packet, so
perfetch next is better. Prefetch an invalid address at end of a look
doesn't hurt, it's common in DPDK.
First of all it is usually never 'OK' to access array beyond its bounds.
Second prefetching invalid address *does* hurt performance badly on many CPUs
(TLB misses, consumed memory bandwidth etc.).
As a reference:  https://lwn.net/Articles/444346/
If some existing DPDK code really does that - then I believe it is an issue and has to be addressed.
More important - it is really bad attitude to submit bogus code to DPDK community
and pretend that it is 'OK'.
Thanks for the link!
From instruction spec, "The PREFETCHh instruction is merely a hint and
does not affect program behavior."
There are 3 choices here:
1: no prefetch. D$ miss will happen on each packet, time cost depends
on where data sits(close or far) and burst size.
2: prefetch with loop end check to avoid random address. Pro is free of
TLB miss per burst, Cons is "if" instruction per packet. Cost depends
on burst size.
3: brute force prefetch, cost is TLB miss, but no addtional
instructions per packet. Not sure how random the last address could be
in testpmd and how many TLB miss could happen.
There are plenty of standard techniques to avoid that issue while keeping
prefetch() in place.
Probably the easiest one:

for (i=0; i < nb_rx - 1; i++) {
    prefetch(pkt[i+1];
    /* do your stuff with pkt[i[ here */
}

/* do your stuff with pkt[nb_rx - 1]; */
Thanks, will update in next version.
 
quoted
Based on my expericen of performance optimization, IIRC, option 3 has
the best performance. But for this case, result depends on how many
sub-burst inside and how sub-burst get processed, maybe callback will
flush prefetch data completely or not. So it's hard to get a
conclusion, what I said is that the code in PMD driver should have a
reason.

On the other hand, the latency and throughput saving of this featue on
multiple ports is huge, I perfer to down play this prefetch discussion
if you agree.
Honestly, I don't know how else to explain to you that there is a bug in that piece of code.
From my perspective it is a trivial bug, with a trivial fix.
But you simply keep ignoring the arguments.
Till it get fixed and other comments addressed - my vote is NACK for these series.
I don't think we need bogus code in testpmd.
  
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help