Thread (266 messages) 266 messages, 14 authors, 2021-11-05

Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue

From: Xueming(Steven) Li <hidden>
Date: 2021-09-29 13:25:07

On Wed, 2021-09-29 at 10:20 +0000, Ananyev, Konstantin wrote:
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
In current DPDK framework, each RX
queue
is
pre-loaded with mbufs
for incoming packets. When number of
representors scale out in a
switch domain, the memory consumption
became
significant. Most
important, polling all ports leads to
high
cache miss, high
latency and low throughput.

This patch introduces shared RX
queue.
Ports
with same
configuration in a switch domain
could
share
RX queue set by specifying sharing
group.
Polling any queue using same shared
RX
queue
receives packets from
all member ports. Source port is
identified
by mbuf->port.

Port queue number in a shared group
should be
identical. Queue
index is
1:1 mapped in shared group.

Share RX queue is supposed to be
polled
on
same thread.

Multiple groups is supported by group
ID.
Is this offload specific to the
representor? If
so can this name be changed
specifically to
representor?
Yes, PF and representor in switch domain
could
take advantage.
quoted
If it is for a generic case, how the
flow
ordering will be maintained?
Not quite sure that I understood your
question.
The control path of is
almost same as before, PF and representor
port
still needed, rte flows not impacted.
Queues still needed for each member port,
descriptors(mbuf) will be
supplied from shared Rx queue in my PMD
implementation.
My question was if create a generic
RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload,
multiple
ethdev receive queues land into
the same
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
receive queue, In that case, how the flow
order
is
maintained for respective receive queues.
I guess the question is testpmd forward
stream?
The
forwarding logic has to be changed slightly
in
case
of shared rxq.
basically for each packet in rx_burst result,
lookup
source stream according to mbuf->port,
forwarding
to
target fs.
Packets from same source port could be
grouped as
a
small burst to process, this will accelerates
the
performance if traffic
come from
quoted
quoted
quoted
quoted
quoted
quoted
quoted
limited ports. I'll introduce some common api
to
do
shard rxq forwarding, call it with packets
handling
callback, so it suites for
all forwarding engine. Will sent patches
soon.
All ports will put the packets in to the same
queue
(share queue), right? Does
this means only single core will poll only,
what
will
happen if there are
multiple cores polling, won't it cause problem?

And if this requires specific changes in the
application, I am not sure about
the solution, can't this work in a transparent
way
to
the application?
Discussed with Jerin, new API introduced in v3
2/8
that
aggregate ports
in same group into one new port. Users could
schedule
polling on the
aggregated port instead of all member ports.
The v3 still has testpmd changes in fastpath.
Right?
IMO,
For this
feature, we should not change fastpath of testpmd
application. Instead, testpmd can use aggregated
ports
probably as
separate fwd_engine to show how to use this
feature.
Good point to discuss :) There are two strategies to
polling
a shared
Rxq:
1. polling each member port
   All forwarding engines can be reused to work as
before.
   My testpmd patches are efforts towards this
direction.
   Does your PMD support this?
Not unfortunately. More than that, every application
needs
to
change
to support this model.
Both strategies need user application to resolve port ID
from
mbuf and
process accordingly.
This one doesn't demand aggregated port, no polling
schedule
change.
I was thinking, mbuf will be updated from driver/aggregator
port as
when it
comes to application.
quoted
quoted
quoted
2. polling aggregated port
   Besides forwarding engine, need more work to to
demo
it.
   This is an optional API, not supported by my PMD
yet.
We are thinking of implementing this PMD when it comes
to
it,
ie.
without application change in fastpath
logic.
Fastpath have to resolve port ID anyway and forwarding
according
to
logic. Forwarding engine need to adapt to support shard
Rxq.
Fortunately, in testpmd, this can be done with an
abstract
API.

Let's defer part 2 until some PMD really support it and
tested,
how do
you think?
We are not planning to use this feature so either way it is
OK
to
me.
I leave to ethdev maintainers decide between 1 vs 2.

I do have a strong opinion not changing the testpmd basic
forward
engines
for this feature.I would like to keep it simple as fastpath
optimized and would
like to add a separate Forwarding engine as means to verify
this
feature.
+1 to that.
I don't think it a 'common' feature.
So separate FWD mode seems like a best choice to me.
-1 :)
There was some internal requirement from test team, they need
to
verify
all features like packet content, rss, vlan, checksum,
rte_flow...
to
be working based on shared rx queue.
Then I suppose you'll need to write really comprehensive fwd-
engine
to satisfy your test team :)
Speaking seriously, I still don't understand why do you need all
available fwd-engines to verify this feature.
The shared Rxq is low level feature, need to make sure driver
higher
level features working properly. fwd-engines like csum checks input
packet and enable L3/L4 checksum and tunnel offloads accordingly,
other engines do their own feature verification. All test
automation
could be reused with these engines supported seamlessly.
quoted
From what I understand, main purpose of your changes to test-pmd:
allow to fwd packet though different fwd_stream (TX through
different
HW queue).
Yes, each mbuf in burst come from differnt port, testpmd current
fwd-
engines relies heavily on source forwarding stream, that's why the
patch devide burst result mbufs into sub-burst and use orginal fwd-
engine callback to handle. How to handle is not changed.
quoted
In theory, if implemented in generic and extendable way - that
might be a useful add-on to tespmd fwd functionality.
But current implementation looks very case specific.
And as I don't think it is a common case, I don't see much point
to
pollute
basic fwd cases with it.
Shared Rxq is a ethdev feature that impacts how packets get
handled.
It's natural to update forwarding engines to avoid broken.
Why is that?
All it affects the way you RX the packets.
So why *all* FWD engines have to be updated?
People will ask why some FWD engine can't work? 
Let say what specific you are going to test with macswap vs macfwd
mode for that feature?
If people want to test NIC with real switch, or make sure L2 layer not
get corrupted.
I still think one specific FWD engine is enough to cover majority of
test cases.
Yes, rxonly should be sufficient to verify the fundametal, but to
verify csum, timing, need others. Some back2back test system need io
forwarding, real switch depolyment need macswap...

quoted
The new macro is introduced to minimize performance impact, I'm
also
wondering is there an elegant solution :)
I think Jerin suggested a good alternative with eventdev.
As another approach - might be consider to add an RX callback that
will return packets only for one particular port (while keeping
packets
for other ports cached internally).
This and the aggreate port API could be options in ethdev layer later.
It can't be the fundamental due performance loss and potential cache
miss.
As a 'wild' thought - change testpmd fwd logic to allow multiple TX
queues
per fwd_stream and add a function to do TX switching logic.
But that's probably quite a big change that needs a lot of work. 
quoted
Current performance penalty
is one "if unlikely" per burst.
It is not only about performance impact.
It is about keeping test-pmd code simple and maintainable.
quoted
Think in reverse direction, if we don't update fwd-engines here,
all
malfunction when shared rxq enabled, users can't verify driver
features, are you expecting this?
I expect developers not to rewrite whole test-pmd fwd code for each
new ethdev feature.
Here just abstract duplicated code from fwd-engines, an improvement,
keep test-pmd code simple and mantainable.
Specially for the feature that is not widely used.
Based on the huge memory saving, performance and latency gains, it will
be popular to users.

But the test-pmd is not critical to this feature, I'm ok to drop the
fwd-engine support if you agree.
quoted
quoted
BTW, as a side note, the code below looks bogus to me:
+void
+forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+		   struct rte_mbuf **pkts_burst, packet_fwd_cb
fwd)
+{
+	uint16_t i, nb_fs_rx = 1, port;
+
+	/* Locate real source fs according to mbuf->port. */
+	for (i = 0; i < nb_rx; ++i) {
+		rte_prefetch0(pkts_burst[i + 1]);

you access pkt_burst[] beyond array boundaries,
also you ask cpu to prefetch some unknown and possibly invalid
address.
quoted
Based on the patch, I believe the
impact has been minimized.
quoted
quoted

quoted
quoted
quoted
quoted
quoted
quoted
Overall, is this for optimizing memory for the
port
represontors? If so can't we
have a port representor specific solution,
reducing
scope can reduce the
complexity it brings?
quoted
quoted
If this offload is only useful for
representor
case, Can we make this offload specific to
representor the case by changing its
name and
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
scope.
It works for both PF and representors in same
switch
domain, for application like OVS, few changes
to
apply.
quoted
quoted
quoted
quoted
Signed-off-by: Xueming Li
[off-list ref]
---
 doc/guides/nics/features.rst
quoted
11 +++++++++++
 doc/guides/nics/features/default.ini
quoted
 1 +
 doc/guides/prog_guide/switch_represe
ntat
ion.
rst | 10 ++++++++++
 lib/ethdev/rte_ethdev.c
quoted
 1 +
 lib/ethdev/rte_ethdev.h
quoted
 7 +++++++
 5 files changed, 30 insertions(+)

diff --git
a/doc/guides/nics/features.rst
b/doc/guides/nics/features.rst index
a96e12d155..2e2a9b1554 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -624,6 +624,17 @@ Supports inner
packet L4
checksum.
   ``tx_offload_capa,tx_queue_offload
_cap
a:DE
V_TX_OFFLOAD_OUTER_UDP_CKSUM``.


+.. _nic_features_shared_rx_queue:
+
+Shared Rx queue
+---------------
+
+Supports shared Rx queue for ports
in
same
switch domain.
+
+* **[uses]
rte_eth_rxconf,rte_eth_rxmode**:
``offloads:RTE_ETH_RX_OFFLOAD_SHARED_
RXQ`
`.
+* **[provides] mbuf**:
``mbuf.port``.
+
+
 ..
_nic_features_packet_type_parsing:

 Packet type parsing
diff --git
a/doc/guides/nics/features/default.in
i
b/doc/guides/nics/features/default.in
i
index 754184ddd4..ebeb4c1851 100644
---
a/doc/guides/nics/features/default.in
i
+++
b/doc/guides/nics/features/default.in
i
@@ -19,6 +19,7 @@ Free Tx mbuf on
demand
=
 Queue start/stop     =
 Runtime Rx queue setup =
 Runtime Tx queue setup =
+Shared Rx queue      =
 Burst mode info      =
 Power mgmt address monitor =
 MTU update           =
diff --git
a/doc/guides/prog_guide/switch_repres
enta
tion
.rst
b/doc/guides/prog_guide/switch_repres
enta
tion
.rst
index ff6aa91c80..45bf5a3a10 100644
---
a/doc/guides/prog_guide/switch_repres
enta
tion
.rst
+++
b/doc/guides/prog_guide/switch_repres
enta
tion
.rst
@@ -123,6 +123,16 @@ thought as a
software
"patch panel" front-end for
applications.
 .. [1] `Ethernet switch device
driver
model
(switchdev)

<https://www.kernel.org/doc/Documenta
tion
/net
working/switchdev.txt
quoted
`_
+- Memory usage of representors is
huge
when
number of representor
+grows,
+  because PMD always allocate mbuf
for
each
descriptor of Rx queue.
+  Polling the large number of ports
brings
more CPU load, cache
+miss and
+  latency. Shared Rx queue can be
used
to
share Rx queue between
+PF and
+  representors in same switch
domain.
+``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
+  is present in Rx offloading
capability
of
device info. Setting
+the
+  offloading flag in device Rx mode
or
Rx
queue configuration to
+enable
+  shared Rx queue. Polling any
member
port
of shared Rx queue can
+return
+  packets of all ports in group,
port ID
is
saved in ``mbuf.port``.
+
 Basic SR-IOV
 ------------
diff --git a/lib/ethdev/rte_ethdev.c
b/lib/ethdev/rte_ethdev.c
index 9d95cd11e1..1361ff759a 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -127,6 +127,7 @@ static const
struct {
        RTE_RX_OFFLOAD_BIT2STR(OUTER_
UDP_
CKSU
M),
        RTE_RX_OFFLOAD_BIT2STR(RSS_HA
SH),
        RTE_ETH_RX_OFFLOAD_BIT2STR(BU
FFER
_SPL
IT),
+
RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ
),
 };

 #undef RTE_RX_OFFLOAD_BIT2STR
diff --git a/lib/ethdev/rte_ethdev.h
b/lib/ethdev/rte_ethdev.h
index d2b27c351f..a578c9db9d 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1047,6 +1047,7 @@ struct
rte_eth_rxconf {
        uint8_t rx_drop_en; /**< Drop
packets
if no descriptors are available. */
        uint8_t rx_deferred_start;
/**<
Do
not start queue with
rte_eth_dev_start().
*/
        uint16_t rx_nseg; /**< Number
of
descriptions in rx_seg array.
*/
+       uint32_t shared_group; /**<
Shared
port group index in
+ switch domain. */
        /**
         * Per-queue Rx offloads to
be
set
using DEV_RX_OFFLOAD_* flags.
         * Only offloads set on
rx_queue_offload_capa or
rx_offload_capa @@ -1373,6 +1374,12
@@
struct
rte_eth_conf {
#define
DEV_RX_OFFLOAD_OUTER_UDP_CKSUM
0x00040000
 #define DEV_RX_OFFLOAD_RSS_HASH
0x00080000
 #define
RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
0x00100000
+/**
+ * Rx queue is shared among ports in
same
switch domain to save
+memory,
+ * avoid polling each port. Any port
in
group can be used to receive packets.
+ * Real source port number saved in
mbuf-
quoted
port field.
+ */
+#define
RTE_ETH_RX_OFFLOAD_SHARED_RXQ
0x00200000

 #define DEV_RX_OFFLOAD_CHECKSUM
(DEV_RX_OFFLOAD_IPV4_CKSUM | \
                                 DEV_
RX_O
FFLO
AD_UDP_CKSUM | \
--
2.25.1
  
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help