Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
From: Xueming(Steven) Li <hidden>
Date: 2021-09-29 12:08:52
On Wed, 2021-09-29 at 09:52 +0000, Ananyev, Konstantin wrote:
quoted
-----Original Message----- From: Xueming(Steven) Li <redacted> Sent: Wednesday, September 29, 2021 10:13 AM To: jerinjacobk@gmail.com; Ananyev, Konstantin <redacted> Cc: NBU-Contact-Thomas Monjalon <redacted>; andrew.rybchenko@oktetlabs.ru; dev@dpdk.org; Yigit, Ferruh [off-list ref] Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue On Wed, 2021-09-29 at 00:26 +0000, Ananyev, Konstantin wrote:quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
In current DPDK framework, each RX queue is pre-loaded with mbufs for incoming packets. When number of representors scale out in a switch domain, the memory consumption became significant. Most important, polling all ports leads to high cache miss, high latency and low throughput. This patch introduces shared RX queue. Ports with same configuration in a switch domain could share RX queue set by specifying sharing group. Polling any queue using same shared RX queue receives packets from all member ports. Source port is identified by mbuf->port. Port queue number in a shared group should be identical. Queue index is 1:1 mapped in shared group. Share RX queue is supposed to be polled on same thread. Multiple groups is supported by group ID.Is this offload specific to the representor? If so can this name be changed specifically to representor?Yes, PF and representor in switch domain could take advantage.quoted
If it is for a generic case, how the flow ordering will be maintained?Not quite sure that I understood your question. The control path of is almost same as before, PF and representor port still needed, rte flows not impacted. Queues still needed for each member port, descriptors(mbuf) will be supplied from shared Rx queue in my PMD implementation.My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land intothe samequoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
receive queue, In that case, how the flow order is maintained for respective receive queues.I guess the question is testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq. basically for each packet in rx_burst result, lookup source stream according to mbuf->port, forwarding to target fs. Packets from same source port could be grouped as a small burst to process, this will accelerates the performance if trafficcome fromquoted
quoted
quoted
quoted
quoted
quoted
quoted
limited ports. I'll introduce some common api to do shard rxq forwarding, call it with packets handling callback, so it suites for all forwarding engine. Will sent patches soon.All ports will put the packets in to the same queue (share queue), right? Does this means only single core will poll only, what will happen if there are multiple cores polling, won't it cause problem? And if this requires specific changes in the application, I am not sure about the solution, can't this work in a transparent way to the application?Discussed with Jerin, new API introduced in v3 2/8 that aggregate ports in same group into one new port. Users could schedule polling on the aggregated port instead of all member ports.The v3 still has testpmd changes in fastpath. Right? IMO, For this feature, we should not change fastpath of testpmd application. Instead, testpmd can use aggregated ports probably as separate fwd_engine to show how to use this feature.Good point to discuss :) There are two strategies to polling a shared Rxq: 1. polling each member port All forwarding engines can be reused to work as before. My testpmd patches are efforts towards this direction. Does your PMD support this?Not unfortunately. More than that, every application needs to change to support this model.Both strategies need user application to resolve port ID from mbuf and process accordingly. This one doesn't demand aggregated port, no polling schedule change.I was thinking, mbuf will be updated from driver/aggregator port as when it comes to application.quoted
quoted
quoted
2. polling aggregated port Besides forwarding engine, need more work to to demo it. This is an optional API, not supported by my PMD yet.We are thinking of implementing this PMD when it comes to it, ie. without application change in fastpath logic.Fastpath have to resolve port ID anyway and forwarding according to logic. Forwarding engine need to adapt to support shard Rxq. Fortunately, in testpmd, this can be done with an abstract API. Let's defer part 2 until some PMD really support it and tested, how do you think?We are not planning to use this feature so either way it is OK to me. I leave to ethdev maintainers decide between 1 vs 2. I do have a strong opinion not changing the testpmd basic forward engines for this feature.I would like to keep it simple as fastpath optimized and would like to add a separate Forwarding engine as means to verify this feature.+1 to that. I don't think it a 'common' feature. So separate FWD mode seems like a best choice to me.-1 :) There was some internal requirement from test team, they need to verify all features like packet content, rss, vlan, checksum, rte_flow... to be working based on shared rx queue.Then I suppose you'll need to write really comprehensive fwd-engine to satisfy your test team :) Speaking seriously, I still don't understand why do you need all available fwd-engines to verify this feature. From what I understand, main purpose of your changes to test-pmd: allow to fwd packet though different fwd_stream (TX through different HW queue). In theory, if implemented in generic and extendable way - that might be a useful add-on to tespmd fwd functionality. But current implementation looks very case specific. And as I don't think it is a common case, I don't see much point to pollute basic fwd cases with it. BTW, as a side note, the code below looks bogus to me: +void +forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx, + struct rte_mbuf **pkts_burst, packet_fwd_cb fwd) +{ + uint16_t i, nb_fs_rx = 1, port; + + /* Locate real source fs according to mbuf->port. */ + for (i = 0; i < nb_rx; ++i) { + rte_prefetch0(pkts_burst[i + 1]); you access pkt_burst[] beyond array boundaries, also you ask cpu to prefetch some unknown and possibly invalid address.Sorry I forgot this topic. It's too late to prefetch current packet, so perfetch next is better. Prefetch an invalid address at end of a look doesn't hurt, it's common in DPDK.First of all it is usually never 'OK' to access array beyond its bounds. Second prefetching invalid address *does* hurt performance badly on many CPUs (TLB misses, consumed memory bandwidth etc.). As a reference: https://lwn.net/Articles/444346/ If some existing DPDK code really does that - then I believe it is an issue and has to be addressed. More important - it is really bad attitude to submit bogus code to DPDK community and pretend that it is 'OK'.
Thanks for the link! From instruction spec, "The PREFETCHh instruction is merely a hint and does not affect program behavior." There are 3 choices here: 1: no prefetch. D$ miss will happen on each packet, time cost depends on where data sits(close or far) and burst size. 2: prefetch with loop end check to avoid random address. Pro is free of TLB miss per burst, Cons is "if" instruction per packet. Cost depends on burst size. 3: brute force prefetch, cost is TLB miss, but no addtional instructions per packet. Not sure how random the last address could be in testpmd and how many TLB miss could happen. Based on my expericen of performance optimization, IIRC, option 3 has the best performance. But for this case, result depends on how many sub-burst inside and how sub-burst get processed, maybe callback will flush prefetch data completely or not. So it's hard to get a conclusion, what I said is that the code in PMD driver should have a reason. On the other hand, the latency and throughput saving of this featue on multiple ports is huge, I perfer to down play this prefetch discussion if you agree.
quoted
quoted
quoted
Based on the patch, I believe the impact has been minimized.quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
Overall, is this for optimizing memory for the port represontors? If so can't we have a port representor specific solution, reducing scope can reduce the complexity it brings?quoted
quoted
If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing itsname andquoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
scope.It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.quoted
quoted
quoted
quoted
Signed-off-by: Xueming Li [off-list ref] --- doc/guides/nics/features.rstquoted
11 +++++++++++doc/guides/nics/features/default.iniquoted
1 +doc/guides/prog_guide/switch_representat ion. rst | 10 ++++++++++ lib/ethdev/rte_ethdev.cquoted
1 +lib/ethdev/rte_ethdev.hquoted
7 +++++++5 files changed, 30 insertions(+)diff --git a/doc/guides/nics/features.rstb/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644--- a/doc/guides/nics/features.rst +++ b/doc/guides/nics/features.rst@@ -624,6 +624,17 @@ Supports innerpacket L4 checksum. ``tx_offload_capa,tx_queue_offload_cap a:DE V_TX_OFFLOAD_OUTER_UDP_CKSUM``. +.. _nic_features_shared_rx_queue: + +Shared Rx queue +--------------- + +Supports shared Rx queue for ports in same switch domain. + +* **[uses] rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ` `. +* **[provides] mbuf**: ``mbuf.port``. + + .. _nic_features_packet_type_parsing: Packet type parsing diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini index 754184ddd4..ebeb4c1851 100644 --- a/doc/guides/nics/features/default.ini +++ b/doc/guides/nics/features/default.ini@@ -19,6 +19,7 @@ Free Tx mbuf on demand= Queue start/stop = Runtime Rx queue setup = Runtime Tx queue setup = +Shared Rx queue = Burst mode info = Power mgmt address monitor = MTU update = diff --git a/doc/guides/prog_guide/switch_representa tion .rst b/doc/guides/prog_guide/switch_representa tion .rst index ff6aa91c80..45bf5a3a10 100644 --- a/doc/guides/prog_guide/switch_representa tion .rst +++ b/doc/guides/prog_guide/switch_representa tion .rst@@ -123,6 +123,16 @@ thought as asoftware "patch panel" front-end for applications. .. [1] `Ethernet switch device driver model (switchdev) <https://www.kernel.org/doc/Documentation /net working/switchdev.txtquoted
`_+- Memory usage of representors is huge when number of representor +grows, + because PMD always allocate mbuf for each descriptor of Rx queue. + Polling the large number of ports brings more CPU load, cache +miss and + latency. Shared Rx queue can be used to share Rx queue between +PF and + representors in same switch domain. +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ`` + is present in Rx offloading capability of device info. Setting +the + offloading flag in device Rx mode or Rx queue configuration to +enable + shared Rx queue. Polling any member port of shared Rx queue can +return + packets of all ports in group, port ID is saved in ``mbuf.port``. + Basic SR-IOV ------------diff --git a/lib/ethdev/rte_ethdev.cb/lib/ethdev/rte_ethdev.c index 9d95cd11e1..1361ff759a 100644--- a/lib/ethdev/rte_ethdev.c +++ b/lib/ethdev/rte_ethdev.c@@ -127,6 +127,7 @@ static const struct {RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_ CKSU M), RTE_RX_OFFLOAD_BIT2STR(RSS_HASH), RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER _SPL IT), + RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ), }; #undef RTE_RX_OFFLOAD_BIT2STRdiff --git a/lib/ethdev/rte_ethdev.hb/lib/ethdev/rte_ethdev.h index d2b27c351f..a578c9db9d 100644--- a/lib/ethdev/rte_ethdev.h +++ b/lib/ethdev/rte_ethdev.h@@ -1047,6 +1047,7 @@ structrte_eth_rxconf { uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */ uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */ uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */ + uint32_t shared_group; /**< Shared port group index in + switch domain. */ /** * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags. * Only offloads set on rx_queue_offload_capa or rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf { #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM 0x00040000 #define DEV_RX_OFFLOAD_RSS_HASH 0x00080000 #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000 +/** + * Rx queue is shared among ports in same switch domain to save +memory, + * avoid polling each port. Any port in group can be used to receive packets. + * Real source port number saved in mbuf-quoted
port field.+ */ +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ 0x00200000 #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \ DEV_RX_O FFLO AD_UDP_CKSUM | \ -- 2.25.1