Thread: 16 messages, 3 authors, 2021-07-01

Re: [dpdk-dev] [PATCH v1 1/2] net/i40e: improve performance for scalar Tx

From: Xing, Beilei <hidden>
Date: 2021-06-28 02:27:25

-----Original Message-----
From: Feifei Wang <redacted>
Sent: Friday, June 25, 2021 5:40 PM
To: Xing, Beilei <redacted>
Cc: dev@dpdk.org; nd <redacted>; Ruifeng Wang
[off-list ref]; nd [off-list ref]; nd [off-list ref]
Subject: Re: [PATCH v1 1/2] net/i40e: improve performance for scalar Tx

<snip>
int n = txq->tx_rs_thresh;
int32_t i = 0, j = 0;
const int32_t k = RTE_ALIGN_FLOOR(n, RTE_I40E_TX_MAX_FREE_BUF_SZ);
const int32_t m = n % RTE_I40E_TX_MAX_FREE_BUF_SZ;
struct rte_mbuf *free[RTE_I40E_TX_MAX_FREE_BUF_SZ];

For FAST_FREE_MODE:

if (k) {
	/* Return every full chunk of RTE_I40E_TX_MAX_FREE_BUF_SZ mbufs. */
	for (j = 0; j != k; j += RTE_I40E_TX_MAX_FREE_BUF_SZ) {
		for (i = 0; i < RTE_I40E_TX_MAX_FREE_BUF_SZ; ++i, ++txep) {
			free[i] = txep->mbuf;
			txep->mbuf = NULL;
		}
		rte_mempool_put_bulk(free[0]->pool, (void **)free,
				RTE_I40E_TX_MAX_FREE_BUF_SZ);
	}
}

if (m) {
	/* Return the remaining m mbufs; skipped when n is a multiple of the chunk size. */
	for (i = 0; i < m; ++i, ++txep) {
		free[i] = txep->mbuf;
		txep->mbuf = NULL;
	}
	rte_mempool_put_bulk(free[0]->pool, (void **)free, m);
}
}
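The chunked free above can be exercised outside the driver. The following is only a minimal sketch under stated assumptions: `struct pool`, `struct mbuf`, `struct tx_entry`, `pool_put_bulk()`, `free_chunked()` and `count_bulk_calls()` are hypothetical stand-ins (not DPDK APIs), every mbuf is assumed to come from one pool, and the loop runs until `j` reaches `k` so that every full chunk is returned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FREE_BUF_SZ 64   /* stand-in for RTE_I40E_TX_MAX_FREE_BUF_SZ */
#define RING_SZ 1024         /* ring size taken from the test config */

/* Hypothetical stand-ins for rte_mempool, rte_mbuf and the Tx entry. */
struct pool { unsigned calls; unsigned objs; };
struct mbuf { struct pool *pool; };
struct tx_entry { struct mbuf *mbuf; };

/* Stand-in for rte_mempool_put_bulk(): it only records what was returned. */
static void pool_put_bulk(struct pool *p, void **objs, unsigned n)
{
	(void)objs;
	p->calls++;
	p->objs += n;
}

/* Return n mbufs to their pool in chunks of at most MAX_FREE_BUF_SZ. */
static void free_chunked(struct tx_entry *txep, int n)
{
	struct mbuf *free_buf[MAX_FREE_BUF_SZ];
	const int32_t m = n % MAX_FREE_BUF_SZ;  /* leftover mbufs */
	const int32_t k = n - m;                /* n rounded down to a chunk multiple */
	int32_t i, j;

	for (j = 0; j != k; j += MAX_FREE_BUF_SZ) {
		for (i = 0; i < MAX_FREE_BUF_SZ; ++i, ++txep) {
			free_buf[i] = txep->mbuf;
			txep->mbuf = NULL;
		}
		pool_put_bulk(free_buf[0]->pool, (void **)free_buf, MAX_FREE_BUF_SZ);
	}
	if (m) {
		for (i = 0; i < m; ++i, ++txep) {
			free_buf[i] = txep->mbuf;
			txep->mbuf = NULL;
		}
		pool_put_bulk(free_buf[0]->pool, (void **)free_buf, m);
	}
}

/* Fill n entries from one pool, free them, and report the bulk-call count. */
static unsigned count_bulk_calls(int n, unsigned *objs_out)
{
	static struct mbuf mbufs[RING_SZ];
	static struct tx_entry ring[RING_SZ];
	struct pool pool = { 0, 0 };
	int i;

	assert(n > 0 && n <= RING_SZ);
	for (i = 0; i < n; i++) {
		mbufs[i].pool = &pool;
		ring[i].mbuf = &mbufs[i];
	}
	free_chunked(ring, n);
	*objs_out = pool.objs;
	return pool.calls;
}
```

With a chunk size of 64, a threshold of 32 takes only the remainder path (one bulk call), 256 takes only the full-chunk path (four calls), and a value such as 100 takes both (two calls).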
There seems to be no logical problem, but the code looks heavy due to the for loops.
Did you run a performance test with this change when tx_rs_thresh >
RTE_I40E_TX_MAX_FREE_BUF_SZ?
Sorry for my late reply. It took me some time to test this path, and
the following are my test results:

First, I came up with another way to fix this bug and compared it with
the "loop" approach (where the size of 'free' is 64).
The alternative is to set the size of 'free' to a large constant. We know:
tx_rs_thresh < ring_desc_size < I40E_MAX_RING_DESC(4096), so we can
directly define it as:
struct rte_mbuf *free[I40E_MAX_RING_DESC];
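To make the two options concrete, the sketch below compares the size of the 'free' array in each case and counts how many bulk calls the loop variant needs. It is only an illustration: `struct mbuf`, `chunk_array_bytes()`, `large_array_bytes()` and `chunks_needed()` are hypothetical names, and the constants are the values quoted in this thread (64 and 4096).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FREE_BUF_SZ 64    /* value reported for RTE_I40E_TX_MAX_FREE_BUF_SZ */
#define MAX_RING_DESC   4096  /* value given for I40E_MAX_RING_DESC */

struct mbuf;  /* opaque stand-in for struct rte_mbuf */

/* Size of the pointer array used by the "loop" variant. */
static size_t chunk_array_bytes(void)
{
	return sizeof(struct mbuf *[MAX_FREE_BUF_SZ]);
}

/* Size of the pointer array used by the "large constant" variant. */
static size_t large_array_bytes(void)
{
	return sizeof(struct mbuf *[MAX_RING_DESC]);
}

/* Bulk calls the loop variant issues for n mbufs (full chunks plus remainder). */
static size_t chunks_needed(size_t n)
{
	assert(n <= MAX_RING_DESC);
	return (n + MAX_FREE_BUF_SZ - 1) / MAX_FREE_BUF_SZ;
}
```

The large array is 64 times the size of the chunk array (32 KiB instead of 512 bytes with 8-byte pointers), while the loop variant trades that for up to eight bulk calls at tx_rs_thresh = 512.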

[1] Test config:
MRR test: two ports & bi-directional flows & one core
RX API: i40e_recv_pkts_bulk_alloc
TX API: i40e_xmit_pkts_simple
ring_descs_size: 1024
RTE_I40E_TX_MAX_FREE_BUF_SZ: 64

[2] Scheme:
tx_rs_thresh = I40E_DEFAULT_TX_RSBIT_THRESH
tx_free_thresh = I40E_DEFAULT_TX_FREE_THRESH
tx_rs_thresh <= tx_free_thresh < nb_tx_desc
So we change the value of 'tx_rs_thresh' by adjusting
I40E_DEFAULT_TX_RSBIT_THRESH.
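The ordering in the scheme can be written as a small check. This is a sketch only: `tx_thresholds_ordered()` is a hypothetical helper, and it encodes just the relation stated here, not any further constraints the driver may enforce.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Checks only the ordering stated in this thread:
 * tx_rs_thresh <= tx_free_thresh < nb_tx_desc. */
static bool tx_thresholds_ordered(uint16_t tx_rs_thresh,
				  uint16_t tx_free_thresh,
				  uint16_t nb_tx_desc)
{
	assert(nb_tx_desc > 0);
	return tx_rs_thresh <= tx_free_thresh && tx_free_thresh < nb_tx_desc;
}
```

The three tested pairs (32/32, 256/256, 512/512) all satisfy this ordering with a ring of 1024 descriptors.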

[3] Test results (performance improvement):
In X86:
tx_rs_thresh/tx_free_thresh                 32/32     256/256    512/512
1. mempool_put (base)                       0         0          0
2. mempool_put_bulk: loop                   +4.7%     +5.6%      +7.0%
3. mempool_put_bulk: large size for free    +3.8%     +2.3%      -2.0%
   (free[I40E_MAX_RING_DESC])

In Arm:
N1SDP:
tx_rs_thresh/tx_free_thresh                 32/32     256/256    512/512
1. mempool_put (base)                       0         0          0
2. mempool_put_bulk: loop                   +7.9%     +9.1%      +2.9%
3. mempool_put_bulk: large size for free    +7.1%     +8.7%      +3.4%
   (free[I40E_MAX_RING_DESC])

Thunderx2:
tx_rs_thresh/tx_free_thresh                 32/32     256/256    512/512
1. mempool_put (base)                       0         0          0
2. mempool_put_bulk: loop                   +7.6%     +10.5%     +7.6%
3. mempool_put_bulk: large size for free    +1.7%     +18.4%     +10.2%
   (free[I40E_MAX_RING_DESC])

As a result, I feel that 'loop' is probably the better option, and according
to the test it does not seem very heavy.
What is your view? I look forward to your reply.
Thanks a lot.
Thanks for your patch and test.
It looks OK to me, please send V2.