Thread (22 messages), 5 authors, 2021-07-13

Re: [dpdk-dev] [PATCH v1] net/i40e: remove the SMP barrier in HW scanning func

From: Zhang, Qi Z <hidden>
Date: 2021-06-16 13:29:36

Hi
-----Original Message-----
From: Honnappa Nagarahalli <redacted>
Sent: Tuesday, June 8, 2021 5:36 AM
To: Zhang, Qi Z <redacted>; Joyce Kong <redacted>;
Xing, Beilei [off-list ref]; Ruifeng Wang [off-list ref]
Cc: dev@dpdk.org; nd <redacted>; Honnappa Nagarahalli
[off-list ref]; nd [off-list ref]
Subject: RE: [PATCH v1] net/i40e: remove the SMP barrier in HW scanning
func

<snip>
Add the logic to determine how many DD bits have been set for
contiguous packets, for removing the SMP barrier while reading descs.
I didn't understand this.
The current logic already guarantees that the DD bits read out belong to
contiguous packets, as it reads the Rx descriptors from the ring in
reverse order.
Qi, the comments in the code mention that there is a race condition
if the descriptors are not read in the reverse order. But, they do
not mention what the race condition is and how it can occur.
Appreciate if you could explain that.
The race condition happens between the NIC and the CPU: if the DD bits
are written and read in the same order, there might be a hole (e.g. 1011).
With the reverse read order, we make sure there is no more "1" after the
first "0". As the read addresses are declared volatile, the compiler will
not reorder them.
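
To make the reverse read order concrete, here is a minimal sketch. It is not the driver code: struct rx_desc, LOOK_AHEAD and DD_BIT are simplified stand-ins assumed for illustration (the driver uses I40E_LOOK_AHEAD and the status field of the write-back descriptor).

```c
#include <stdint.h>

#define LOOK_AHEAD 8     /* hypothetical; stands in for I40E_LOOK_AHEAD */
#define DD_BIT     0x1u  /* hypothetical; stands in for the DD status bit */

/* Hypothetical, simplified descriptor: only the status word matters here. */
struct rx_desc {
	uint64_t status;
};

/*
 * Read the status words from the last descriptor to the first.
 * The volatile qualifier keeps the compiler from reordering or merging
 * the loads; it does not constrain the CPU.
 */
static void
read_status_reverse(const volatile struct rx_desc *ring,
		    uint32_t s[LOOK_AHEAD])
{
	for (int j = LOOK_AHEAD - 1; j >= 0; j--)
		s[j] = (uint32_t)(ring[j].status & DD_BIT);
}

/* Sum the DD bits; this equals the packet count only if there is no hole. */
static int
count_dd(const uint32_t s[LOOK_AHEAD])
{
	int nb_dd = 0;

	for (int j = 0; j < LOOK_AHEAD; j++)
		nb_dd += (int)s[j];
	return nb_dd;
}
```

If the NIC completes descriptors in order and the loads are observed in program order, a descriptor read later (lower index) can only be "more done" than one read earlier, so the snapshot has no "1" after a "0".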
My understanding is that

1) the NIC will write an entire cache line of descriptors to memory "atomically"
(i.e. the entire cache line is visible to the CPU at once) if there are enough
descriptors ready to fill one cache line.
2) But, if there are not enough descriptors ready (because for ex: there is not
enough traffic), then it might write partial cache lines.
Yes, for example, a cache line contains four 16-byte descriptors, and it is possible to get 1 1 1 0 for the DD bits at some moment.
Please correct me if I am wrong.

For #1, I do not think it matters if we read the descriptors in reverse order or
not as the cache line is written atomically.
I think the case below may happen if we don't read in reverse order.

1. The CPU gets the first cache line as 1 1 1 0 in a loop.
2. New packets arrive, and the NIC appends the last 1 to the first cache line and writes a new cache line with 1 1 1 1.
3. The CPU continues with the new cache line, 1 1 1 1, in the same loop, but the last 1 of the first cache line is missed, so it finally gets 1 1 1 0 1 1 1 1.
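
The three steps above can be replayed with a small simulation. This is only a model under stated assumptions: the ring is reduced to an array of DD flags, the NIC update is modelled as a switch from a "before" image to an "after" image partway through the scan, and all names (scan, has_hole, NB_DESC) are hypothetical.

```c
#include <stdint.h>

#define NB_DESC 8  /* two cache lines of four descriptors each */

/*
 * Scan the DD flags in forward or reverse order. The simulated NIC
 * completes more descriptors after the CPU has done
 * `reads_before_update` reads: from then on the CPU sees `after`.
 */
static void
scan(const uint8_t before[NB_DESC], const uint8_t after[NB_DESC],
     int reverse, int reads_before_update, uint8_t seen[NB_DESC])
{
	const uint8_t *ring = before;

	for (int n = 0; n < NB_DESC; n++) {
		int j = reverse ? NB_DESC - 1 - n : n;

		if (n == reads_before_update)
			ring = after;
		seen[j] = ring[j];
	}
}

/* Return 1 if a set DD flag follows a clear one (a hole), else 0. */
static int
has_hole(const uint8_t seen[NB_DESC])
{
	int zero_seen = 0;

	for (int j = 0; j < NB_DESC; j++) {
		if (!seen[j])
			zero_seen = 1;
		else if (zero_seen)
			return 1;
	}
	return 0;
}
```

With before = 1 1 1 0 0 0 0 0, after = 1 1 1 1 1 1 1 1 and the update landing after four reads, the forward scan sees 1 1 1 0 1 1 1 1 (a hole), while the reverse scan sees 1 1 1 1 0 0 0 0.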

For #1, if we read in reverse order, does it make sense to not check the DD bits
of descriptors that are earlier in the order once we encounter a descriptor that
has its DD bit set? This is because NIC updates the descriptors in order.
I think the answer is yes: when we meet the first DD bit, we should be able to calculate the exact number based on the index, but I am not sure how much performance gain there would be.
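
A rough sketch of that idea, not taken from the patch: scan from the tail and stop at the first descriptor whose DD bit is set. It assumes the NIC completes descriptors in order and that the loads are observed in program order (on Arm this would still need ordering guarantees). The names count_dd_from_tail, LOOK_AHEAD and DD_BIT are hypothetical.

```c
#include <stdint.h>

#define LOOK_AHEAD 8     /* hypothetical; stands in for I40E_LOOK_AHEAD */
#define DD_BIT     0x1u  /* hypothetical; stands in for the DD status bit */

/*
 * Walk the status words from the last to the first. Under the in-order
 * completion assumption, the first set DD bit found at index j means
 * descriptors 0..j are all done, so the count is j + 1.
 */
static int
count_dd_from_tail(const volatile uint64_t *status)
{
	for (int j = LOOK_AHEAD - 1; j >= 0; j--) {
		if (status[j] & DD_BIT)
			return j + 1;
	}
	return 0;
}
```

This saves the remaining loads of the DD bits only; the descriptors would still have to be read later to process the packets, so the gain may be small.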

On x86, the reads are not re-ordered (though the compiler can
re-order). On ARM, the reads can get re-ordered and hence the
barriers are required. In order to avoid the barriers, we are trying
to process only those descriptors whose DD bits are set such that
they are contiguous. i.e. if the DD bits are 1011, we process only the first
descriptor.
Ok, I see. Thanks for the explanation.
At this moment, I would prefer not to change the behavior on x86, so a
compile option for Arm can be added. In the future, if we observe no
performance impact on x86 either, we can consider removing it. What do
you think?
I am ok with this approach.
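
One possible shape for such a compile-time split, as a sketch only. It assumes the DPDK build macros RTE_ARCH_ARM and RTE_ARCH_ARM64 are used as the switch; count_dd_arch, LOOK_AHEAD and DD_BIT are hypothetical names, and the existing rte_smp_rmb() is assumed to stay in front of the non-Arm path.

```c
#include <stdint.h>

#define LOOK_AHEAD 8     /* hypothetical; stands in for I40E_LOOK_AHEAD */
#define DD_BIT     0x1u  /* hypothetical; stands in for the DD status bit */

/* Count completed descriptors from a snapshot of the status words. */
static int
count_dd_arch(const uint32_t s[LOOK_AHEAD])
{
	int nb_dd = 0;

#if defined(RTE_ARCH_ARM) || defined(RTE_ARCH_ARM64)
	/* Arm: count only the leading run of set DD bits, no barrier. */
	for (int j = 0; j < LOOK_AHEAD; j++) {
		if (!(s[j] & DD_BIT))
			break;
		nb_dd++;
	}
#else
	/* Other targets: unchanged behavior, sum all DD bits. */
	for (int j = 0; j < LOOK_AHEAD; j++)
		nb_dd += (int)(s[j] & DD_BIT);
#endif
	return nb_dd;
}
```

The two paths return the same count whenever the snapshot has no hole; they differ only for a pattern such as 1011, where the Arm path reports 1.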
So I didn't see any new logic being added. Could you describe the
purpose of this patch more clearly?
Signed-off-by: Joyce Kong <redacted>
Reviewed-by: Ruifeng Wang <redacted>
---
 drivers/net/i40e/i40e_rxtx.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 6c58decec..410a81f30 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -452,7 +452,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
 	uint16_t pkt_len;
 	uint64_t qword1;
 	uint32_t rx_status;
-	int32_t s[I40E_LOOK_AHEAD], nb_dd;
+	int32_t s[I40E_LOOK_AHEAD], var, nb_dd;
 	int32_t i, j, nb_rx = 0;
 	uint64_t pkt_flags;
 	uint32_t *ptype_tbl = rxq->vsi->adapter->ptype_tbl;
@@ -482,11 +482,14 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
 					I40E_RXD_QW1_STATUS_SHIFT;
 		}

-		rte_smp_rmb();
Is there any performance gain from removing this? And it does not need
to be combined with the change below, right?
-
 		/* Compute how many status bits were set */
-		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++)
-			nb_dd += s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+		for (j = 0, nb_dd = 0; j < I40E_LOOK_AHEAD; j++) {
+			var = s[j] & (1 << I40E_RX_DESC_STATUS_DD_SHIFT);
+			if (var)
+				nb_dd += 1;
+			else
+				break;
+		}

 		nb_rx += nb_dd;

--
2.17.1
  