Thread (16 messages) 16 messages, 3 authors, 2021-07-13

Re: [dpdk-dev] [PATCH 1/2] net/mlx5: remove redundant operations

From: Ruifeng Wang <hidden>
Date: 2021-07-07 08:00:18

-----Original Message-----
From: Slava Ovsiienko <redacted>
Sent: Monday, July 5, 2021 6:02 PM
To: Ruifeng Wang <redacted>; Raslan Darawsheh
[off-list ref]; Matan Azrad [off-list ref]; Shahaf Shuler
[off-list ref]
Cc: dev@dpdk.org; jerinj@marvell.com; nd <redacted>; Honnappa
Nagarahalli [off-list ref]; stable@dpdk.org; nd
[off-list ref]
Subject: RE: [PATCH 1/2] net/mlx5: remove redundant operations

Hi, Ruifeng

The invalid_mask is used to set error flags and calculate the statistics.
So, all the CQEs the first one with error or invalid status should be masked
out (and the CQEs after that).
Now I understand it. What I was missing is inconsecutive mask bits.
Thanks for your patience.
I'll update in next version.
quoted hunk ↗ jump to hunk
IMO, what we could improve (apply just the part of the patch below):
quoted
quoted
quoted
quoted
index 2234fbe6b2..98a75b09c6 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
@@ -768,18 +768,11 @@  rxq_cq_process_v(struct mlx5_rxq_data *rxq,
volatile struct mlx5_cqe *cq,
 					  comp_mask), 0)) /
 					  (sizeof(uint16_t) * 8);
 		/* D.6 mask out entries after the compressed CQE. */
-		mask = vcreate_u16(comp_idx <
MLX5_VPMD_DESCS_PER_LOOP ?
-				   -1UL >> (comp_idx * sizeof(uint16_t) * 8) :
-				   0);
-		invalid_mask = vorr_u16(invalid_mask, mask);
+		invalid_mask = vorr_u16(invalid_mask, comp_mask);
 		/* D.7 count non-compressed valid CQEs. */
 		n = __builtin_clzl(vget_lane_u64(vreinterpret_u64_u16(
 				   invalid_mask), 0)) / (sizeof(uint16_t) * 8);
 		nocmp_n += n;
<<<<

And that's it. The rest of the patch:
quoted
quoted
quoted
quoted
-		/* D.2 get the final invalid mask. */
-		mask = vcreate_u16(n < MLX5_VPMD_DESCS_PER_LOOP ?
-				   -1UL >> (n * sizeof(uint16_t) * 8) : 0);
-		invalid_mask = vorr_u16(invalid_mask, mask);
<<<<
Should not be applied, otherwise the following might be affected:

opcode = vbic_u16(opcode, invalid_mask); ...
opcode = vbic_u16(opcode, invalid_mask);

With best regards,
Slava
quoted
-----Original Message-----
From: Ruifeng Wang <redacted>
Sent: Friday, July 2, 2021 13:30
To: Slava Ovsiienko <redacted>; Raslan Darawsheh
[off-list ref]; Matan Azrad [off-list ref]; Shahaf Shuler
[off-list ref]
Cc: dev@dpdk.org; jerinj@marvell.com; nd <redacted>; Honnappa
Nagarahalli [off-list ref]; stable@dpdk.org; nd
[off-list ref]
Subject: RE: [PATCH 1/2] net/mlx5: remove redundant operations
quoted
-----Original Message-----
From: Slava Ovsiienko <redacted>
Sent: Friday, July 2, 2021 4:13 PM
To: Ruifeng Wang <redacted>; Raslan Darawsheh
[off-list ref]; Matan Azrad [off-list ref]; Shahaf
Shuler
quoted
quoted
[off-list ref]
Cc: dev@dpdk.org; jerinj@marvell.com; nd <redacted>; Honnappa
Nagarahalli [off-list ref]; stable@dpdk.org
Subject: RE: [PATCH 1/2] net/mlx5: remove redundant operations

Hi, Ruifeng
Hi, Slava
quoted
quoted
-----Original Message-----
From: Ruifeng Wang <redacted>
Sent: Tuesday, June 1, 2021 11:31
To: Raslan Darawsheh <redacted>; Matan Azrad
[off-list ref]; Shahaf Shuler [off-list ref]; Slava
Ovsiienko [off-list ref]
Cc: dev@dpdk.org; jerinj@marvell.com; nd@arm.com;
honnappa.nagarahalli@arm.com; Ruifeng Wang
[off-list ref];
quoted
quoted
stable@dpdk.org
Subject: [PATCH 1/2] net/mlx5: remove redundant operations

Some operations on mask are redundant and can be removed.
The change yielded 1.6% performance gain on N1SDP.
On ThunderX2, slight performance uplift was also observed.

Fixes: 570acdb1da8a ("net/mlx5: add vectorized Rx/Tx burst for
ARM")
Cc: stable@dpdk.org

Signed-off-by: Ruifeng Wang <redacted>
---
 drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
index 2234fbe6b2..98a75b09c6 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
@@ -768,18 +768,11 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq,
volatile struct mlx5_cqe *cq,
 					  comp_mask), 0)) /
 					  (sizeof(uint16_t) * 8);
 		/* D.6 mask out entries after the compressed CQE. */
-		mask = vcreate_u16(comp_idx <
MLX5_VPMD_DESCS_PER_LOOP ?
-				   -1UL >> (comp_idx * sizeof(uint16_t) * 8) :
-				   0);
-		invalid_mask = vorr_u16(invalid_mask, mask);
+		invalid_mask = vorr_u16(invalid_mask, comp_mask);
Mmmm... I'm not sure we can drop the masking compressed (and
following) CQE skip.
Let's consider the completion scenario (the series of 4 CQEs, each
element is 64B long)

0: normal uncompressed CQE, ownership OK, format uncompressed,
opcode OK, no error
1: compressed CQE, ownership OK, format compressed, opcode OK, no
error
2: miniCQE array, format can be any!!, may be discovered as
ownership OK, format uncompressed, opcode OK, no error
3: miniCQE array, format can be any!!, may be discovered as
ownership OK, format uncompressed, opcode OK, no error
Thanks for your review and explanation about CQE processing details.
I did the change based on the fact that some calculations doesn't
change the data.
So some intermediate calculations were removed.

In the above diff section, result of 'mask' always equals to the
nearest 'comp_mask' that above it.
So I just remoed 'mask' and use 'comp_mask' instead.
quoted
Obviously, we should unconditionally mask out 2 and 3, regardless of
recognized their formats/opcode/error/etc.
I think we can get the diff above and skip diff below:
quoted
 		/* D.7 count non-compressed valid CQEs. */
 		n = __builtin_clzl(vget_lane_u64(vreinterpret_u64_u16(
 				   invalid_mask), 0)) / (sizeof(uint16_t) * 8);
 		nocmp_n += n;
-		/* D.2 get the final invalid mask. */
-		mask = vcreate_u16(n < MLX5_VPMD_DESCS_PER_LOOP ?
-				   -1UL >> (n * sizeof(uint16_t) * 8) : 0);
-		invalid_mask = vorr_u16(invalid_mask, mask);
and get the correct final invalid_mask - all compressed and invalid
CQEs and following ones will be masked out.
This diff section is similar to the previous one.
'mask' always equals to the nearest 'invalid_mask' that above it.
So entire line "invalid_mask = vorr_u16(invalid_mask, mask);" can be
removed.
quoted
Code logic is not changed. But I'm not sure the code change impacts
readability or maintainability that you may concern.

Thanks.
quoted
With best regards,
Slava
  
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help