[dpdk-dev] 回复: [RFC PATCH v1 4/6] app/eventdev: add release barriers for... | dpdk-dev

[dpdk-dev] 回复: [RFC PATCH v1 4/6] app/eventdev: add release barriers for pipeline test

From: Feifei Wang <hidden>
Date: 2021-01-08 07:13:09

Hi, Pavan

-----邮件原件-----

发件人: Pavan Nikhilesh Bhagavatula [off-list ref]

发送时间: 2021年1月5日 17:29

收件人: Feifei Wang [off-list ref]; jerinj@marvell.com; Harry

van Haaren [off-list ref]

抄送: dev@dpdk.org; nd [off-list ref]; Honnappa Nagarahalli

[off-list ref]; stable@dpdk.org; Ruifeng Wang

[off-list ref]; nd [off-list ref]

主题: RE: [RFC PATCH v1 4/6] app/eventdev: add release barriers for pipeline

test

Hi Feifei,

quoted
Hi, Pavan

quoted

quoted

Sorry for my late reply and thanks very much for your review.

quoted

quoted
quoted
-----Original Message-----

quoted

From: Pavan Nikhilesh Bhagavatula <pbhagavatula@marvell.com<mailto:pbhagavatula@marvell.com>>

quoted

Sent: 2020年12月22日 18:33

quoted

To: Feifei Wang <Feifei.Wang2@arm.com<mailto:Feifei.Wang2@arm.com>>; jerinj@marvell.com<mailto:jerinj@marvell.com>;

quoted
Harry van

quoted

Haaren <harry.van.haaren@intel.com<mailto:harry.van.haaren@intel.com>>; Pavan Nikhilesh

quoted

<pbhagavatula@caviumnetworks.com<mailto:pbhagavatula@caviumnetworks.com>>

quoted

Cc: dev@dpdk.org<mailto:dev@dpdk.org>; nd <nd@arm.com<mailto:nd@arm.com>>; Honnappa Nagarahalli

quoted

<Honnappa.Nagarahalli@arm.com<mailto:Honnappa.Nagarahalli@arm.com>>; stable@dpdk.org<mailto:stable@dpdk.org>; Phil Yang

quoted

<Phil.Yang@arm.com<mailto:Phil.Yang@arm.com>>

quoted

Subject: RE: [RFC PATCH v1 4/6] app/eventdev: add release barriers

quoted
for

quoted
quoted
pipeline test

quoted
quoted

quoted

Add release barriers before updating the processed packets for

quoted
worker

quoted

lcores to ensure the worker lcore has really finished data

quoted

processing and then it can update the processed packets number.

quoted
quoted
quoted

quoted
quoted

quoted

I believe we can live with minor inaccuracies in stats being

quoted
quoted
presented

quoted
as

quoted

atomics are pretty heavy when scheduler is limited to burst size as 1.

quoted
quoted

quoted

One option is to move it before a pipeline operation

quoted

(pipeline_event_tx,

quoted

pipeline_fwd_event etc.) as they imply implicit release barrier (as

quoted
quoted
all

quoted
the

quoted

changes done to the event should be visible to the next core).

quoted

quoted

If I understand correctly, your meaning is that move release barriers

quoted

before pipeline_event_tx or pipeline_fwd_event. This can ensure the

quoted

event has been processed before the next core begins to tx/fwd. For

quoted
example:

What I meant was event APIs such as `rte_event_enqueue_burst`,

`rte_event_eth_tx_adapter_enqueue`

act as an implicit release barrier and the API `rte_event_dequeue_burst` act

as an implicit acquire barrier.

Since, pipeline_* test starts with a dequeue() and ends with an enqueue() I

don’t believe we need barriers in Between.



Sorry for my misunderstanding this. And I agree with you that no barriers are

needed between dequeue and enqueue.



Now, let's go back to the beginning. Actually with this patch, our barrier is mainly

for the synchronous variable " w->processed_pkts ". As we all know, the event is firstly

dequeued and then enqueued, after this, the event can be treated as the processed event

and included in the statistics("w->processed_pkts++").



Thus, we add a release barrier before " w->processed_pkts++" is to prevent this operation

being executed ahead of time. For example:

dequeue  ->  w->processed_pkts++  ->  enqueue

This cause that the worker doesn't actually finish this event processing, but the event is treated

as the processed one and included in the statistics.



______________________________________________________________________________



By the way, I have two other questions about pipeline process test in "test_pipeline_queue".

1. when do we start counting processed events (w->processed_pkts)?

For the fwd mode (internal_port = false), when we choose single stage, application increments

the number events processed after "pipeline_event enqueue". However, when we choose multiple

stage, application increments the number events processed before "pipleline_event_enqueue". So,

maybe we can unify this. For example of multiple stage:



                                if (cq_id == last_queue) {

                                                ev.queue_id = tx_queue[ev.mbuf->port];

                                                rte_event_eth_tx_adapter_txq_set(ev.mbuf, 0);

                                                pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);

                                +             pipeline_event_enqueue(dev, port, &ev);

                                                w->processed_pkts++;

                                } else {

                                                ev.queue_id++;

                                                pipeline_fwd_event(&ev, sched_type_list[cq_id]);

                                +             pipeline_event_enqueue(dev, port, &ev);

                                }



                -              pipeline_event_enqueue(dev, port, &ev);



2. Whether  "pipeline_event_enqueue" is needed after "pipeline_event_tx" for tx mode?

For single_stage_burst_tx mode, after "pipeline_event_tx", the worker has to enqueue again

due to  "pipeline_event_enqueue_burst", so maybe we should jump out of the loop after

“pipeline_event_tx”, for example:



                                                if (ev[i].sched_type == RTE_SCHED_TYPE_ATOMIC) {

                                                                pipeline_event_tx(dev, port, &ev[i]);

                                                                ev[i].op = RTE_EVENT_OP_RELEASE;

                                                                w->processed_pkts++;

                                                +             continue;

                                                } else {

                                                                ev[i].queue_id++;

                                                                pipeline_fwd_event(&ev[i],

                                                                                                RTE_SCHED_TYPE_ATOMIC);

                                                }

                                }



                                pipeline_event_enqueue_burst(dev, port, ev, nb_rx);

quoted

quoted

if (ev.sched_type == RTE_SCHED_TYPE_ATOMIC) {

quoted

                         +             __atomic_thread_fence(__ATOMIC_RELEASE);

quoted

                                         pipeline_event_tx(dev, port, &ev);

quoted

                                         w->processed_pkts++;

quoted

                         } else {

quoted

                                         ev.queue_id++;

quoted

                         +             __atomic_thread_fence(__ATOMIC_RELEASE);

quoted

                                         pipeline_fwd_event(&ev,

quoted

RTE_SCHED_TYPE_ATOMIC);

quoted

                                         pipeline_event_enqueue(dev, port, &ev);

quoted

quoted

However, there are two reasons to prevent this:

quoted

quoted

First, compare with other tests in app/eventdev, for example, the

quoted

eventdev perf test, the wmb is after event operation to ensure

quoted

operation has been finished and then w->processed_pkts++.

In case of perf_* tests start with a dequeue() and finally ends with a

mempool_put() should also act as implicit acquire release pairs making stats

consistent?



For perf tests, this consistency refers to that there is a wmb after mempool_put().

Please refer to this link:

http://patches.dpdk.org/patch/85634/

quoted

So, if we move release barriers before tx/fwd, it may cause that the

quoted

tests of app/eventdev become  inconsistent.This may reduce the

quoted

maintainability of the code and make it difficult to understand.

quoted

quoted

Second, it is a test case, though heavy thread may cause performance

quoted

degradation, it can ensure that the operation process and the test

quoted

result are correct. And maybe for a test case, correctness is more

quoted

important than performance.

quoted

Most of our internal perf test run on 24/48 core combinations and since

Octeontx2 event device driver supports a burst size of 1, it will show up as

Huge performance degradation.



For the impact on performance, I do the test using software driver, following are some test results:

------------------------------------------------------------------------------------------------------------------------------------

Architecture: aarch64

Nics: ixgbe-82599

CPU: Cortex-A72

BURST_SIZE: 1

Order: ./dpdk-test-eventdev -l 0-15 -s 0x2 --vdev=event_sw0 -- --test=pipeline_queue --wlcore=4-14 --prod_type_ethdev --stlist=a,a

Flow: one flow, 64bits package, TX rate: 1.4Mpps



Without this patch:

0.954 mpps avg 0.953 mpps



With this patch:

0.932 mpps avg 0.930 mpps

------------------------------------------------------------------------------------------------------------------------------------



Based on the result above, there is no significant performance degradation with this patch.

This is because the release barrier is only for  “w->processed_pkts++”. It just ensures that the worker core

increments the number events processed after enqueue, and it doesn’t affect dequeue/enqueue:



dequeue -> enqueue -> release barrier -> w->processed_pkts++



On the other hand, I infer the reason for the slight decrease in measurement performance is that the release barrier

prevent “w->processed_pkts++” before that the event has been processed (enqueue). But I think this test result is closer

to the real performance.

And sorry for that we have no octentx2 device, so there is no test result on Octeontx2 event device driver. Would you please

help us test this patch on octentx2 when you are convenient. Thanks very much.



Best Regards

Feifei

quoted

So, due to two reasons above, I'm ambivalent about how we should do in

quoted

the next step.

quoted

quoted
Best Regards

quoted
Feifei

Regards,

Pavan.

quoted

quoted

Fixes: 314bcf58ca8f ("app/eventdev: add pipeline queue worker

quoted
quoted
quoted
functions")

quoted

Cc: pbhagavatula@marvell.com<mailto:pbhagavatula@marvell.com>

quoted

Cc: stable@dpdk.org<mailto:stable@dpdk.org>

quoted
quoted
quoted

quoted

Signed-off-by: Phil Yang <phil.yang@arm.com<mailto:phil.yang@arm.com>>

quoted

Signed-off-by: Feifei Wang <feifei.wang2@arm.com<mailto:feifei.wang2@arm.com>>

quoted

Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com<mailto:ruifeng.wang@arm.com>>

quoted
quoted
quoted
---

quoted

app/test-eventdev/test_pipeline_queue.c | 64

quoted
quoted
quoted
+++++++++++++++++++++----

quoted

1 file changed, 56 insertions(+), 8 deletions(-)

quoted
quoted
quoted

quoted

diff --git a/app/test-eventdev/test_pipeline_queue.c b/app/test-

quoted

eventdev/test_pipeline_queue.c index 7bebac34f..0c0ec0ceb

quoted
100644

quoted

--- a/app/test-eventdev/test_pipeline_queue.c

quoted

+++ b/app/test-eventdev/test_pipeline_queue.c

quoted

@@ -30,7 +30,13 @@ pipeline_queue_worker_single_stage_tx(void

quoted
quoted
quoted
*arg)

quoted
quoted
quoted

quoted

                   if (ev.sched_type == RTE_SCHED_TYPE_ATOMIC) {

quoted

                                   pipeline_event_tx(dev, port, &ev);

quoted

-                                  w->processed_pkts++;

quoted
quoted
quoted
+

quoted

+                                 /* release barrier here ensures stored operation

quoted

+                                 * of the event completes before the number of

quoted

+                                 * processed pkts is visible to the main core

quoted
quoted
quoted
+                                 */

quoted

+                                 __atomic_fetch_add(&(w->processed_pkts), 1,

quoted

+                                                                 __ATOMIC_RELEASE);

quoted

                   } else {

quoted

                                   ev.queue_id++;

quoted

                                   pipeline_fwd_event(&ev,

quoted
quoted
quoted
RTE_SCHED_TYPE_ATOMIC);

quoted
quoted
quoted
@@ -59,7 +65,13 @@

quoted

pipeline_queue_worker_single_stage_fwd(void

quoted
quoted
quoted
*arg)

quoted

                   rte_event_eth_tx_adapter_txq_set(ev.mbuf, 0);

quoted

                   pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);

quoted

                   pipeline_event_enqueue(dev, port, &ev);

quoted
quoted
quoted
-                  w->processed_pkts++;

quoted
quoted
quoted
+

quoted

+                 /* release barrier here ensures stored operation

quoted

+                 * of the event completes before the number of

quoted

+                 * processed pkts is visible to the main core

quoted
quoted
quoted
+                 */

quoted

+                 __atomic_fetch_add(&(w->processed_pkts), 1,

quoted

+                                                 __ATOMIC_RELEASE);

quoted
quoted
quoted
   }

quoted
quoted
quoted

quoted
quoted
quoted
   return 0;

quoted
quoted
quoted
@@ -84,7 +96,13 @@

quoted

pipeline_queue_worker_single_stage_burst_tx(void *arg)

quoted

                                   if (ev[i].sched_type ==

quoted
quoted
quoted
RTE_SCHED_TYPE_ATOMIC) {

quoted

                                                   pipeline_event_tx(dev, port, &ev[i]);

quoted

                                                   ev[i].op = RTE_EVENT_OP_RELEASE;

quoted

-                                                  w->processed_pkts++;

quoted
quoted
quoted
+

quoted

+                                                 /* release barrier here ensures stored

quoted
quoted
quoted
operation

quoted

+                                                 * of the event completes before the

quoted
quoted
quoted
number of

quoted

+                                                 * processed pkts is visible to the main

quoted
quoted
quoted
core

quoted

+                                                 */

quoted

+                                                 __atomic_fetch_add(&(w-

quoted
quoted
quoted
quoted
processed_pkts), 1,

quoted

+                                                                                 __ATOMIC_RELEASE);

quoted

                                   } else {

quoted

                                                   ev[i].queue_id++;

quoted

                                                   pipeline_fwd_event(&ev[i],

quoted
quoted
quoted
@@ -121,7 +139,13 @@

quoted

pipeline_queue_worker_single_stage_burst_fwd(void *arg)

quoted

quoted
quoted
quoted

quoted

                   pipeline_event_enqueue_burst(dev, port, ev, nb_rx);

quoted

-                  w->processed_pkts += nb_rx;

quoted
quoted
quoted
+

quoted

+                 /* release barrier here ensures stored operation

quoted

+                 * of the event completes before the number of

quoted

+                 * processed pkts is visible to the main core

quoted
quoted
quoted
+                 */

quoted

+                 __atomic_fetch_add(&(w->processed_pkts), nb_rx,

quoted

+                                                 __ATOMIC_RELEASE);

quoted
quoted
quoted
   }

quoted
quoted
quoted

quoted
quoted
quoted
   return 0;

quoted
quoted
quoted
@@ -146,7 +170,13 @@

quoted

pipeline_queue_worker_multi_stage_tx(void

quoted
quoted
quoted
*arg)

quoted
quoted
quoted

quoted

                   if (ev.queue_id == tx_queue[ev.mbuf->port]) {

quoted

                                   pipeline_event_tx(dev, port, &ev);

quoted

-                                  w->processed_pkts++;

quoted
quoted
quoted
+

quoted

+                                 /* release barrier here ensures stored operation

quoted

+                                 * of the event completes before the number of

quoted

+                                 * processed pkts is visible to the main core

quoted
quoted
quoted
+                                 */

quoted

+                                 __atomic_fetch_add(&(w->processed_pkts), 1,

quoted

+                                                                 __ATOMIC_RELEASE);

quoted

                                   continue;

quoted

quoted
quoted
quoted

quoted
quoted
quoted
@@ -180,7 +210,13 @@

quoted

pipeline_queue_worker_multi_stage_fwd(void *arg)

quoted

                                   ev.queue_id = tx_queue[ev.mbuf->port];

quoted

                                   rte_event_eth_tx_adapter_txq_set(ev.mbuf, 0);

quoted

                                   pipeline_fwd_event(&ev,

quoted
quoted
quoted
RTE_SCHED_TYPE_ATOMIC);

quoted

-                                  w->processed_pkts++;

quoted
quoted
quoted
+

quoted

+                                 /* release barrier here ensures stored operation

quoted

+                                 * of the event completes before the number of

quoted

+                                 * processed pkts is visible to the main core

quoted
quoted
quoted
+                                 */

quoted

+                                 __atomic_fetch_add(&(w->processed_pkts), 1,

quoted

+                                                                 __ATOMIC_RELEASE);

quoted

                   } else {

quoted

                                   ev.queue_id++;

quoted

                                   pipeline_fwd_event(&ev,

quoted
quoted
quoted
sched_type_list[cq_id]);

quoted
quoted
quoted
@@ -214,7 +250,13 @@

quoted

pipeline_queue_worker_multi_stage_burst_tx(void *arg)

quoted

                                   if (ev[i].queue_id == tx_queue[ev[i].mbuf-

quoted
quoted
quoted
quoted
port]) {

quoted

                                                   pipeline_event_tx(dev, port, &ev[i]);

quoted

                                                   ev[i].op = RTE_EVENT_OP_RELEASE;

quoted

-                                                  w->processed_pkts++;

quoted
quoted
quoted
+

quoted

+                                                 /* release barrier here ensures stored

quoted
quoted
quoted
operation

quoted

+                                                 * of the event completes before the

quoted
quoted
quoted
number of

quoted

+                                                 * processed pkts is visible to the main

quoted
quoted
quoted
core

quoted

+                                                 */

quoted

+                                                 __atomic_fetch_add(&(w-

quoted
quoted
quoted
quoted
processed_pkts), 1,

quoted

+                                                                                 __ATOMIC_RELEASE);

quoted

                                                   continue;

quoted

quoted
quoted
quoted

quoted
quoted
quoted
@@ -254,7 +296,13 @@

quoted

pipeline_queue_worker_multi_stage_burst_fwd(void *arg)

quoted
quoted
quoted

quoted

   rte_event_eth_tx_adapter_txq_set(ev[i].mbuf, 0);

quoted

                                                   pipeline_fwd_event(&ev[i],

quoted
quoted
quoted

quoted
quoted
quoted
   RTE_SCHED_TYPE_ATOMIC);

quoted

-                                                  w->processed_pkts++;

quoted
quoted
quoted
+

quoted

+                                                 /* release barrier here ensures stored

quoted
quoted
quoted
operation

quoted

+                                                 * of the event completes before the

quoted
quoted
quoted
number of

quoted

+                                                 * processed pkts is visible to the main

quoted
quoted
quoted
core

quoted

+                                                 */

quoted

+                                                 __atomic_fetch_add(&(w-

quoted
quoted
quoted
quoted
processed_pkts), 1,

quoted

+                                                                                 __ATOMIC_RELEASE);

quoted

                                   } else {

quoted

                                                   ev[i].queue_id++;

quoted

                                                   pipeline_fwd_event(&ev[i],

quoted
quoted
quoted
--

quoted
quoted
quoted
2.17.1

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help