Thread (12 messages) 12 messages, 4 authors, 2016-07-23

Re: [PATCH] lib: change rte_ring dequeue to guarantee ordering before tail update

From: Ananyev, Konstantin <hidden>
Date: 2016-07-23 12:32:04

-----Original Message-----
From: Jerin Jacob [mailto:jerin.jacob@caviumnetworks.com]
Sent: Saturday, July 23, 2016 12:49 PM
To: Ananyev, Konstantin <redacted>
Cc: Thomas Monjalon <redacted>; Juhamatti Kuusisaari <redacted>; dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH] lib: change rte_ring dequeue to guarantee ordering before tail update

On Sat, Jul 23, 2016 at 11:15:27AM +0000, Ananyev, Konstantin wrote:
quoted
quoted
-----Original Message-----
From: Jerin Jacob [mailto:jerin.jacob@caviumnetworks.com]
Sent: Saturday, July 23, 2016 11:39 AM
To: Ananyev, Konstantin <redacted>
Cc: Thomas Monjalon <redacted>; Juhamatti
Kuusisaari [off-list ref]; dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH] lib: change rte_ring dequeue to
guarantee ordering before tail update

On Sat, Jul 23, 2016 at 10:14:51AM +0000, Ananyev, Konstantin wrote:
quoted
Hi lads,
quoted
On Sat, Jul 23, 2016 at 11:02:33AM +0200, Thomas Monjalon wrote:
quoted
2016-07-23 8:05 GMT+02:00 Jerin Jacob [off-list ref]:
quoted
On Thu, Jul 21, 2016 at 11:26:50PM +0200, Thomas Monjalon wrote:
quoted
quoted
quoted
Consumer queue dequeuing must be guaranteed to be done
fully before the tail is updated. This is not
guaranteed with a read barrier, changed to a write
barrier just before tail update which in
practice guarantees correct order of reads and writes.
quoted
quoted
quoted
quoted
quoted
Signed-off-by: Juhamatti Kuusisaari
[off-list ref]
Acked-by: Konstantin Ananyev
[off-list ref]
Applied, thanks
There was ongoing discussion on this
http://dpdk.org/ml/archives/dev/2016-July/044168.html
Sorry Jerin, I forgot this email.
The problem is that nobody replied to your email and you did
not nack the v2 of this patch.
It's probably my bad.
I acked the patch before Jerin response, and forgot to reply later.
quoted
quoted
quoted
This change may not be required as it has the performance impact.
We need to clearly understand what is the performance impact
(numbers and use cases) on one hand, and is there a real bug
fixed by this patch on the other hand?
IHMO, there is no real bug here. rte_smb_rmb() provides the
LOAD-STORE barrier to make sure tail pointer WRITE happens only after prior LOADS.
Yep, from what I read at the link Jerin provided, indeed it seems rte_smp_rmb() is enough for the arm arch here...
For ppc, as I can see both rte_smp_rmb()/rte_smp_wmb() emits the same instruction.
quoted
Thoughts?
Wonder how big is a performance impact?
With this change we need to wait for addtional STORES to be completed to local buffer in addtion to LOADS from ring buffers memory.
I understand that, just wonder did you see any real performance difference?
Yeah...
Ok, then I don't see any good reason why we shouldn't revert it.
I suppose the best way would be to submit a new patch for RC5 to revert the changes.
Do you prefer to submit it yourself and I'll ack it or visa-versa?
Thanks
Konstantin 
quoted
Probably with ring_perf_autotest/mempool_perf_autotest or something?
W/O change
RTE>>ring_perf_autotest
### Testing single element and burst enq/deq ### SP/SC single enq/dequeue: 4 MP/MC single enq/dequeue: 16 SP/SC burst enq/dequeue
(size: 8): 0 MP/MC burst enq/dequeue (size: 8): 2 SP/SC burst enq/dequeue (size: 32): 0 MP/MC burst enq/dequeue (size: 32): 0

### Testing empty dequeue ###
SC empty dequeue: 0.35
MC empty dequeue: 0.60

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 0.93
MP/MC bulk enq/dequeue (size: 8): 2.45
SP/SC bulk enq/dequeue (size: 32): 0.58
MP/MC bulk enq/dequeue (size: 32): 0.97

### Testing using two physical cores ### SP/SC bulk enq/dequeue (size: 8): 1.89 MP/MC bulk enq/dequeue (size: 8): 4.28 SP/SC bulk
enq/dequeue (size: 32): 0.90 MP/MC bulk enq/dequeue (size: 32): 1.19 Test OK
RTE>>

With change
RTE>>ring_perf_autotest
### Testing single element and burst enq/deq ### SP/SC single enq/dequeue: 6 MP/MC single enq/dequeue: 16 SP/SC burst enq/dequeue
(size: 8): 1 MP/MC burst enq/dequeue (size: 8): 2 SP/SC burst enq/dequeue (size: 32): 0 MP/MC burst enq/dequeue (size: 32): 0

### Testing empty dequeue ###
SC empty dequeue: 0.35
MC empty dequeue: 0.60

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 1.28
MP/MC bulk enq/dequeue (size: 8): 2.47
SP/SC bulk enq/dequeue (size: 32): 0.64
MP/MC bulk enq/dequeue (size: 32): 0.97

### Testing using two physical cores ### SP/SC bulk enq/dequeue (size: 8): 2.08 MP/MC bulk enq/dequeue (size: 8): 4.29 SP/SC bulk
enq/dequeue (size: 32): 1.24 MP/MC bulk enq/dequeue (size: 32): 1.19 Test OK
quoted
Konstantin
quoted
quoted
If there is a real one, I suppose we can revert the patch?
Request to revert this one as their no benifts for other
architectures and indeed it creates addtional delay in waiting for STORES to complete in ARM.
Lets do the correct thing by reverting it.

Jerin


quoted
Konstantin
quoted
quoted
Please guys make things clear and we'll revert if needed.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help