Thread (339 messages) 339 messages, 17 authors, 2021-10-17

Re: [dpdk-dev] [PATCH] dmadev: introduce DMA device library

From: Jerin Jacob <hidden>
Date: 2021-07-11 07:14:42

On Fri, Jul 9, 2021 at 2:44 PM Bruce Richardson
[off-list ref] wrote:
On Fri, Jul 09, 2021 at 12:05:40AM +0530, Jerin Jacob wrote:
quoted
On Thu, Jul 8, 2021 at 8:41 AM fengchengwen [off-list ref] wrote:
quoted
quoted
quoted
quoted
quoted
It's just more conditionals and branches all through the code. Inside the
user application, the user has to check whether to set the flag or not (or
special-case the last transaction outside the loop), and within the driver,
there has to be a branch whether or not to call the doorbell function. The
code on both sides is far simpler and more readable if the doorbell
function is exactly that - a separate function.
I disagree. The reason is:

We will have two classes of applications

a) do dma copy request as and when it has data(I think, this is the
prime use case), for those,
I think, it is considerable overhead to have two function invocation
per transfer i.e
rte_dma_copy() and rte_dma_perform()

b) do dma copy when the data is reached to a logical state,  like copy
IP frame from Ethernet packets or so,
In that case, the application will have  a LOGIC to detect when to
perform it so on the end of
that rte_dma_copy() flag can be updated to fire the doorbell.

IMO, We are comparing against a branch(flag is already in register) vs
a set of instructions for
1) function pointer overhead
2) Need to use the channel context again back in another function.

IMO, a single branch is most optimal from performance PoV.
Ok, let's try it and see how it goes.
Test result show:
1) For Kunpeng platform (ARMv8) could benefit very little with doorbell in flags
2) For Xeon E5-2690 v2 (X86) could benefit with separate function
3) Both platform could benefit with doorbell in flags if burst < 5

There is a performance gain in small bursts (<5). Given the extensive use of bursts
 in DPDK applications and users are accustomed to the concept, I do
not recommend
quoted
using the 'doorbell' in flags.
There is NO concept change between one option vs other option. Just
argument differnet.
Also, _perform() scheme not used anywhere in DPDK. I

Regarding performance, I have added dummy instructions to simulate the real work
load[1], now burst also has some gain in both x86 and arm64[3]

I have modified your application[2] to dpdk test application to use
cpu isolation etc.
So this is gain in flag scheme ad code is checked in to Github[2[
<snip>

The benchmark numbers all seem very close between the two schemes. On my
team we pretty much have test ioat & idxd drivers ported internally to the
last dmadev draft library, and have sample apps handling traffic using
those. I'll therefore attempt to get these numbers with real traffic on
real drivers to just double check that it's the same as these
microbenchmarks.
Thanks.
Assuming that perf is the same, how to resolve this? Some thoughts:
* As I understand it, the main objection to the separate doorbell function
  is the use of 8-bytes in fastpath slot. Therefore I will also attempt to
  benchmark having the doorbell function not on the same cacheline and check
  perf impact, if any.
Probably we can remove rte_dmadev_fill_sg() variant and keep sg only for copy
to save 8B.
* If we don't have a impact to perf by having the doorbell function inside
  the regular "ops" rather than on fastpath cacheline, there is no reason
  we can't implement both schemes. The user can then choose themselves
  whether to doorbell using a flag on last item, or to doorbell explicitly
  using function call.
Yes. I think, we can keep both.
Of the two schemes, and assuming they are equal, I do have a preference for
the separate function one, primarily from a code readability point of view.
Other than that, I have no strong opinions.

/Bruce
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help