Thread (339 messages) 339 messages, 17 authors, 2021-10-17

Re: [dpdk-dev] [PATCH] dmadev: introduce DMA device library

From: Bruce Richardson <hidden>
Date: 2021-07-07 08:35:30

On Wed, Jul 07, 2021 at 01:38:58PM +0530, Jerin Jacob wrote:
On Mon, Jul 5, 2021 at 10:46 PM Bruce Richardson
[off-list ref] wrote:
quoted
On Mon, Jul 05, 2021 at 09:25:34PM +0530, Jerin Jacob wrote:
quoted
On Mon, Jul 5, 2021 at 4:22 PM Bruce Richardson
[off-list ref] wrote:
quoted
On Sun, Jul 04, 2021 at 03:00:30PM +0530, Jerin Jacob wrote:
quoted
On Fri, Jul 2, 2021 at 6:51 PM Chengwen Feng [off-list ref] wrote:
quoted
This patch introduces 'dmadevice' which is a generic type of DMA
device.
<snip>
quoted
quoted
+1 and the terminology with regards to queues and channels. With our ioat
hardware, each HW queue was called a channel for instance.
Looks like <dmadev> <> <channel> can cover all the use cases, if the
HW has more than
1 queues it can be exposed as separate dmadev dev.
Fine for me.

However, just to confirm that Morten's suggestion of using a
(device-specific void *) channel pointer rather than dev_id + channel_id
pair of parameters won't work for you? You can't store a pointer or dev
index in the channel struct in the driver?
Yes. That will work. To confirm, the suggestion is to use, void *
object instead of channel_id,
That will avoid one more indirection.(index -> pointer)
The proposal was to use it in place of "dev_id + channel_id", i.e.

copy(dev_id, ch_id, src, dst, len, flags) --> copy(ch_ptr, src, dst, len, flags)

Where the channel pointer implicitly indicates the device too. However, I
realise now that this would be something completely transparent to the
driver as it would all have to be implemented in the dmadev level, and
lead to lots of duplication of function pointers, etc. Therefore, let's
just go with original scheme. :-(
quoted
quoted
<snip>
quoted
quoted
Got it. In order to save space if first CL size for fastpath(Saving 8B
for the pointer) and to avoid
function overhead, Can we use one bit of flags of op function to
enable the fence?
The original ioat implementation did exactly that. However, I then
discovered that because a fence logically belongs between two operations,
does the fence flag on an operation mean "don't do any jobs after this
until this job has completed" or does it mean "don't start this job until
all previous jobs have completed". [Or theoretically does it mean both :-)]
Naturally, some hardware does it the former way (i.e. fence flag goes on
last op before fence), while other hardware the latter way (i.e. fence flag
goes on first op after the fence). Therefore, since fencing is about
ordering *between* two (sets of) jobs, I decided that it should do exactly
that and go between two jobs, so there is no ambiguity!

However, I'm happy enough to switch to having a fence flag, but I think if
we do that, it should be put in the "first job after fence" case, because
it is always easier to modify a previously written job if we need to, than
to save the flag for a future one.

Alternatively, if we keep the fence as a separate function, I'm happy
enough for it not to be on the same cacheline as the "hot" operations,
since fencing will always introduce a small penalty anyway.
Ack.
You may consider two flags, FENCE_THEN_JOB and JOB_THEN_FENCE( If
there any use case for this or it makes sense for your HW)


For us, Fence is NOP for us as we have an implicit fence between each
HW job descriptor.
I actually still think that having a separate fence function in the "ops"
section is the best approach here. It's unabiguous as to whether it's
fence-before or fence-after, and if we have it in the ops, it doesn't use a
"fast-path" slot.

However, if we *really* want to use a flag instead, I don't see the value
in having two flags, it will be really confusing.  Instead, if we do go
with a flag, I think "RTE_DMA_PRE_FENCE" should be the name, indicating
that the fence occurs before the job in question.
quoted
quoted
quoted
quoted
<snip>
quoted
quoted
quoted
Since we have additional function call overhead in all the
applications for this scheme, I would like to understand
the use of doing this way vs enq does the doorbell implicitly from
driver/application PoV?
In our benchmarks it's just faster. When we tested it, the overhead of the
function calls was noticably less than the cost of building up the
parameter array(s) for passing the jobs in as a burst. [We don't see this
cost with things like NIC I/O since DPDK tends to already have the mbuf
fully populated before the TX call anyway.]
OK. I agree with stack population.

My question was more on doing implicit doorbell update enq. Is doorbell write
costly in other HW compare to a function call? In our HW, it is just write of
the number of instructions written in a register.

Also, we need to again access the internal PMD memory structure to find
where to write etc if it is a separate function.
The cost varies depending on a number of factors - even writing to a single
HW register can be very slow if that register is mapped as device
(uncacheable) memory, since (AFAIK) it will act as a full fence and wait
I don't know, At least in our case, writes are write-back. so core does not need
to wait.(If there is no read operation).
quoted
for the write to go all the way to hardware. For more modern HW, the cost
can be lighter. However, any cost of HW writes is going to be the same
whether its a separate function call or not.

However, the main thing about the doorbell update is that it's a
once-per-burst thing, rather than a once-per-job. Therefore, even if you
have to re-read the struct memory (which is likely still somewhere in your
cores' cache), any extra small cost of doing so is to be amortized over the
cost of a whole burst of copies.
Linux kernel has xmit_more flag in skb to address similar thing.
i.e enq job flag can have one more bit field to say update ring bell or not?
Rather having yet another function overhead.IMO, it is the best of both worlds.
It's just more conditionals and branches all through the code. Inside the
user application, the user has to check whether to set the flag or not (or
special-case the last transaction outside the loop), and within the driver,
there has to be a branch whether or not to call the doorbell function. The
code on both sides is far simpler and more readable if the doorbell
function is exactly that - a separate function.
quoted
quoted
quoted
quoted
<snip>
quoted
quoted
quoted
quoted
+ +/** + * @warning + * @b EXPERIMENTAL: this API may change
without prior notice.  + * + * Returns the number of operations
that failed to complete.  + * NOTE: This API was used when
rte_dmadev_completed has_error was set.  + * + * @param dev_id
+ *   The identifier of the device.  + * @param vq_id + *   The
identifier of virt queue.
(> + * @param nb_status
quoted
+ *   Indicates the size  of status array.  + * @param[out]
status + *   The error code of operations that failed to
complete.  + * @param[out] cookie + *   The last failed
completed operation's cookie.  + * + * @return + *   The number
of operations that failed to complete.  + * + * NOTE: The
caller must ensure that the input parameter is valid and the +
*       corresponding device supports the operation.  + */
+__rte_experimental +static inline uint16_t
+rte_dmadev_completed_fails(uint16_t dev_id, uint16_t vq_id, +
const uint16_t nb_status, uint32_t *status, +
dma_cookie_t *cookie)
IMO, it is better to move cookie/rind_idx at 3.  Why it would
return any array of errors? since it called after
rte_dmadev_completed() has has_error. Is it better to change

rte_dmadev_error_status((uint16_t dev_id, uint16_t vq_id,
dma_cookie_t *cookie,  uint32_t *status)

I also think, we may need to set status as bitmask and enumerate
all the combination of error codes of all the driver and return
string from driver existing rte_flow_error

See struct rte_flow_error { enum rte_flow_error_type type; /**<
Cause field and error types. */ const void *cause; /**< Object
responsible for the error. */ const char *message; /**<
Human-readable error message. */ };
I think we need a multi-return value API here, as we may add
operations in future which have non-error status values to return.
The obvious case is DMA engines which support "compare" operations.
In that case a successful compare (as in there were no DMA or HW
errors) can return "equal" or "not-equal" as statuses. For general
"copy" operations, the faster completion op can be used to just
return successful values (and only call this status version on
error), while apps using those compare ops or a mixture of copy and
compare ops, would always use the slower one that returns status
values for each and every op..

The ioat APIs used 32-bit integer values for this status array so
as to allow e.g. 16-bits for error code and 16-bits for future
status values. For most operations there should be a fairly small
set of things that can go wrong, i.e. bad source address, bad
destination address or invalid length.  Within that we may have a
couple of specifics for why an address is bad, but even so I don't
think we need to start having multiple bit combinations.
OK. What is the purpose of errors status? Is it for application
printing it or Does the application need to take any action based on
specific error requests?
It's largely for information purposes, but in the case of SVA/SVM
errors could occur due to the memory not being pinned, i.e. a page
fault, in some cases. If that happens, then it's up the app to either
touch the memory and retry the copy, or to do a SW memcpy as a
fallback.

In other error cases, I think it's good to tell the application if it's
passing around bad data, or data that is beyond the scope of hardware,
e.g.  a copy that is beyond what can be done in a single transaction
for a HW instance. Given that there are always things that can go
wrong, I think we need some error reporting mechanism.
quoted
If the former is scope, then we need to define the standard enum
value for the error right?  ie. uint32_t *status needs to change to
enum rte_dma_error or so.
Sure. Perhaps an error/status structure either is an option, where we
explicitly call out error info from status info.
Agree. Better to have a structure with filed like,

1)  enum rte_dma_error_type 2)  memory to store, informative message on
fine aspects of error.  LIke address caused issue etc.(Which will be
driver-specific information).
The only issue I have with that is that once we have driver specific
information it is of little use to the application, since it can't know
anything about it excepy maybe log it.  I'd much rather have a set of error
codes telling user that "source address is wrong", "dest address is wrong",
and a generic "an address is wrong" in case driver/HW cannot distinguish
source of error. Can we see how far we get with just error codes before we
start into passing string messages around and all the memory management
headaches that implies.

/Bruce
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help