Re: [dpdk-dev] [RFC PATCH] dmadev: introduce DMA device library

From: fengchengwen <hidden>
Date: 2021-06-23 03:30:32

On 2021/6/23 1:25, Jerin Jacob wrote:

On Fri, Jun 18, 2021 at 3:11 PM fengchengwen [off-list ref] wrote:

quoted

On 2021/6/18 13:52, Jerin Jacob wrote:

quoted

On Thu, Jun 17, 2021 at 2:46 PM Bruce Richardson
[off-list ref] wrote:

quoted

On Wed, Jun 16, 2021 at 08:07:26PM +0530, Jerin Jacob wrote:

quoted

On Wed, Jun 16, 2021 at 3:47 PM fengchengwen [off-list ref] wrote:

quoted

On 2021/6/16 15:09, Morten Brørup wrote:

quoted

From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
Sent: Tuesday, 15 June 2021 18.39

On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:

quoted

This patch introduces 'dmadevice' which is a generic type of DMA
device.

The APIs of the dmadev library expose some generic operations which can
enable configuration and I/O with the DMA devices.

Signed-off-by: Chengwen Feng <redacted>
---

Thanks for sending this.

Of most interest to me right now are the key data-plane APIs. While we are
still in the prototyping phase, below is a draft of what we are thinking
for the key enqueue/perform_ops/completed_ops APIs.

Some key differences I note below vs your original RFC:
* Use of void pointers rather than iova addresses. While using iova's makes
  sense in the general case when using hardware, in that it can work with
  both physical addresses and virtual addresses, if we change the APIs to use
  void pointers instead it will still work for DPDK in VA mode, while at the
  same time allow use of software fallbacks in error cases, and also a stub
  driver that uses memcpy in the background. Finally, using iova's makes the
  APIs a lot more awkward to use with anything but mbufs or similar buffers
  where we already have a pre-computed physical address.
* Use of id values rather than user-provided handles. Allowing the user/app
  to manage the amount of data stored per operation is a better solution, I
  feel, than prescribing a certain amount of in-driver tracking. Some apps may
  not care about anything other than a job being completed, while other apps
  may have significant metadata to be tracked. Taking the user-context
  handles out of the API also makes the driver code simpler.
* I've kept a single combined API for completions, which differs from the
  separate error handling completion API you propose. I need to give the
  two function approach a bit of thought, but likely both could work. If we
  (likely) never expect failed ops, then the specifics of error handling
  should not matter that much.
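
To make the id-based scheme in the list above concrete, here is a minimal sketch. It is not the draft mentioned in this mail: every `sketch_*` and `app_*` name, the ring size, and the memcpy-backed stub device are assumptions made purely for illustration. It shows void-pointer arguments, a job id returned by enqueue, one combined completion call, and per-job metadata kept by the application rather than the driver.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_RING_SIZE 8u /* assumed; power of two so an id maps to a slot by masking */

/* Hypothetical stub device that performs each copy synchronously with memcpy. */
struct sketch_dma_dev {
	uint16_t next_id; /* id that the next enqueue will return */
	uint16_t done_id; /* first id not yet reported as completed */
};

/* Application-side tracking, indexed by job id. The driver never sees it. */
struct app_job_meta {
	const char *label;
};
struct app_job_meta app_meta[SKETCH_RING_SIZE];

/* Enqueue a copy. Returns the job id (0..65535), or -1 if the ring is full. */
int
sketch_dma_enqueue_copy(struct sketch_dma_dev *dev, const void *src,
			void *dst, unsigned int length)
{
	assert(dev != NULL);
	if ((uint16_t)(dev->next_id - dev->done_id) >= SKETCH_RING_SIZE)
		return -1;
	memcpy(dst, src, length); /* stub: the "hardware" is the CPU */
	return dev->next_id++;
}

/*
 * Combined completion call: reports up to max finished jobs, stores the id
 * of the newest one in *last_id, and flags errors in *has_error.
 */
unsigned int
sketch_dma_completed(struct sketch_dma_dev *dev, unsigned int max,
		     uint16_t *last_id, int *has_error)
{
	unsigned int n = (uint16_t)(dev->next_id - dev->done_id);

	if (n > max)
		n = max;
	*has_error = 0; /* the memcpy stub cannot fail */
	if (n > 0) {
		dev->done_id = (uint16_t)(dev->done_id + n);
		*last_id = (uint16_t)(dev->done_id - 1);
	}
	return n;
}
```

An application that only cares about completion can ignore `app_meta` entirely; one that needs context stores it at `app_meta[id & (SKETCH_RING_SIZE - 1)]` after enqueue and reads it back when the id is reported.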

For the rest, the control / setup APIs are likely to be rather
uncontroversial, I suspect. However, I think that rather than xstats APIs,
the library should first provide a set of standardized stats like ethdev
does. If driver-specific stats are needed, we can add xstats later to the
API.

Appreciate your further thoughts on this, thanks.

Regards,
/Bruce

I generally agree with Bruce's points above.

I would like to share a couple of ideas for further discussion:


I believe some of the other requirements and comments for generic DMA will be

1) Support for the _channel_. Each channel may have different
capabilities and functionalities.
A typical case is that each channel has separate source and destination
devices, like DMA between PCIe EP and Host memory, Host memory and Host
memory, or PCIe EP and PCIe EP.
So we need some notion of the channel in the specification.

Can you share a bit more detail on what constitutes a channel in this case?
Is it equivalent to a device queue (which we are flattening to individual
devices in this API), or to a specific configuration on a queue?

It is not a queue. It is one of the attributes of a transfer.
I.e., in the same queue, a given transfer can specify a
different "source" and "destination" device,
like CPU to sound card, CPU to network card, etc.

quoted

2) I assume current data plane APIs are not thread-safe. Is it right?

Yes.

quoted

3) Cookie scheme outlined earlier looks good to me. Instead of having
generic dequeue() API

4) Can split the rte_dmadev_enqueue_copy(uint16_t dev_id, void * src,
void * dst, unsigned int length);
into a two-stage API, where one stage will be used in the fastpath and the
other will be used in the slowpath.

- The slowpath API will take the channel and other attributes for the transfer

Example syntax will be:

struct rte_dmadev_desc {
        channel id;
        ops;  // copy, xor, fill etc
        other arguments specific to dma transfer  // it can be set based on capability.
};

rte_dmadev_desc_t rte_dmadev_prepare(uint16_t dev_id, struct rte_dmadev_desc *desc);

- Fastpath takes arguments that need to change per transfer along with
slow-path handle.

rte_dmadev_enqueue(uint16_t dev_id, void *src, void *dst, unsigned int length, rte_dmadev_desc_t desc)

This will help the driver as follows:
- The former API forms the device-specific descriptors in the slow path for a
given channel and fixed attributes per transfer
- The latter API blends "variable" arguments such as src, dest address with
slow-path created descriptors
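
A minimal sketch of the two-stage idea described above, under stated assumptions: the `sketch_*` names, the layout of the stand-in "hardware" job, the `ctrl` bit packing and the memcpy/memset stub execution are all invented for illustration and do not describe any real device or the final dmadev API. The slow-path call builds the fixed part of the job once; the fast-path call copies that template and patches in only src, dst and length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum sketch_dma_op { SKETCH_DMA_OP_COPY = 0, SKETCH_DMA_OP_FILL = 1 };

/* Slow-path attributes: fixed across a series of transfers. */
struct sketch_dma_desc {
	uint16_t channel;
	enum sketch_dma_op op;
	uint8_t fill_pattern; /* used only by SKETCH_DMA_OP_FILL */
};

/* Stand-in for a device-specific job descriptor (assumed layout). */
struct sketch_hw_job {
	uint32_t ctrl; /* channel[31:16] | op[15:8] | pattern[7:0], built once */
	const void *src;
	void *dst;
	uint32_t length;
};

typedef const struct sketch_hw_job *sketch_dma_desc_t;

#define SKETCH_MAX_PREPARED 4u
static struct sketch_hw_job sketch_prepared[SKETCH_MAX_PREPARED];
static unsigned int sketch_nb_prepared;

/* Slow path: translate generic attributes into a descriptor template. */
sketch_dma_desc_t
sketch_dma_prepare(uint16_t dev_id, const struct sketch_dma_desc *desc)
{
	struct sketch_hw_job *tmpl;

	(void)dev_id; /* single stub device */
	assert(desc != NULL);
	if (sketch_nb_prepared == SKETCH_MAX_PREPARED)
		return NULL;
	tmpl = &sketch_prepared[sketch_nb_prepared++];
	memset(tmpl, 0, sizeof(*tmpl));
	tmpl->ctrl = ((uint32_t)desc->channel << 16) |
		     ((uint32_t)desc->op << 8) | desc->fill_pattern;
	return tmpl;
}

/* Fast path: blend per-transfer arguments into the prepared template. */
int
sketch_dma_enqueue(uint16_t dev_id, const void *src, void *dst,
		   unsigned int length, sketch_dma_desc_t desc)
{
	struct sketch_hw_job job = {0}; /* NULL handle: plain copy, channel 0 */

	(void)dev_id;
	if (desc != NULL)
		job = *desc;
	job.src = src;
	job.dst = dst;
	job.length = length;

	/* Stub "hardware": execute the job on the CPU. */
	if ((enum sketch_dma_op)((job.ctrl >> 8) & 0xffu) == SKETCH_DMA_OP_FILL)
		memset(job.dst, (int)(job.ctrl & 0xffu), job.length);
	else
		memcpy(job.dst, job.src, job.length);
	return 0;
}
```

In this sketch a NULL handle costs one branch in the fast path and falls back to a plain copy, which is one way to read the "rte_dmadev_desc_t can be NULL" suggestion made later in the thread.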

This seems like an API for a context-aware device, where the channel is the
config data/context that is preserved across operations - is that correct?
At least from the Intel DMA accelerators side, we have no concept of this
context, and each operation is completely self-described. The location or
type of memory for copies is irrelevant, you just pass the src/dst
addresses to reference.

It is not a context-aware device. Each HW JOB is self-described.
You can view it as different attributes of a transfer.

quoted

The above will give better performance and is the best trade-off
between performance and per-transfer variables.

We may need to have different APIs for context-aware and context-unaware
processing, with which to use determined by the capabilities discovery.
Given that for these DMA devices the offload cost is critical, more so than
any other dev class I've looked at before, I'd like to avoid having APIs
with extra parameters that need to be passed about since that just adds
extra CPU cycles to the offload.

If the driver does not support additional attributes and/or the
application does not need them, rte_dmadev_desc_t can be NULL,
so that it won't have any cost in the datapath. I think we can go to
different API cases if we cannot abstract the problem without a performance
impact. Otherwise, it will be too much pain for applications.

Yes, currently we plan to use a different API for each case, e.g.
  rte_dmadev_memcpy()  -- deal with local to local memcopy
  rte_dmadev_memset()  -- deal with filling local memory with a pattern
maybe:
  rte_dmadev_imm_data()  -- deal with copying very little data
  rte_dmadev_p2pcopy()   -- deal with peer-to-peer copy between different PCIe addresses

These API capabilities will be reflected in the device capability set so that
the application can discover them through a standard API.


There will be a lot of combinations of those; it will be like an M x N cross
of base cases. It won't scale.

Currently, it is hard to define a generic DMA descriptor; I think well-defined
APIs are feasible.

quoted

Just to understand, I think we need to know the HW capabilities and how to
have a common API.
I assume HW will have some HW JOB descriptors which will be filled in by
SW and submitted to HW.
In our HW, the Job descriptor has the following main elements:

- Channel // We don't expect the application to change per transfer
- Source address - It can be scatter-gather too - Will be changed per transfer
- Destination address - It can be scatter-gather too - Will be changed per transfer
- Transfer Length - It can be scatter-gather too - Will be changed per transfer
- IOVA address where HW posts Job completion status PER Job descriptor - Will be changed per transfer
- Other sideband information related to channel // We don't expect the application to change per transfer
- As an option, Job completion can be posted as an event to rte_event_queue too // We don't expect the application to change per transfer
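
The element list above can be sketched as two structures, one for the fields the application is not expected to change per transfer and one for the fields that change every time. This is only an illustration: the `sketch_*` names, field widths and status codes are assumptions, plain pointers stand in for IOVA addresses, and a CPU loop stands in for the hardware executing a scatter-gather copy and posting its status.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One scatter-gather element. */
struct sketch_sge {
	void *addr;
	uint32_t length;
};

/* Fields fixed across transfers (could be set up once by the driver). */
struct sketch_job_fixed {
	uint16_t channel;
	uint32_t sideband;
	int completion_as_event; /* optional event delivery, unused by the stub */
};

/* Fields that change on every transfer. */
struct sketch_job_var {
	const struct sketch_sge *src;
	uint16_t nb_src;
	const struct sketch_sge *dst;
	uint16_t nb_dst;
	volatile uint8_t *status; /* where the job completion status is posted */
};

enum { SKETCH_JOB_DONE = 1, SKETCH_JOB_ERROR = 2 };

/* Stub execution: gather from src segments, scatter into dst segments. */
int
sketch_job_run(const struct sketch_job_fixed *fixed,
	       const struct sketch_job_var *var)
{
	size_t si = 0, di = 0, soff = 0, doff = 0;

	(void)fixed; /* the stub has a single channel and no sideband use */
	assert(var != NULL && var->status != NULL);
	while (si < var->nb_src && di < var->nb_dst) {
		size_t sleft = var->src[si].length - soff;
		size_t dleft = var->dst[di].length - doff;
		size_t n = sleft < dleft ? sleft : dleft;

		memcpy((uint8_t *)var->dst[di].addr + doff,
		       (const uint8_t *)var->src[si].addr + soff, n);
		soff += n;
		doff += n;
		if (soff == var->src[si].length) { /* source segment drained */
			si++;
			soff = 0;
		}
		if (doff == var->dst[di].length) { /* destination segment full */
			di++;
			doff = 0;
		}
	}
	/* The job succeeds only if every source byte found a destination. */
	*var->status = (si == var->nb_src) ? SKETCH_JOB_DONE : SKETCH_JOB_ERROR;
	return *var->status == SKETCH_JOB_DONE ? 0 : -1;
}
```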

The 'option' field looks like a software interface field, not a HW descriptor field.

It is in HW descriptor.

The HW is interesting, something like: DMA could send completions directly to EventHWQueue;
the DMA and EventHWQueue are linked in hardware, rather than by software.

Could you provide a public driver for this HW? Then we could learn more about its working
mechanism and software-hardware collaboration.

quoted

@Richardson, Bruce @fengchengwen @Hemant Agrawal

Could you share the options for your HW descriptors  which you are
planning to expose through API like above so that we can easily
converge on fastpath API

The Kunpeng HW descriptor is self-describing, and doesn't need to refer to context info.

Maybe the fields which are fixed for a given transfer type could be set up by the driver, and
not exposed to the application.

Yes, I agree. I think that is the reason why I thought to have an
rte_dmadev_prep() call to convert DPDK DMA transfer attributes to HW
specific descriptors
and have a single enq() operation with variable arguments (through enq
parameters) and fixed arguments through the rte_dmadev_prep() call object.

quoted

So that we could use more generic way to define the API.

quoted

quoted
/Bruce
.
