Thread: 128 messages, 11 authors, 2021-11-08

Re: [dpdk-dev] [RFC PATCH v2 0/7] heterogeneous computing library

From: Thomas Monjalon <hidden>
Date: 2021-08-27 09:44:47

31/07/2021 15:42, Jerin Jacob:
On Sat, Jul 31, 2021 at 1:51 PM Thomas Monjalon [off-list ref] wrote:
quoted
31/07/2021 09:06, Jerin Jacob:
quoted
On Fri, Jul 30, 2021 at 7:25 PM Thomas Monjalon [off-list ref] wrote:
quoted
From: Elena Agostini <redacted>

In a heterogeneous computing system, processing is not only in the CPU.
Some tasks can be delegated to devices working in parallel.

The goal of this new library is to enhance the collaboration between
DPDK, which is primarily a CPU framework, and other types of devices like GPUs.

When mixing network activity with task processing on a non-CPU device,
the CPU and the device may need to communicate
in order to manage memory, synchronize operations, exchange info, etc.

This library provides a number of new features:
- Interoperability with device-specific libraries through generic handlers
- Possibility to allocate and free memory on the device
- Possibility to allocate and free memory on the CPU but visible from the device
- Communication functions to enhance the dialog between the CPU and the device
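
To make the memory items in the list above concrete, here is a minimal sketch of how generic handlers could dispatch allocation requests to a device-specific driver. Every name in it (hc_mem_ops, hc_mem_alloc, hc_mem_free, the two memory kinds) is hypothetical and not taken from the hcdev patches, and the "driver" is a stand-in backed by the host heap so that the example runs without any device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* All names below are hypothetical; none is taken from the hcdev patches. */
enum hc_mem_kind {
    HC_MEM_DEVICE,      /* memory located on the device */
    HC_MEM_CPU_VISIBLE  /* CPU memory that the device can also access */
};

/* Generic handlers that a device-specific driver would fill in. */
struct hc_mem_ops {
    void *(*alloc)(size_t size, enum hc_mem_kind kind);
    void (*release)(void *addr, enum hc_mem_kind kind);
};

/* Stand-in driver backed by the host heap so the sketch runs without a GPU. */
static void *fake_alloc(size_t size, enum hc_mem_kind kind)
{
    (void)kind;
    return calloc(1, size);
}

static void fake_release(void *addr, enum hc_mem_kind kind)
{
    (void)kind;
    free(addr);
}

/* Generic entry point: validate arguments, then dispatch to the driver. */
void *hc_mem_alloc(const struct hc_mem_ops *ops, size_t size,
                   enum hc_mem_kind kind)
{
    if (ops == NULL || ops->alloc == NULL || size == 0)
        return NULL;
    return ops->alloc(size, kind);
}

/* Returns 0 on success, -1 when there is nothing to free or no driver. */
int hc_mem_free(const struct hc_mem_ops *ops, void *addr,
                enum hc_mem_kind kind)
{
    if (ops == NULL || ops->release == NULL || addr == NULL)
        return -1;
    ops->release(addr, kind);
    return 0;
}

/* Exercise both memory kinds through the same generic calls. */
int hc_mem_demo(void)
{
    const struct hc_mem_ops ops = { fake_alloc, fake_release };
    void *dev = hc_mem_alloc(&ops, 4096, HC_MEM_DEVICE);
    char *shared = hc_mem_alloc(&ops, 64, HC_MEM_CPU_VISIBLE);
    int ok = dev != NULL && shared != NULL;

    if (shared != NULL)
        memcpy(shared, "ping", 5); /* CPU writes data the device could read */

    /* A zero-sized request is rejected before reaching the driver. */
    assert(hc_mem_alloc(&ops, 0, HC_MEM_DEVICE) == NULL);

    hc_mem_free(&ops, shared, HC_MEM_CPU_VISIBLE);
    hc_mem_free(&ops, dev, HC_MEM_DEVICE);
    return ok ? 0 : -1;
}
```

A real driver would replace the two stand-in callbacks with calls into its vendor library; the application-facing functions would stay the same.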

The infrastructure is prepared to welcome drivers in drivers/hc/,
such as the upcoming NVIDIA one, implementing the hcdev API.

Some parts are not complete:
  - locks
  - memory allocation table
  - memory freeing
  - guide documentation
  - integration in devtools/check-doc-vs-code.sh
  - unit tests
  - integration in testpmd to enable Rx/Tx to/from GPU memory.
Since the above line is the crux of the following text, I will start
from this point.

+ Techboard

I can give my honest feedback on this.

I can map similar stuff in Marvell HW, where we do machine learning
as compute offload
on a different class of CPU.

In terms of RFC patch features

1) memory API - Use cases are aligned
2) communication flag and communication list
Our structure is completely different: we use a HW-ring kind of
interface to post the job to the compute interface, and
the job completion result comes back through the event device.
It is kind of similar to the DMA API that has been discussed on the mailing list.
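
As an illustration of the "communication flag" idea in item 2, the sketch below shows one way a flag in shared memory could be written by the CPU and polled by the other side. It is an assumption about the general technique, not the RFC's implementation: the names (comm_flag, comm_flag_set, comm_flag_get, fake_device_poll) are hypothetical, and the device is simulated by an ordinary C function.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag states; the RFC does not define these values. */
enum { FLAG_IDLE = 0, FLAG_READY = 1, FLAG_DONE = 2 };

/* One word of memory that both the CPU and the device can access. */
struct comm_flag {
    atomic_uint value;
};

/* Release ordering: data written before the flag is visible to the reader. */
void comm_flag_set(struct comm_flag *f, unsigned int v)
{
    atomic_store_explicit(&f->value, v, memory_order_release);
}

/* Acquire ordering: pairs with the release store above. */
unsigned int comm_flag_get(struct comm_flag *f)
{
    return atomic_load_explicit(&f->value, memory_order_acquire);
}

/* Stand-in for device code: one polling step, returns 1 if work was done. */
int fake_device_poll(struct comm_flag *f)
{
    if (comm_flag_get(f) != FLAG_READY)
        return 0;
    /* ... the device would process the prepared data here ... */
    comm_flag_set(f, FLAG_DONE);
    return 1;
}

/* CPU side: nothing happens until the flag is raised. */
int comm_flag_demo(void)
{
    struct comm_flag f;

    atomic_init(&f.value, FLAG_IDLE);
    assert(fake_device_poll(&f) == 0);  /* flag not raised yet */

    comm_flag_set(&f, FLAG_READY);      /* CPU signals that data is ready */
    assert(fake_device_poll(&f) == 1);  /* device notices and completes */

    return comm_flag_get(&f) == FLAG_DONE ? 0 : -1;
}
```

A HW-ring plus event-device design, as described for Marvell HW, would replace the polled word with descriptor posting and completion events, which is why the mechanism is hard to generalize.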
Interesting.
It is hard to generalize the communication mechanism.
Do other GPU vendors have a similar communication mechanism? AMD, Intel?
I don't know who to ask in AMD & Intel. Any ideas?
quoted
quoted
Now the bigger question is why we need to Tx and then Rx something to
the compute device.
Isn't it offloading something? If so, why not add those offloads to the
respective subsystem
to improve the subsystem (ethdev, cryptodev, etc.) feature set to adopt
new features, or
introduce a new subsystem (like ML, inline baseband processing) so that
there will be an opportunity to
implement the same in HW or in a compute device? For example, if we take
this path, ML offloading will
be application code like testpmd, which deals with "specific" device
commands (aka glorified rawdev)
to deal with specific computing device offload "COMMANDS".
(The commands will be specific to the offload device; the same code won't
run on other compute devices.)
Having specific features API is convenient for compatibility
between devices, yes, for the set of defined features.
Our approach is to start with a flexible API that the application
can use to implement any processing because with GPU programming,
there is no restriction on what can be achieved.
This approach does not contradict what you propose,
it does not prevent extending existing classes.
It does prevent extending the existing classes, as no one is going to
extend them if there is a path that avoids doing so.
I disagree. Specific API is more convenient for some tasks,
so there is an incentive to define or extend specific device class APIs.
But it should not forbid doing custom processing.
If an application can run only on a specific device, it is similar to
a raw device,
where the device is left undefined (i.e. the job metadata is not defined and
is specific to the device).
quoted
quoted
Just my _personal_ preference is to have specific subsystems to
improve DPDK instead of a raw-device kind of
path. If we decide on another path as a community, it is _fine_ too (from a
_project manager_ point of view it will be an easy path to dump SDK
stuff into DPDK without introducing the pain of the subsystem or
improving DPDK).
Adding a new class API is also improving DPDK.
But the class is similar to the raw dev class. The reason I say this:
job submission and response can be abstracted as queue/dequeue APIs.
Task/job metadata is specific to compute devices (and it cannot be
generalized).
If we generalize it, it makes sense to have a new class that does a
"specific function".
Computing device programming is already generalized with languages like OpenCL.
We should not try to reinvent the same.
We are just trying to properly integrate the concept in DPDK
and allow building on top of it.
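
For reference, the queue/dequeue view of job submission raised earlier in this exchange can be sketched as below: the ring and its enqueue/dequeue calls are generic, while each job carries an opaque metadata pointer that only a given device would understand. All names (job, job_ring, job_enqueue, job_dequeue) are hypothetical, the ring is single-threaded for brevity, and nothing here is taken from rawdev or from the RFC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JOB_RING_SIZE 4u /* must be a power of two for the index mask */

/* Generic envelope; the metadata layout is device-specific and opaque. */
struct job {
    uint64_t id;
    void *metadata;
    size_t metadata_len;
};

/* Single-producer, single-consumer ring without locking (sketch only). */
struct job_ring {
    struct job slots[JOB_RING_SIZE];
    uint32_t head; /* next slot to dequeue */
    uint32_t tail; /* next slot to enqueue */
};

/* Returns 0 on success, -1 if the ring is full. */
int job_enqueue(struct job_ring *r, const struct job *j)
{
    if (r->tail - r->head == JOB_RING_SIZE)
        return -1;
    r->slots[r->tail & (JOB_RING_SIZE - 1)] = *j;
    r->tail++;
    return 0;
}

/* Returns 0 on success, -1 if the ring is empty. */
int job_dequeue(struct job_ring *r, struct job *out)
{
    if (r->head == r->tail)
        return -1;
    *out = r->slots[r->head & (JOB_RING_SIZE - 1)];
    r->head++;
    return 0;
}

/* Two devices with unrelated metadata formats share the same ring API. */
int job_ring_demo(void)
{
    struct job_ring ring = { .head = 0, .tail = 0 };
    int ml_params[2] = { 3, 7 };          /* hypothetical ML job metadata */
    char bb_cmd[] = "decode";             /* hypothetical baseband command */
    struct job a = { 1, ml_params, sizeof(ml_params) };
    struct job b = { 2, bb_cmd, sizeof(bb_cmd) };
    struct job out;

    assert(job_dequeue(&ring, &out) == -1);   /* empty ring */
    assert(job_enqueue(&ring, &a) == 0);
    assert(job_enqueue(&ring, &b) == 0);

    /* Jobs come back in submission order with metadata untouched. */
    assert(job_dequeue(&ring, &out) == 0 && out.id == 1);
    assert(out.metadata == ml_params);
    assert(job_dequeue(&ring, &out) == 0 && out.id == 2);
    assert(out.metadata_len == sizeof(bb_cmd));
    return 0;
}
```

The sketch shows both sides of the argument: the transport generalizes easily, but the application still has to build metadata that matches one particular device.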
