Thread: 128 messages, 11 authors, 2021-11-08

Re: [dpdk-dev] [RFC PATCH v2 0/7] heterogeneous computing library

From: Elena Agostini <hidden>
Date: 2021-09-06 16:12:33

-----Original Message-----
From: Jerin Jacob <redacted>
Sent: Thursday, September 2, 2021 3:12 PM
To: Elena Agostini <redacted>
Cc: Wang, Haiyue <redacted>; NBU-Contact-Thomas
Monjalon [off-list ref]; Jerin Jacob [off-list ref];
dpdk-dev [off-list ref]; Stephen Hemminger
[off-list ref]; David Marchand
[off-list ref]; Andrew Rybchenko
[off-list ref]; Honnappa Nagarahalli
[off-list ref]; Yigit, Ferruh [off-list ref];
techboard@dpdk.org
Subject: Re: [dpdk-dev] [RFC PATCH v2 0/7] heterogeneous computing
library


On Wed, Sep 1, 2021 at 9:05 PM Elena Agostini [off-list ref]
wrote:
-----Original Message-----
From: Wang, Haiyue <redacted>
Sent: Sunday, August 29, 2021 7:33 AM
To: Jerin Jacob <redacted>; NBU-Contact-Thomas Monjalon
[off-list ref]
Cc: Jerin Jacob <redacted>; dpdk-dev <redacted>;
Stephen Hemminger [off-list ref]; David Marchand
[off-list ref]; Andrew Rybchenko
[off-list ref]; Honnappa Nagarahalli
[off-list ref]; Yigit, Ferruh
[off-list ref]; techboard@dpdk.org; Elena Agostini
[off-list ref]
Subject: RE: [dpdk-dev] [RFC PATCH v2 0/7] heterogeneous computing
library


-----Original Message-----
From: Jerin Jacob <redacted>
Sent: Friday, August 27, 2021 20:19
To: Thomas Monjalon <redacted>
Cc: Jerin Jacob <redacted>; dpdk-dev <redacted>;
Stephen Hemminger [off-list ref]; David Marchand
[off-list ref]; Andrew Rybchenko [off-list ref];
Wang, Haiyue [off-list ref]; Honnappa Nagarahalli
[off-list ref]; Yigit, Ferruh [off-list ref];
techboard@dpdk.org; Elena Agostini [off-list ref]
Subject: Re: [dpdk-dev] [RFC PATCH v2 0/7] heterogeneous computing
library

On Fri, Aug 27, 2021 at 3:14 PM Thomas Monjalon
[off-list ref]
wrote:
31/07/2021 15:42, Jerin Jacob:
On Sat, Jul 31, 2021 at 1:51 PM Thomas Monjalon
[off-list ref] wrote:
31/07/2021 09:06, Jerin Jacob:
On Fri, Jul 30, 2021 at 7:25 PM Thomas Monjalon
[off-list ref] wrote:
From: Elena Agostini <redacted>

In a heterogeneous computing system, processing is not only in the CPU.
Some tasks can be delegated to devices working in parallel.

The goal of this new library is to enhance the collaboration between
DPDK, which is primarily a CPU framework, and other types of devices
like GPUs.

When mixing network activity with task processing on a non-CPU device,
the CPU may need to communicate with the device in order to manage
memory, synchronize operations, exchange info, etc.

This library provides a number of new features:
- Interoperability with device-specific libraries through generic handlers
- Possibility to allocate and free memory on the device
- Possibility to allocate and free memory on the CPU but visible from
  the device
- Communication functions to enhance the dialog between the CPU
  and the device

The infrastructure is prepared to welcome drivers in drivers/hc/,
such as the upcoming NVIDIA one, implementing the hcdev API.

Some parts are not complete:
  - locks
  - memory allocation table
  - memory freeing
  - guide documentation
  - integration in devtools/check-doc-vs-code.sh
  - unit tests
  - integration in testpmd to enable Rx/Tx to/from GPU memory.

Since the above line is the crux of the following text, I
will start from this point.

+ Techboard

I can give my honest feedback on this.

I can map similar stuff in Marvell HW, where we do
machine learning as compute offload on a different class
of CPU.

In terms of RFC patch features:

1) Memory API: use cases are aligned.
2) Communication flag and communication list: our structure
is completely different. We use a HW ring kind of
interface to post the job to the compute interface, and the job
completion result arrives through the event device.
This is similar to the DMA API that has been discussed on
the mailing list.
Interesting.
It is hard to generalize the communication mechanism.
Do other GPU vendors (AMD, Intel) have a similar communication mechanism?
I don't know who to ask in AMD & Intel. Any ideas?
Good question.

At least in Marvell HW, our structure is completely different from
the communication flag and communication list: we use a HW ring
kind of interface to post the job to the compute interface, and
the job completion result arrives through the event device.
This is similar to the DMA API that has been discussed on the mailing
list.
Please correct me if I'm wrong but what you are describing is a
specific way to submit work on the device. Communication flag/list
here is a direct data communication between the CPU and some kind of
workload (e.g. GPU kernel) that's already running on the device.
Exactly. What I meant is that the communication flag/list is not generic enough to
express any generic compute device. If all GPUs work this way, we could
make the library name GPU-specific and add a GPU-specific communication
mechanism.
I'm in favor of reverting the name of the library to a more specific gpudev name
instead of hcdev. This library (both memory allocations and fancy features like
communication lists) can be tested on various GPUs, but I'm not sure about
other types of devices.

Again, as an initial step, I would not complicate things.
Let's have a GPU-oriented library for now.
The rationale here is that:
- some work has been already submitted on the device and it's running
- CPU needs a real-time direct interaction through memory with the
device
- the workload on the device needs some info from the CPU it can't get
at submission time
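
The rationale above can be sketched in plain C as a flag in shared memory that the CPU updates while the workload keeps running. This is only an illustration under assumptions: `struct comm_flag`, the `FLAG_*` values, `cpu_set_flag()`, `device_poll_flag()` and `device_step()` are hypothetical names, not the hcdev/gpudev API, and the "device" is an ordinary function rather than a GPU kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values; not taken from the RFC. */
enum { FLAG_IDLE = 0, FLAG_WORK_READY = 1, FLAG_EXIT = 2 };

/* Hypothetical flag that would live in memory visible to CPU and device. */
struct comm_flag {
    _Atomic uint32_t value;
};

/* CPU side: publish a value so the already running workload can see it. */
void cpu_set_flag(struct comm_flag *f, uint32_t v)
{
    assert(f != NULL);
    atomic_store_explicit(&f->value, v, memory_order_release);
}

/* Device side: read the current flag value. */
uint32_t device_poll_flag(struct comm_flag *f)
{
    assert(f != NULL);
    return atomic_load_explicit(&f->value, memory_order_acquire);
}

/*
 * One iteration of a stand-in for a persistent device workload.
 * Returns 0 when the workload should stop, 1 otherwise.
 */
int device_step(struct comm_flag *f, int *processed)
{
    switch (device_poll_flag(f)) {
    case FLAG_WORK_READY:
        (*processed)++;  /* consume the info provided by the CPU */
        atomic_store_explicit(&f->value, FLAG_IDLE, memory_order_release);
        return 1;
    case FLAG_EXIT:
        return 0;
    default:
        return 1;        /* nothing new, keep polling */
    }
}
```

The point of the sketch is that no new job is submitted: the CPU only writes to memory, and the workload reacts on its next poll.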

This is good enough for NVIDIA and AMD GPU.
Need to double check for Intel GPU.
Now the bigger question is why we need to Tx and then Rx
something to the compute device. Isn't it offloading
something? If so, why not add those offloads in the
respective subsystem to improve the subsystem (ethdev,
cryptodev, etc.) feature set to adapt to new features, or
introduce a new subsystem (like ML, inline baseband
processing), so that there is an opportunity to implement
the same in HW or in a compute device? For example, if we take
this path, ML offloading will be application code like
testpmd, which deals with "specific" device commands (aka
glorified rawdev) to deal with specific computing device
offload "COMMANDS".
(The commands will be specific to the offload device; the
same code won't run on another compute device.)
Having specific features API is convenient for compatibility
between devices, yes, for the set of defined features.
Our approach is to start with a flexible API that the
application can use to implement any processing because with
GPU programming, there is no restriction on what can be
achieved.
This approach does not contradict what you propose, it does
not prevent extending existing classes.
It does prevent extending the existing classes, as no one is
going to extend them if there is a path of not doing so.
I disagree. Specific API is more convenient for some tasks, so
there is an incentive to define or extend specific device class APIs.
But it should not forbid doing custom processing.
This is the same as the raw device in DPDK, where the device
personality is not defined.

Even if we define another API, if the personality is not defined,
it ends up similar to the raw device, i.e. similar to rawdev enqueue
and dequeue.

To summarize,

1) My _personal_ preference is to have specific subsystems to
improve the DPDK instead of the raw device kind of path.
Something like rte_memdev to focus on device (GPU) memory
management?
The new DPDK auxiliary bus may make life easier to solve the
complex heterogeneous computing library. ;-)
To get a concrete idea about what's the best and most comprehensive
approach, we should start with something that's flexible and simple
enough.
A dedicated library is a good starting point: easy to implement and
embed in DPDK applications, isolated from other components, and users
can play with it and learn from the code.
As a second step, we can think about embedding the functionality in some
other way within DPDK (e.g. split memory management and communication
features).
2) If the device personality is not defined, use rawdev.
3) Not all computing devices use a "communication flag" and
"communication list" kind of structure. If we are targeting a generic
computing device, then that is not a portable scheme.
For GPU abstraction, if "communication flag" and "communication list"
is the right kind of mechanism, then we can have a separate library
for GPU communication, specific to GPU <-> DPDK communication needs
and explicit for GPU.

I think generic DPDK applications like testpmd should not be polluted
with device-specific functions, such as calling device-specific
messages from the application, which makes the application run on
only one device. I don't have a strong opinion (except on
standardizing "communication flag" and "communication list" as a
generic computing device communication mechanism) if others think
it is OK to do it that way in DPDK.
I'd like to introduce (with a dedicated option) the memory API in
testpmd to provide an example of how to TX/RX packets using device
memory.

I am not sure how it can notify the GPU and get back to the CPU without
embedding a sideband communication mechanism. If you could share an example
API sequence, that would help us understand the level of coupling with testpmd.
There is no need for a communication mechanism here.
Assuming there is no workload to process network packets (to not complicate
things), the steps are:
1) Create a DPDK mempool with device external memory using the hcdev (or gpudev) library
2) Use that mempool to tx/rx/fwd packets

As an example, you can look at my l2fwd-nv application here: https://github.com/NVIDIA/l2fwd-nv
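
To make step 1 above more concrete, here is a minimal sketch in plain C of a fixed-size buffer pool carved out of an externally provided memory region. Everything in it is an assumption for illustration: `struct ext_pool`, `ext_pool_init()`, `ext_pool_get()` and `ext_pool_put()` are hypothetical names, and an ordinary byte array stands in for device memory. In DPDK itself the closest existing primitive is probably `rte_pktmbuf_pool_create_extbuf()`; the sketch does not call it and does not claim to mirror its behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXT_POOL_MAX_BUFS 64  /* hypothetical upper bound for the sketch */

/* Hypothetical pool of fixed-size buffers living in external memory. */
struct ext_pool {
    uint8_t *base;                         /* start of the external region */
    size_t buf_size;                       /* size of each buffer */
    unsigned int n_bufs;                   /* total buffers in the pool */
    unsigned int n_free;                   /* buffers currently free */
    uint8_t *free_list[EXT_POOL_MAX_BUFS]; /* stack of free buffers */
};

/* Split the region into buffers; returns 0 on success, -1 on bad input. */
int ext_pool_init(struct ext_pool *p, void *region, size_t region_len,
                  size_t buf_size)
{
    size_t n, i;

    if (p == NULL || region == NULL || buf_size == 0)
        return -1;
    n = region_len / buf_size;
    if (n == 0)
        return -1;
    if (n > EXT_POOL_MAX_BUFS)
        n = EXT_POOL_MAX_BUFS;

    p->base = region;
    p->buf_size = buf_size;
    p->n_bufs = (unsigned int)n;
    p->n_free = (unsigned int)n;
    for (i = 0; i < n; i++)
        p->free_list[i] = p->base + i * buf_size;
    return 0;
}

/* Take one buffer (e.g. for an Rx descriptor); NULL if the pool is empty. */
void *ext_pool_get(struct ext_pool *p)
{
    if (p->n_free == 0)
        return NULL;
    return p->free_list[--p->n_free];
}

/* Return a buffer to the pool (e.g. after Tx completion). */
void ext_pool_put(struct ext_pool *p, void *buf)
{
    assert(p->n_free < p->n_bufs);
    p->free_list[p->n_free++] = buf;
}
```

Step 2 then only needs the pool handle: the Rx/Tx path takes and returns buffers without knowing where the backing memory lives.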
I agree to not embed communication flag/list features.
If an application can run only on a specific device, it is
similar to a raw device, where the device definition is not
defined (i.e. JOB metadata is not defined and it is specific
to the device).
Just my _personal_ preference is to have specific
subsystems to improve the DPDK instead of the raw device kind
of path. If we decide on another path as a community, it is
_fine_ too (from a _project manager_ point of view it will be
an easy path to dump SDK stuff to DPDK without introducing
the pain of the subsystem nor improving the DPDK).
Adding a new class API is also improving DPDK.
But the class is similar to the rawdev class. The reason I say so
is that job submission and response can be abstracted as
enqueue/dequeue APIs.
Task/Job metadata is specific to compute devices (and it can
not be generalized).
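
The rawdev-like shape described here, generic enqueue/dequeue with device-specific job metadata, could look roughly like the following plain C sketch. The names `struct job`, `struct job_ring`, `job_enqueue()` and `job_dequeue()` are hypothetical and are not the rawdev API; the ring is single-threaded and only illustrates that the queueing part is generic while `dev_meta` stays opaque.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JOB_RING_SIZE 8u  /* must be a power of two */

/* Hypothetical job descriptor: the ring never looks inside dev_meta. */
struct job {
    uint32_t opcode;   /* meaning is defined by the specific device */
    void *dev_meta;    /* opaque, device-specific metadata */
};

/* Hypothetical single-producer, single-consumer ring, no locking. */
struct job_ring {
    struct job slots[JOB_RING_SIZE];
    uint32_t head;     /* next slot to write */
    uint32_t tail;     /* next slot to read */
};

/* Post a job; returns 0 on success, -1 if the ring is full. */
int job_enqueue(struct job_ring *r, const struct job *j)
{
    assert(r != NULL && j != NULL);
    if (r->head - r->tail == JOB_RING_SIZE)
        return -1;
    r->slots[r->head & (JOB_RING_SIZE - 1u)] = *j;
    r->head++;
    return 0;
}

/* Fetch the oldest job; returns 0 on success, -1 if the ring is empty. */
int job_dequeue(struct job_ring *r, struct job *out)
{
    assert(r != NULL && out != NULL);
    if (r->head == r->tail)
        return -1;
    *out = r->slots[r->tail & (JOB_RING_SIZE - 1u)];
    r->tail++;
    return 0;
}
```

An application written against such an interface still has to know what each `opcode` and `dev_meta` mean for its device, which is the portability concern raised in this part of the thread.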
If we generalize it, it makes sense to have a new class that does
a "specific function".
Computing device programming is already generalized with
languages like OpenCL.
We should not try to reinvent the same.
We are just trying to properly integrate the concept in DPDK and
allow building on top of it.
Agree.
See above.