Thread: 60 messages, 6 authors, 2015-10-28

Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

From: Parav Pandit <hidden>
Date: 2015-09-11 03:40:05
Also in: linux-rdma, lkml

On Fri, Sep 11, 2015 at 1:52 AM, Tejun Heo [off-list ref] wrote:
Hello, Parav.

On Thu, Sep 10, 2015 at 11:16:49PM +0530, Parav Pandit wrote:
These resources include QP (queue pair) to transfer data, CQ
(completion queue) to indicate completion of a data transfer operation,
and MR (memory region) to represent user application memory as source
or destination for data transfer.
Common resources are QP, SRQ (shared receive queue), CQ, MR, AH
(address handle), FLOW, PD (protection domain), user context etc.
It's kinda bothering that all these are disparate resources.
Actually not. They are linked resources. Every QP needs one or two
associated CQs and one PD.
Every QP will use a few MRs for data transfer.
So, if that's the case, let's please implement something higher level.
The goal is providing reasonable isolation or protection.  If that can
be achieved at a higher level of abstraction, please do that.
Here is a good programming guide to the RDMA APIs exposed to
user space applications.

http://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf
So the first version of the cgroups patch will address the control
operations in section 3.4.
I suppose that each restriction comes from the underlying hardware and
there's no accepted higher level abstraction for these things?
There is a higher level abstraction, currently through the verbs
layer, which does expose the hardware resources but in a vendor
agnostic way.
Many vendors support this verbs layer. The ones I know of are
Mellanox, Intel, Chelsio and Avago/Emulex, whose drivers supporting
these verbs are in the <drivers/infiniband/hw/> kernel tree.

There are higher level APIs above the verbs layer, such as MPI,
libfabric, rsocket, rds, pgas and dapl, which use the underlying verbs
layer. They all rely on the hardware resources. All of these higher
level abstractions are accepted and well used by certain application
classes. It would be a long discussion to go over them here.
Well, the programming interface that userland builds on top doesn't
matter too much here but if there is a common resource abstraction
which can be made in terms of constructs that consumers of the
facility would care about, that likely is a better choice than
exposing whatever hardware exposes.
Tejun,
The fact is that user level applications use hardware resources.
The verbs layer is a software abstraction for them. Drivers hide how
they implement a QP or CQ or whatever hardware resource they
project via the API layer.
For all of the userland on top of the verbs layer I mentioned above,
the common resource abstraction is these resources: AH, QP, CQ, MR etc.
Hardware (and the driver) might have a different view of these
resources in their real implementation.
For example, the verbs layer can say that it has 100 QPs, but the
hardware might actually have 20 QPs, and the driver decides how to use
them efficiently.
I'm doubtful that these things are gonna be mainstream w/o building up
higher level abstractions on top and if we ever get there we won't be
talking about MR or CQ or whatever.
Some of the higher level examples I gave above will adapt to resource
allocation failure. Some actually adapt to a few resource allocation
failures; they do query resources. But it's not completely there yet.
Once we have this notion of limited resources in place, the
abstraction layers would adapt to relatively smaller values of such
resources.

These higher level abstractions are mainstream. They are shipped at
least in Red Hat Enterprise Linux.
Again, I was talking more about resource abstraction - e.g. something
along the line of "I want N command buffers".
Yes. We are still talking about resource abstraction here.
RDMA and IBTA define these resources. Various frameworks are built
on top of these resources.
So, for example, when userland is tuning the environment for deploying
an MPI application, it would configure:
10 processes from the PID controller,
10 CPUs in the cpuset controller,
1 PD, 20 CQs, 10 QPs, 100 MRs in the rdma controller.

Say userland is tuning the environment for deploying an rsocket
application for 100 connections;
it would configure 100 PDs, 100 QPs, 200 MRs.
When they see failures at the verbs layer, they will adapt to live
with what they have, at lower performance.
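
To make the accounting idea above concrete, here is a minimal userspace
sketch in Python. It is not the patch and not a kernel interface: the
names RdmaCgroup, try_charge, uncharge and create_qps are hypothetical,
and the limits are just the MPI numbers from the example above. It only
shows a per-cgroup counter being charged on each allocation, and a
consumer that stops asking once the limit is hit and lives with what it
got.

```python
from dataclasses import dataclass, field


class RdmaLimitExceeded(Exception):
    """Raised when a charge would push usage past the configured limit."""


@dataclass
class RdmaCgroup:
    # Hypothetical model: resource name -> maximum count for this cgroup.
    limits: dict
    # Resource name -> count currently charged to this cgroup.
    usage: dict = field(default_factory=dict)

    def try_charge(self, resource, count=1):
        used = self.usage.get(resource, 0)
        limit = self.limits.get(resource)
        # A resource with no configured limit is treated as unlimited.
        if limit is not None and used + count > limit:
            raise RdmaLimitExceeded(f"{resource}: {used} + {count} > {limit}")
        self.usage[resource] = used + count

    def uncharge(self, resource, count=1):
        # Never let the counter go negative on a spurious release.
        self.usage[resource] = max(0, self.usage.get(resource, 0) - count)


def create_qps(cgroup, wanted):
    """Try to get `wanted` QPs; settle for fewer if the limit is reached."""
    created = 0
    for _ in range(wanted):
        try:
            cgroup.try_charge("qp")
        except RdmaLimitExceeded:
            break  # adapt: run with what was granted, at lower performance
        created += 1
    return created


# Limits taken from the MPI example above.
mpi_cgroup = RdmaCgroup(limits={"pd": 1, "cq": 20, "qp": 10, "mr": 100})
granted = create_qps(mpi_cgroup, wanted=16)
print(f"asked for 16 QPs, granted {granted}")
```

With these limits the application asks for 16 QPs and is granted 10.
Releasing one with uncharge("qp") makes room for exactly one more.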

Since every higher level layer I mentioned differs in the way it
uses RDMA resources, we cannot generalize it as "N command buffers".
That generalization, in my mind, is the rdma resources themselves, the
central common entity.
Also, whatever next-gen is
unlikely to have enough commonalities when the proposed resource knobs
are this low level,
I agree that the resources won't be common in other next-gen transports
whenever they arrive.
But from my existing background working on some of those transports,
they appear similar in nature and might need similar knobs.
I don't know.  What's proposed in this thread seems way too low level
to be useful anywhere else.  Also, what if there are multiple devices?
Is that a problem to worry about?
o.k. It doesn't have to be useful anywhere else. If it suffices for the
needs of RDMA applications, it's fine for the near future.
This patch allows limiting resources across multiple devices.
As we go along, if a requirement comes up to have knobs on a
per device basis, that's something we can extend in the future.
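
As an illustration of the difference between a limit across devices
and a per device knob, here is a small Python sketch. It assumes that
"across multiple devices" means one pooled count per cgroup and
resource type; that reading, the PooledLimit class and the device
names dev0/dev1 are all hypothetical and not taken from the patch.

```python
from collections import Counter


class PooledLimit:
    """One limit shared by every device (hypothetical model)."""

    def __init__(self, limit):
        self.limit = limit
        # Usage is tracked per device only for reporting; the limit
        # check below is against the sum over all devices.
        self.per_device = Counter()

    def total(self):
        return sum(self.per_device.values())

    def try_charge(self, device, count=1):
        if self.total() + count > self.limit:
            return False
        self.per_device[device] += count
        return True


qp_pool = PooledLimit(limit=10)
results = [
    qp_pool.try_charge("dev0", 6),  # fits: 6 of 10 used
    qp_pool.try_charge("dev1", 4),  # fits: 10 of 10 used
    qp_pool.try_charge("dev1", 1),  # refused: the pool is exhausted
]
print(results, dict(qp_pool.per_device))
```

A per device knob would instead keep one limit per entry in
per_device, which is the kind of extension that could be added later
if the requirement comes up.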
I would let you make the call.
RDMA and others are just another type of device with different
characteristics than character or block, so one device cgroup with sub
functionalities can allow setting knobs.
Every device category will have its own set of knobs for resources,
ACL, limits, policy.
I'm kinda doubtful we're gonna have too many of these.  Hardware
details being exposed to userland this directly isn't common.
It's common in RDMA applications. Again, they may not be real hardware
resources; it's just the API layer which defines those RDMA constructs.
And I think cgroup is certainly a better control point than sysfs or
spinning up new control infrastructure for this.
That said, I would like to hear your and the community's views on how
they would like to see this shaping up.
I'd say keep it simple and do the minimum. :)
o.k. In that case a new rdma cgroup controller which does rdma resource
accounting is possibly the simplest form?
Does that make sense?
Thanks.

--
tejun