Thread: 76 messages, 11 authors, 2025-04-22

Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction

From: Nikolay Aleksandrov <hidden>
Date: 2025-03-19 14:02:36
Also in: linux-rdma

On 3/19/25 15:52, Jason Gunthorpe wrote:
> On Fri, Mar 14, 2025 at 02:53:40PM +0000, Bernard Metzler wrote:
>> I assume the correct way forward is to first clarify the
>> structure of all user-visible objects that need to be
>> created/controlled/destroyed, and to route them through
>> this interface. Some will require extensions to given objects,
>> some may be new, some will be as-is. rdma_netlink will probably
>> be the right interface to look at for job control.
>
> As I understand the job ID model you will need to have some privileged
> entity to create a "job ID file descriptor" that can be passed around
> to unprivileged processes to grant them access to the job ID. This is
> necessary since the Job ID becomes part of the packet headers and we
> must secure userspace to prevent hijacking or spoofing of these values
> on the wire.
>
> Netlink has a major downside that you can't use filesystem ACL
> permissions to control access, so building a low privilege daemon just
> to do job id management seems to me to be more difficult.
>
> As an example, I would imagine having a job management char device
> with a filesystem ACL that only allows something like SLURM's
> privileged orchestrator to talk to it. SLURM wouldn't have something
> like CAP_NET_ADMIN. SLURM would set up the job ID and pass the "Job ID
> FD" to the actual MPI workload processes to grant them permission to
> use those network headers.
>
> Nobody else in the system can create Job IDs besides SLURM, and in a
> multi-user environment one user cannot reach into the other and hijack
> their job ID because the FD does not leak outside the MPI process
> tree.
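
For readers unfamiliar with the FD-passing model described above, here is a
minimal userspace sketch of handing an open file descriptor to another
process over an AF_UNIX socket with SCM_RIGHTS. The "Job ID FD" itself is
hypothetical (no such object exists in this RFC), so any open descriptor
stands in for it; send_fd() and recv_fd() are illustrative names, not an
existing API.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send the open descriptor 'fd' over the connected AF_UNIX socket 'sock'.
 * Returns 0 on success, -1 on failure. */
int send_fd(int sock, int fd)
{
	char byte = 'J';	/* one byte of ordinary data carries the control message */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment of buf */
	} ctrl;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&ctrl, 0, sizeof(ctrl));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	assert(cmsg != NULL);	/* the buffer was sized for exactly one header */
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from 'sock'. Returns the new fd, or -1. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} ctrl;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

In the model above, the orchestrator would call something like send_fd() on
a socket shared with each workload process, and only processes that receive
the descriptor (or inherit it) could use the job.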

> This RFC doesn't describe the intended security model, but I'm very
> surprised to see ultraeth_nl_job_new_doit() not do any capability
> checks, or any security whatsoever around access to the job.

It doesn't need to do any capability checking itself because that is declared
in the YAML model: the ops there carry flags: [ admin-perm ], so the
automatically generated genl ops code gets
.flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO for these ops, which in turn means
the genetlink code will check that the caller has CAP_NET_ADMIN. An
unprivileged process can request to associate with multiple jobs, and it is the
privileged process that has to configure and control them. In this version we
have only the configuration part. Once the specs become publicly available we
will be able to share more information about how it's expected to work.
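
To illustrate the flag-driven check, here is a simplified userspace model of
the dispatch step, not the kernel code itself. The two flag values match
include/uapi/linux/genetlink.h; struct model_op, model_dispatch(), the command
numbers and the boolean standing in for the CAP_NET_ADMIN test are all made up
for the sketch (in the kernel, as far as I can tell, the corresponding test in
net/netlink/genetlink.c uses netlink_capable()).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values as defined in include/uapi/linux/genetlink.h. */
#define GENL_ADMIN_PERM		0x01
#define GENL_CMD_CAP_DO		0x02

/* Hypothetical, simplified stand-in for a generated genl op entry. */
struct model_op {
	unsigned int cmd;
	unsigned int flags;
	int (*doit)(void);
};

/* Hypothetical command numbers. MODEL_CMD_PLAIN exists only for contrast
 * and is not taken from the RFC. */
enum { MODEL_CMD_PLAIN = 1, MODEL_CMD_JOB_NEW = 2 };

static int plain_doit(void)   { return 0; }
static int job_new_doit(void) { return 0; }	/* stands in for ultraeth_nl_job_new_doit() */

/* What the generated ops table might look like in this model. */
const struct model_op model_ops[] = {
	{ .cmd = MODEL_CMD_PLAIN,   .flags = GENL_CMD_CAP_DO,
	  .doit = plain_doit },
	{ .cmd = MODEL_CMD_JOB_NEW, .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
	  .doit = job_new_doit },
};

/* The core checks the op flags before the handler ever runs, so the
 * handler does not need a capability check of its own. */
int model_dispatch(unsigned int cmd, bool caller_has_net_admin)
{
	size_t i;

	for (i = 0; i < sizeof(model_ops) / sizeof(model_ops[0]); i++) {
		const struct model_op *op = &model_ops[i];

		if (op->cmd != cmd)
			continue;
		if ((op->flags & GENL_ADMIN_PERM) && !caller_has_net_admin)
			return -EPERM;
		assert(op->doit != NULL);
		return op->doit();
	}
	return -EOPNOTSUPP;	/* unknown command */
}
```

With this model, model_dispatch(MODEL_CMD_JOB_NEW, false) fails with -EPERM
without calling the handler, while the same call with true reaches it.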

Cheers,
 Nik
