Re: [dpdk-dev] [RFC 0/4] SocketPair Broker support for vhost and virtio-user.

From: Ilya Maximets <i.maximets@ovn.org>
Date: 2021-03-24 13:11:37

On 3/24/21 1:05 PM, Stefan Hajnoczi wrote:
On Tue, Mar 23, 2021 at 04:54:57PM -0400, Billy McFall wrote:
On Tue, Mar 23, 2021 at 3:52 PM Ilya Maximets [off-list ref] wrote:
On 3/23/21 6:57 PM, Adrian Moreno wrote:
On 3/19/21 6:21 PM, Stefan Hajnoczi wrote:
On Fri, Mar 19, 2021 at 04:29:21PM +0100, Ilya Maximets wrote:
On 3/19/21 3:05 PM, Stefan Hajnoczi wrote:
On Thu, Mar 18, 2021 at 08:47:12PM +0100, Ilya Maximets wrote:
On 3/18/21 6:52 PM, Stefan Hajnoczi wrote:
On Wed, Mar 17, 2021 at 09:25:26PM +0100, Ilya Maximets wrote:

And some housekeeping is usually required for applications in case the
socket server terminated abnormally and socket files were left on the
file system:
 "failed to bind to vhu: Address already in use; remove it and try again"

QEMU avoids this by unlinking before binding. The drawback is that users
might accidentally hijack an existing listen socket, but that can be
solved with a pidfile.

How exactly could this be solved with a pidfile?

A pidfile prevents two instances of the same service from running at the
same time.

The same effect can be achieved by the container orchestrator, systemd,
etc. too, because it refuses to run the same service twice.
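
To make the pidfile idea concrete, here is a minimal Python sketch of a
lock-protected pidfile. It is an illustration only: the thread does not say
how QEMU or any orchestrator implements this, and the helper name
acquire_pidfile, the file name, and the flock(2)-based locking are
assumptions chosen for the example.

```python
import fcntl
import os
import tempfile


def acquire_pidfile(path):
    """Return an fd holding an exclusive lock on `path`, or None if
    another holder (another instance of the service) already has it."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Non-blocking exclusive lock; fails at once if already held.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    # Record our PID for humans and tools; the lock is what protects us.
    os.ftruncate(fd, 0)
    os.write(fd, f"{os.getpid()}\n".encode())
    return fd


def demo_pidfile():
    # Hypothetical pidfile location inside a scratch directory.
    path = os.path.join(tempfile.mkdtemp(), "vhost-backend.pid")
    first = acquire_pidfile(path)    # first instance gets the lock
    second = acquire_pidfile(path)   # second instance is refused
    os.close(first)                  # the lock is released with the fd
    third = acquire_pidfile(path)    # a later start succeeds again
    os.close(third)
    return first is not None, second is None, third is not None
```

A service built this way would unlink and re-bind its socket only after
taking the lock. As the reply that follows points out, this only helps when
both parties agree on the same pidfile.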

Sure. I understand that.  My point was that these could be 2 different
applications and they might not know which process to look for.

And what if this is
a different application that tries to create a socket on the same path?
e.g. QEMU creates a socket (started in server mode) and the user
accidentally creates a dpdkvhostuser port in Open vSwitch instead of
dpdkvhostuserclient.  This way the rte_vhost library will try to bind
to an existing socket file and will fail.  Subsequently, port creation
in OVS will fail.  We can't allow OVS to unlink files because this
way OVS users will have the ability to unlink random sockets that OVS has
access to, and we also have no idea if it was QEMU that created the file
or it was a virtio-user application or someone else.

If rte_vhost unlinks the socket then the user will find that networking
doesn't work. They can either hot unplug the QEMU vhost-user-net device
or restart QEMU, depending on whether they need to keep the guest
running or not. This is a misconfiguration that is recoverable.

True, it's recoverable, but at a high cost.  Restart of a VM is rarely
desirable.  And the application inside the guest might not behave
well after a hot re-plug of a device that it actively used.  I'd expect
a DPDK application that runs inside a guest on some virtio-net device
to crash after this kind of manipulation.  Especially if it uses some
older version of DPDK.

This unlink issue is probably something we think differently about.
There are many ways for users to misconfigure things when working with
system tools. If it's possible to catch misconfigurations, that is
preferable. In this case it's just the way pathname AF_UNIX domain
sockets work and IMO it's better not to have problems starting the
service due to stale files than to insist on preventing
misconfigurations. QEMU and DPDK do this differently and both seem to be
successful, so ¯\_(ツ)_/¯.
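
The two behaviors being compared can be reproduced with a small Python
sketch. It is a rough model, not code from QEMU or rte_vhost: the function
names and the socket file name are made up for the example, and the only
thing it relies on is the standard behavior of pathname AF_UNIX sockets.

```python
import errno
import os
import socket
import tempfile


def listen_strict(path):
    """Bind without touching an existing file (fails on a stale socket)."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
    except OSError:
        sock.close()
        raise
    sock.listen(1)
    return sock


def listen_unlink_first(path):
    """Remove whatever is at `path`, then bind (unlink-before-bind)."""
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass
    return listen_strict(path)


def demo_stale_socket():
    path = os.path.join(tempfile.mkdtemp(), "vhu")
    listen_strict(path).close()          # "server" exits without unlinking
    left_behind = os.path.exists(path)   # the socket file is now stale
    try:
        listen_strict(path).close()
        strict_errno = None
    except OSError as exc:
        strict_errno = exc.errno         # "Address already in use"
    server = listen_unlink_first(path)   # succeeds despite the stale file
    server.close()
    os.unlink(path)
    return left_behind, strict_errno
```

Note that listen_unlink_first() removes the path whether or not a live
server still owns it, which is exactly the hijack risk raised earlier in
the thread.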

Regarding letting OVS unlink files, I agree that it shouldn't if this
creates a security issue. I don't know the security model of OVS.

In general, the privileges of an ovs-vswitchd daemon might be completely
different from the privileges required to invoke control utilities or
to access the configuration database.  So, yes, we should not allow
that.

That can be locked down by restricting the socket path to a file beneath
/var/run/ovs/vhost-user/.

There are, probably, ways to detect if there is any live process that
has this socket open, but that sounds like too much for this purpose.
Also, I'm not sure if it's possible if the actual user is in a different
container.
So I don't see a good reliable way to detect these conditions.  It
falls on the shoulders of higher-level management software or the user to
clean these socket files up before adding ports.

Does OVS always run in the same net namespace (pod) as the DPDK
application? If yes, then abstract AF_UNIX sockets can be used. Abstract
AF_UNIX sockets don't have a filesystem path and the socket address
disappears when there is no process listening anymore.
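
A short Python sketch of that property, assuming Linux (the abstract
namespace is a Linux extension) and using a made-up socket name:

```python
import os
import socket


def demo_abstract_socket():
    # A leading NUL byte selects the abstract namespace (Linux only).
    # The name itself is hypothetical.
    name = b"\0vhost-user-demo-%d" % os.getpid()

    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(name)                # no file is created anywhere
    server.listen(1)

    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.connect(name)             # works while a listener exists
    client.close()
    server.close()                   # listener gone: the address vanishes

    late = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        late.connect(name)
        refused = False
    except ConnectionRefusedError:
        refused = True
    finally:
        late.close()

    # The same name can be bound again with no cleanup step.
    again = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    again.bind(name)
    again.close()
    return refused
```

There is no stale file to unlink, but the name is only visible to processes
in the same network namespace, which is why the question about where OVS
runs matters.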

OVS is usually started right on the host in the main network namespace.
If it's started in a pod, it will run in a separate container but
configured with host networking.  Applications almost exclusively run
in separate pods.

Okay.

This patch-set aims to eliminate most of the inconveniences by
leveraging an infrastructure service provided by a SocketPair Broker.

I don't understand yet why this is useful for vhost-user, where the
creation of the vhost-user device backend and its use by a VMM are
closely managed by one piece of software:

1. Unlink the socket path.
2. Create, bind, and listen on the socket path.
3. Instantiate the vhost-user device backend (e.g. talk to DPDK/SPDK
   RPC, spawn a process, etc) and pass in the listen fd.
4. In the meantime the VMM can open the socket path and call connect(2).
   As soon as the vhost-user device backend calls accept(2) the
   connection will proceed (there is no need for sleeping).

This approach works across containers without a broker.
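
The four steps above can be sketched in Python as follows. This is a
single-process model under stated assumptions: the "VMM" and "backend" are
just sockets in one process, the socket file name is invented, and handing
over the listen fd is modeled with dup() rather than a real process spawn
or RPC.

```python
import os
import socket
import tempfile


def demo_listen_fd_handoff():
    path = os.path.join(tempfile.mkdtemp(), "vhost-user.sock")

    # 1. Unlink the socket path (nothing is there on a fresh run).
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass

    # 2. Create, bind, and listen on the socket path.
    listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    listener.bind(path)
    listener.listen(1)

    # 4. The "VMM" connects before the backend has called accept(2);
    #    the kernel queues the connection in the listen backlog.
    vmm = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    vmm.connect(path)
    vmm.sendall(b"hello")

    # 3. The "backend" receives the listen fd. Here it is a dup() in the
    #    same process; a real manager would pass it to another process.
    backend = socket.socket(fileno=os.dup(listener.fileno()))
    conn, _ = backend.accept()
    data = conn.recv(5)

    for sock in (conn, backend, vmm, listener):
        sock.close()
    os.unlink(path)
    return data
```

The point of the ordering is that connect(2) does not have to wait or retry:
the connection is already queued by the time the backend accepts it.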

Not sure if I fully understood the question here, but anyway.

This approach works fine if you know what application to run.
In case of a k8s cluster, it might be a random DPDK application
with virtio-user ports that runs inside a container and wants to
have a network connection.  Also, this application needs to run
virtio-user in server mode, otherwise a restart of OVS will
require a restart of the application.  So, you basically need to
rely on a third-party application to create a socket with the right
name and in the correct location that is shared with the host, so
OVS can find it and connect.

In the VM world everything is much simpler, since you have
libvirt and QEMU that will take care of all of this stuff
and which are also under full control of management software
and a system administrator.
In case of a container with a "random" DPDK application inside
there is no such entity that can help.  Of course, some solution
might be implemented in the docker/podman daemon to create and manage
outside-looking sockets for an application inside the container,
but that is not available today AFAIK and I'm not sure if it
ever will be.

Wait, when you say there is no entity like management software or a
system administrator, then how does OVS know to instantiate the new
port? I guess something still needs to invoke ovs-vsctl add-port?

I didn't mean that there is no application that configures
everything.  Of course, there is.  I mean that there is no
entity that abstracts all that socket machinery from the user's
application that runs inside the container.  QEMU hides all the
details of the connection to the vhost backend and presents the device
as a PCI device, which the guest kernel wraps in a network
interface.  So, the application inside the VM shouldn't care that there
is actually a socket connected to OVS that implements the backend and
forwards traffic somewhere.  For the application it's just a usual
network interface.
But in the container world, the application has to handle all
that itself by creating a virtio-user device that will connect to some
socket that has OVS on the other side.

Can you describe the steps used today (without the broker) for
instantiating a new DPDK app container and connecting it to OVS?
Although my interest is in the vhost-user protocol I think it's
necessary to understand the OVS requirements here and I know little
about them.

I might describe some things wrong since I last worked with k8s and CNI
plugins ~1.5 years ago, but the basic schema will look
something like this:

1. user decides to start a new pod and requests k8s to do that
   via cmdline tools or some API calls.

2. k8s scheduler looks for available resources asking resource
   manager plugins, finds an appropriate physical host and asks
   the kubelet daemon local to that node to launch a new pod there.

When the CNI is called, the pod has already been created, i.e. a PodID exists
and so does an associated network namespace. Therefore, everything that has to
do with the runtime spec such as mountpoints or devices cannot be modified by
this time.

That's why the Device Plugin API is used to modify the Pod's spec before the CNI
chain is called.

3. kubelet asks local CNI plugin to allocate network resources
   and annotate the pod with required mount points, devices that
   need to be passed in and environment variables.
   (this is, IIRC, a gRPC connection.  It might be a multus-cni
   or kuryr-kubernetes or any other CNI plugin.  CNI plugin is
   usually deployed as a system DaemonSet, so it runs in a
   separate pod.)

4. Assuming that a vhost-user connection is requested in server mode,
   the CNI plugin will:
   4.1 create a directory for a vhost-user socket.
   4.2 add this directory to pod annotations as a mount point.

I believe this is not possible, it would have to inspect the pod's spec or
otherwise determine an existing mount point where the socket should be created.

Uff.  Yes, you're right.  Thanks for your clarification.
I mixed up CNI and Device Plugin here.

CNI itself is not able to annotate new resources to the pod, i.e.
create new mounts or something like this.  And I don't recall any
vhost-user device plugins.  Are there any?  There is an SR-IOV device
plugin, but its purpose is to allocate and pass PCI devices, not create
mounts for vhost-user.

So, IIUC, right now the user must create the directory and specify
a mount point in a pod spec file or pass the whole /var/run/openvswitch
or something like this, right?

Looking at userspace-cni-network-plugin, it actually just parses
annotations to find the shared directory and fails if there is
none:

https://github.com/intel/userspace-cni-network-plugin/blob/master/userspace/userspace.go#L122

And the examples suggest specifying a directory to mount:

https://github.com/intel/userspace-cni-network-plugin/blob/master/examples/ovs-vhost/userspace-ovs-pod-1.yaml#L41

Looks like this is done by hand by the user.

Yes, I am one of the primary authors of Userspace CNI. Currently, the
directory is created by hand. The long-term thought was to have a mutating
webhook/admission controller inject a directory into the podspec.  Not sure
if it has changed, but I think when I was originally doing this work, OvS
only let you choose the directory at install time, so it has to be
something like /var/run/openvswitch/. You can choose the socketfile name
and maybe a subdirectory off the main directory, but not the full path.

One of the issues I was trying to solve was making sure ContainerA couldn't
see ContainerB's socketfiles. That's where the admission controller could
create a unique subdirectory for each container under
/var/run/openvswitch/. But this was more of a PoC CNI and other work items
always took precedence so that work never completed.

If the CNI plugin has access to the container's network namespace, could
it create an abstract AF_UNIX listen socket?

That way the application inside the container could connect to an
AF_UNIX socket and there is no need to manage container volumes.

I'm not familiar with Open vSwitch, so I'm not sure if there is a
sane way of passing the listen socket fd into ovs-vswitchd from the CNI
plugin?

The steps:
1. CNI plugin enters container's network namespace and opens an abstract
   AF_UNIX listen socket.
2. CNI plugin passes the listen socket fd to OVS. This is the ovs-vsctl
   add-port step. Instead of using type=dpdkvhostuserclient
   options:vhost-server-path=/tmp/dpdkvhostclient0 it instead creates a
   dpdkvhostuser server with the listen fd.

For this step you will need a side channel, i.e. a separate unix socket
created by ovs-vswitchd (most likely, created by rte_vhost on
rte_vhost_driver_register() call).

The problem is that ovs-vsctl talks with ovsdb-server and adds the new
port -- just a new row in the 'interface' table of the database.
ovs-vswitchd receives the update from the database and creates the actual
port.  All the communication is done through JSONRPC, so passing fds is
not an option.

3. When the container starts, it connects to the abstract AF_UNIX
   socket. The abstract socket name is provided to the container at
   startup time in an environment variable. The name is unique, at least
   to the pod, so that multiple containers in the pod can run vhost-user
   applications.
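
For reference, the fd hand-off in step 2 and the connect in step 3 could
look roughly like the Python sketch below. Everything here is an assumption
made for illustration: Linux abstract sockets, a socketpair standing in for
the side channel discussed above, an invented socket name, an invented
environment variable name, and helper functions send_fd/recv_fd that are not
part of OVS, rte_vhost, or any CNI plugin. Entering the container's network
namespace is left out because it needs privileges.

```python
import array
import os
import socket


def send_fd(chan, fd):
    """Send one fd over a unix socket as SCM_RIGHTS ancillary data."""
    chan.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                           array.array("i", [fd]))])


def recv_fd(chan):
    """Receive one fd sent with send_fd()."""
    fds = array.array("i")
    _, ancdata, _, _ = chan.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[:fds.itemsize])
    return fds[0]


def demo_pass_listen_fd():
    # Step 1: the "CNI plugin" opens an abstract listen socket.
    name = b"\0vhost-user-pod-demo-%d" % os.getpid()   # hypothetical name
    listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    listener.bind(name)
    listener.listen(1)

    # Step 2: hand the fd over a side channel (a socketpair stands in
    # for a unix socket owned by the switch daemon).
    plugin_end, switch_end = socket.socketpair(socket.AF_UNIX,
                                               socket.SOCK_STREAM)
    send_fd(plugin_end, listener.fileno())
    listener.close()                      # the plugin can exit now
    switch_listener = socket.socket(fileno=recv_fd(switch_end))

    # Step 3: the application learns the name from its environment
    # (a dict stands in for it; the variable name is made up).
    env = {"VHOST_USER_SOCKET": name}
    app = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    app.connect(env["VHOST_USER_SOCKET"])
    conn, _ = switch_listener.accept()
    app.sendall(b"ping")
    reply = conn.recv(4)
    for sock in (conn, app, switch_listener, plugin_end, switch_end):
        sock.close()
    return reply
```

The sketch only shows that the mechanism works once a side channel exists;
it does not address how that channel would be set up or re-established.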

A few more problems with this solution:

- We still want to run the application inside the container in server mode,
  because the virtio-user PMD in client mode doesn't support re-connection.

- How to get this fd again after an OVS restart?  CNI will not be invoked
  at this point to pass a new fd.

- If the application closes the connection for any reason (restart, some
  reconfiguration internal to the application) and OVS is restarted
  at the same time, the abstract socket will be gone.  Need a persistent daemon
  to hold it.

Best regards, Ilya Maximets.
