Thread (5 messages) 5 messages, 3 authors, 2021-07-08

Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

From: Stefan Hajnoczi <stefanha@redhat.com>
Date: 2021-07-08 09:06:26
Also in: kvm, linux-iommu, lkml, netdev, virtualization

Possibly related (same subject, not in this thread)

On Thu, Jul 08, 2021 at 12:17:56PM +0800, Jason Wang wrote:
在 2021/7/7 下午11:54, Stefan Hajnoczi 写道:
quoted
On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote:
quoted
在 2021/7/7 下午4:55, Stefan Hajnoczi 写道:
quoted
On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:
quoted
在 2021/7/7 上午1:11, Stefan Hajnoczi 写道:
quoted
On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:
quoted
On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi [off-list ref] wrote:
quoted
On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
quoted
在 2021/7/5 下午8:49, Stefan Hajnoczi 写道:
quoted
On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
quoted
在 2021/7/4 下午5:49, Yongji Xie 写道:
quoted
quoted
quoted
OK, I get you now. Since the VIRTIO specification says "Device
configuration space is generally used for rarely-changing or
initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
ioctl should not be called frequently.
The spec uses MUST and other terms to define the precise requirements.
Here the language (especially the word "generally") is weaker and means
there may be exceptions.

Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
approach is reads that have side-effects. For example, imagine a field
containing an error code if the device encounters a problem unrelated to
a specific virtqueue request. Reading from this field resets the error
code to 0, saving the driver an extra configuration space write access
and possibly race conditions. It isn't possible to implement those
semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
makes me think that the interface does not allow full VIRTIO semantics.
Note that though you're correct, my understanding is that config space is
not suitable for this kind of error propagating. And it would be very hard
to implement such kind of semantic in some transports.  Virtqueue should be
much better. As Yong Ji quoted, the config space is used for
"rarely-changing or intialization-time parameters".

quoted
Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
handle the message failure, I'm going to add a return value to
virtio_config_ops.get() and virtio_cread_* API so that the error can
be propagated to the virtio device driver. Then the virtio-blk device
driver can be modified to handle that.

Jason and Stefan, what do you think of this way?
Why does VDUSE_DEV_GET_CONFIG need to support an error return value?

The VIRTIO spec provides no way for the device to report errors from
config space accesses.

The QEMU virtio-pci implementation returns -1 from invalid
virtio_config_read*() and silently discards virtio_config_write*()
accesses.

VDUSE can take the same approach with
VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
quoted
I'd like to stick to the current assumption thich get_config won't fail.
That is to say,

1) maintain a config in the kernel, make sure the config space read can
always succeed
2) introduce an ioctl for the vduse usersapce to update the config space.
3) we can synchronize with the vduse userspace during set_config

Does this work?
I noticed that caching is also allowed by the vhost-user protocol
messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
know whether or not caching is in effect. The interface you outlined
above requires caching.

Is there a reason why the host kernel vDPA code needs to cache the
configuration space?
Because:

1) Kernel can not wait forever in get_config(), this is the major difference
with vhost-user.
virtio_cread() can sleep:

     #define virtio_cread(vdev, structname, member, ptr)                     \
             do {                                                            \
                     typeof(((structname*)0)->member) virtio_cread_v;        \
                                                                             \
                     might_sleep();                                          \
                     ^^^^^^^^^^^^^^

Which code path cannot sleep?
Well, it can sleep but it can't sleep forever. For VDUSE, a
buggy/malicious userspace may refuse to respond to the get_config.

It looks to me the ideal case, with the current virtio spec, for VDUSE is to

1) maintain the device and its state in the kernel, userspace may sync
with the kernel device via ioctls
2) offload the datapath (virtqueue) to the userspace

This seems more robust and safe than simply relaying everything to
userspace and waiting for its response.

And we know for sure this model can work, an example is TUN/TAP:
netdevice is abstracted in the kernel and datapath is done via
sendmsg()/recvmsg().

Maintaining the config in the kernel follows this model and it can
simplify the device generation implementation.

For config space write, it requires more thought but fortunately it's
not commonly used. So VDUSE can choose to filter out the
device/features that depends on the config write.
This is the problem. There are other messages like SET_FEATURES where I
guess we'll face the same challenge.
Probably not, userspace device can tell the kernel about the device_features
and mandated_features during creation, and the feature negotiation could be
done purely in the kernel without bothering the userspace.
(For some reason I drop the list accidentally, adding them back, sorry)

quoted
Sorry, I confused the messages. I meant SET_STATUS. It's a synchronous
interface where the driver waits for the device.
It depends on how we define "synchronous" here. If I understand correctly,
the spec doesn't expect there will be any kind of failure for the operation
of set_status itself.

Instead, anytime it want any synchronization, it should be done via
get_status():

1) re-read device status to make sure FEATURES_OK is set during feature
negotiation
2) re-read device status to be 0 to make sure the device has finish the
reset

quoted
VDUSE currently doesn't wait for the device emulation process to handle
this message (no reply is needed) but I think this is a mistake because
VDUSE is not following the VIRTIO device model.
With the trick that is done for FEATURES_OK above, I think we don't need to
wait for the reply.

If userspace takes too long to respond, it can be detected since
get_status() doesn't return the expected value for long time.

And for the case that needs a timeout, we probably can use NEEDS_RESET.
I think you're right. get_status is the synchronization point, not
set_status.

Currently there is no VDUSE GET_STATUS message. The
VDUSE_START/STOP_DATAPLANE messages could be changed to SET_STATUS so
that the device emulation program can participate in emulating the
Device Status field.

I'm not sure I get this, but it is what has been done?

+static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
+{
+    struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+    bool started = !!(status & VIRTIO_CONFIG_S_DRIVER_OK);
+
+    dev->status = status;
+
+    if (dev->started == started)
+        return;
+
+    dev->started = started;
+    if (dev->started) {
+        vduse_dev_start_dataplane(dev);
+    } else {
+        vduse_dev_reset(dev);
+        vduse_dev_stop_dataplane(dev);
+    }
+}


But the looks not correct:

1) !DRIVER_OK doesn't means a reset?
2) Need to deal with FEATURES_OK
I'm not sure if this reply was to me or to Yongji Xie?

Currently vduse_vdpa_set_status() does not allow the device emulation
program to participate fully in Device Status field changes. It hides
the status bits and only sends VDUSE_START/STOP_DATAPLANE.

I suggest having GET_STATUS/SET_STATUS messages instead, allowing the
device emulation program to handle these parts of the VIRTIO device
model (e.g. rejecting combinations of features that are mutually
exclusive).

Stefan

Attachments

Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help