Re: [PATCH 3/4] virtio_ring: introduce a per virtqueue waitqueue

From: Jason Wang <jasowang@redhat.com>
Date: 2023-01-31 03:26:09
Also in: lkml, virtualization

On Mon, Jan 30, 2023 at 7:18 PM Michael S. Tsirkin [off-list ref] wrote:

On Mon, Jan 30, 2023 at 03:44:24PM +0800, Jason Wang wrote:

quoted

On Mon, Jan 30, 2023 at 1:43 PM Michael S. Tsirkin [off-list ref] wrote:

quoted

On Mon, Jan 30, 2023 at 10:53:54AM +0800, Jason Wang wrote:

quoted

On Sun, Jan 29, 2023 at 3:30 PM Michael S. Tsirkin [off-list ref] wrote:

quoted

On Sun, Jan 29, 2023 at 01:48:49PM +0800, Jason Wang wrote:

quoted

On Fri, Jan 27, 2023 at 6:35 PM Michael S. Tsirkin [off-list ref] wrote:

quoted

On Fri, Dec 30, 2022 at 11:43:08AM +0800, Jason Wang wrote:

quoted

On Thu, Dec 29, 2022 at 4:10 PM Michael S. Tsirkin [off-list ref] wrote:

quoted

On Thu, Dec 29, 2022 at 04:04:13PM +0800, Jason Wang wrote:

quoted

On Thu, Dec 29, 2022 at 3:07 PM Michael S. Tsirkin [off-list ref] wrote:

quoted

On Wed, Dec 28, 2022 at 07:53:08PM +0800, Jason Wang wrote:

quoted

On Wed, Dec 28, 2022 at 2:34 PM Jason Wang [off-list ref] wrote:

quoted


在 2022/12/27 17:38, Michael S. Tsirkin 写道:

quoted

On Tue, Dec 27, 2022 at 05:12:58PM +0800, Jason Wang wrote:

quoted

在 2022/12/27 15:33, Michael S. Tsirkin 写道:

quoted

On Tue, Dec 27, 2022 at 12:30:35PM +0800, Jason Wang wrote:

quoted

But device is still going and will later use the buffers.

Same for timeout really.

Avoiding infinite wait/poll is one of the goals, another is to sleep.
If we think the timeout is hard, we can start from the wait.

Thanks

If the goal is to avoid disrupting traffic while CVQ is in use,
that sounds more reasonable. E.g. someone is turning on promisc,
a spike in CPU usage might be unwelcome.

Yes, this would be more obvious is UP is used.

quoted

things we should be careful to address then:
1- debugging. Currently it's easy to see a warning if CPU is stuck
     in a loop for a while, and we also get a backtrace.
     E.g. with this - how do we know who has the RTNL?
     We need to integrate with kernel/watchdog.c for good results
     and to make sure policy is consistent.

That's fine, will consider this.

So after some investigation, it seems the watchdog.c doesn't help. The
only export helper is touch_softlockup_watchdog() which tries to avoid
triggering the lockups warning for the known slow path.

I never said you can just use existing exporting APIs. You'll have to
write new ones :)

Ok, I thought you wanted to trigger similar warnings as a watchdog.

Btw, I wonder what kind of logic you want here. If we switch to using
sleep, there won't be soft lockup anymore. A simple wait + timeout +
warning seems sufficient?

Thanks

I'd like to avoid need to teach users new APIs. So watchdog setup to apply
to this driver. The warning can be different.

Right, so it looks to me the only possible setup is the
watchdog_thres. I plan to trigger the warning every watchdog_thres * 2
second (as softlockup did).

And I think it would still make sense to fail, we can start with a
very long timeout like 1 minutes and break the device. Does this make
sense?

Thanks

I'd say we need to make this manageable then.

Did you mean something like sysfs or module parameters?

No I'd say pass it with an ioctl.

quoted

Can't we do it normally
e.g. react to an interrupt to return to userspace?

I didn't get the meaning of this. Sorry.

Thanks

Standard way to handle things that can timeout and where userspace
did not supply the time is to block until an interrupt
then return EINTR.

Well this seems to be a huge change, ioctl(2) doesn't say it can
return EINTR now.

the one on fedora 37 does not but it says:
       No single standard.  Arguments, returns, and semantics of ioctl() vary according to the device driver in question (the call  is
       used as a catch-all for operations that don't cleanly fit the UNIX stream I/O model).

so it depends on the device e.g. for a streams device it does:
https://pubs.opengroup.org/onlinepubs/9699919799/functions/ioctl.html
has EINTR.

Ok, I saw signal(7) also mention about EINTR for ioctl(2):

"""
       If  a  blocked call to one of the following interfaces is
interrupted by a signal handler, then the call is automatically
restarted after the signal handler re‐
       turns if the SA_RESTART flag was used; otherwise the call fails
with the error EINTR:

       * read(2), readv(2), write(2), writev(2), and ioctl(2) calls on
"slow" devices.  A "slow" device is one where the I/O call may block
for an indefinite time, for
         example,  a  terminal,  pipe, or socket.  If an I/O call on a
slow device has already transferred some data by the time it is
interrupted by a signal handler,
         then the call will return a success status (normally, the
number of bytes transferred).  Note that a (local) disk is not a slow
device according to this defi‐
         nition; I/O operations on disk devices are not interrupted by signals.
"""


And note that if you interrupt then you don't know whether ioctl
changed device state or not generally.

Yes.

quoted

Actually, a driver timeout is used by other drivers when using
controlq/adminq (e.g i40e). Starting from a sane value (e.g 1 minutes
to avoid false negatives) seems to be a good first step.

Well because it's specific hardware so timeout matches what it can
promise.  virtio spec does not give guarantees.  One issue is with
software implementations. At the moment I can set a breakpoint in qemu
or vhost user backend and nothing bad happens in just continues.

Yes but it should be no difference from using a kgdb to debug i40e drivers.

Except one of the reasons people prefer programming in userspace is
because debugging is so much less painful. Someone using kgdb
knows what driver is doing and can work around that.

Ok.

quoted

Userspace controls the timeout by
using e.g. alarm(2).

Not used in iproute2 after a git grep.

Thanks

No need for iproute2 to do it user can just do it from shell. Or user can just press CTRL-C.

Yes, but iproute2 needs to deal with EINTR, that is the challenge
part, if we simply return an error, the next cvq command might get
confused.

Thanks

You mean this:
        start command
        interrupt
        start next command

?

next command is confused?
I think if you try a new command until previous
one finished it's ok to just return EBUSY.

That would be fine.

And we go back to somehow the idea here:

https://lore.kernel.org/all/CACGkMEvQwhOhgGW6F22+3vmR4AW90qYXF+ZO6BQZguUF2xt2SA@mail.gmail.com/T/#m2da63932eae775d7d05d93d44c2f1d115ffbcefe (local)

Will try to do that in the next version.

Thanks

quoted

And before the patch, we end up with a real infinite loop which could
be caught by RCU stall detector which is not the case of the sleep.
What we can do is probably do a periodic netdev_err().

Thanks

Only with a bad device.

quoted

2- overhead. In a very common scenario when device is in hypervisor,
     programming timers etc has a very high overhead, at bootup
     lots of CVQ commands are run and slowing boot down is not nice.
     let's poll for a bit before waiting?

Then we go back to the question of choosing a good timeout for poll. And
poll seems problematic in the case of UP, scheduler might not have the
chance to run.

Poll just a bit :) Seriously I don't know, but at least check once
after kick.


I think it is what the current code did where the condition will be
check before trying to sleep in the wait_event().

quoted

3- suprise removal. need to wake up thread in some way. what about
     other cases of device breakage - is there a chance this
     introduces new bugs around that? at least enumerate them please.

The current code did:

1) check for vq->broken
2) wakeup during BAD_RING()

So we won't end up with a never woke up process which should be fine.

Thanks

BTW BAD_RING on removal will trigger dev_err. Not sure that is a good
idea - can cause crashes if kernel panics on error.


Yes, it's better to use __virtqueue_break() instead.

But consider we will start from a wait first, I will limit the changes
in virtio-net without bothering virtio core.

Thanks

quoted

quoted
quoted

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help