Re: [PATCH net] virtio-net: enable all napis before scheduling refill work
From: Jason Wang <jasowang@redhat.com>
Date: 2025-12-09 04:30:34
Also in:
bpf, lkml, virtualization
On Mon, Dec 8, 2025 at 11:35 PM Bui Quang Minh [off-list ref] wrote:
quoted hunk ↗ jump to hunk
Calling napi_disable() on an already disabled napi can cause the deadlock. In commit 4bc12818b363 ("virtio-net: disable delayed refill when pausing rx"), to avoid the deadlock, when pausing the RX in virtnet_rx_pause[_all](), we disable and cancel the delayed refill work. However, in the virtnet_rx_resume_all(), we enable the delayed refill work too early before enabling all the receive queue napis. The deadlock can be reproduced by running selftests/drivers/net/hw/xsk_reconfig.py with multiqueue virtio-net device and inserting a cond_resched() inside the for loop in virtnet_rx_resume_all() to increase the success rate. Because the worker processing the delayed refilled work runs on the same CPU as virtnet_rx_resume_all(), a reschedule is needed to cause the deadlock. In real scenario, the contention on netdev_lock can cause the reschedule. This fixes the deadlock by ensuring all receive queue's napis are enabled before we enable the delayed refill work in virtnet_rx_resume_all() and virtnet_open(). Fixes: 4bc12818b363 ("virtio-net: disable delayed refill when pausing rx") Reported-by: Paolo Abeni <pabeni@redhat.com> Closes: https://netdev-ctrl.bots.linux.dev/logs/vmksft/drv-hw-dbg/results/400961/3-xdp-py/stderr Signed-off-by: Bui Quang Minh <redacted> --- drivers/net/virtio_net.c | 59 +++++++++++++++++++--------------------- 1 file changed, 28 insertions(+), 31 deletions(-)diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 8e04adb57f52..f2b1ea65767d 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c@@ -2858,6 +2858,20 @@ static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq, return err != -ENOMEM; } +static void virtnet_rx_refill_all(struct virtnet_info *vi) +{ + bool schedule_refill = false; + int i; + + enable_delayed_refill(vi);
This seems to be still racy?
For example, in virtnet_open() we had:
static int virtnet_open(struct net_device *dev)
{
struct virtnet_info *vi = netdev_priv(dev);
int i, err;
for (i = 0; i < vi->max_queue_pairs; i++) {
err = virtnet_enable_queue_pair(vi, i);
if (err < 0)
goto err_enable_qp;
}
virtnet_rx_refill_all(vi);
So NAPI and refill work is enabled in this case, so the refill work
could be scheduled and run at the same time?
Thanks