Thread (12 messages) 12 messages, 3 authors, 2023-08-03

Re: [PATCH] vdpa/mlx5: Fix crash on shutdown for when no ndev exists

From: Dragos Tatulea via Virtualization <hidden>
Date: 2023-08-02 07:59:39
Also in: lkml, stable

On Wed, 2023-08-02 at 10:51 +0800, Jason Wang wrote:
On Tue, Aug 1, 2023 at 4:17 PM Dragos Tatulea [off-list ref] wrote:
quoted
On Tue, 2023-08-01 at 11:59 +0800, Jason Wang wrote:
quoted
On Mon, Jul 31, 2023 at 5:08 PM Michael S. Tsirkin [off-list ref] wrote:
quoted
On Mon, Jul 31, 2023 at 07:15:31AM +0000, Dragos Tatulea wrote:
quoted
On Thu, 2023-07-27 at 12:28 -0400, Michael S. Tsirkin wrote:
quoted
On Thu, Jul 27, 2023 at 04:02:16PM +0000, Dragos Tatulea wrote:
quoted
On Wed, 2023-07-26 at 15:26 -0400, Michael S. Tsirkin wrote:
quoted
On Wed, Jul 26, 2023 at 10:07:38PM +0300, Dragos Tatulea wrote:
quoted
The ndev was accessed on shutdown without a check if it
actually
exists.
This triggered the crash pasted below. This patch simply adds
a
check
before using ndev.

 BUG: kernel NULL pointer dereference, address:
0000000000000300
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] SMP
 CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 6.5.0-
rc2_for_upstream_min_debug_2023_07_17_15_05 #1
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-
1.13.0-0-
gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 RIP: 0010:mlx5v_shutdown+0xe/0x50 [mlx5_vdpa]
 RSP: 0018:ffff8881003bfdc0 EFLAGS: 00010286
 RAX: ffff888103befba0 RBX: ffff888109d28008 RCX:
0000000000000017
 RDX: 0000000000000001 RSI: 0000000000000212 RDI:
ffff888109d28000
 RBP: 0000000000000000 R08: 0000000d3a3a3882 R09:
0000000000000001
 R10: 0000000000000000 R11: 0000000000000000 R12:
ffff888109d28000
 R13: ffff888109d28080 R14: 00000000fee1dead R15:
0000000000000000
 FS:  00007f4969e0be40(0000) GS:ffff88852c800000(0000)
knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000300 CR3: 00000001051cd006 CR4:
0000000000370eb0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
 Call Trace:
  <TASK>
  ? __die+0x20/0x60
  ? page_fault_oops+0x14c/0x3c0
  ? exc_page_fault+0x75/0x140
  ? asm_exc_page_fault+0x22/0x30
  ? mlx5v_shutdown+0xe/0x50 [mlx5_vdpa]
  device_shutdown+0x13e/0x1e0
  kernel_restart+0x36/0x90
  __do_sys_reboot+0x141/0x210
  ? vfs_writev+0xcd/0x140
  ? handle_mm_fault+0x161/0x260
  ? do_writev+0x6b/0x110
  do_syscall_64+0x3d/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
 RIP: 0033:0x7f496990fb56
 RSP: 002b:00007fffc7bdde88 EFLAGS: 00000206 ORIG_RAX:
00000000000000a9
 RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
00007f496990fb56
 RDX: 0000000001234567 RSI: 0000000028121969 RDI:
fffffffffee1dead
 RBP: 00007fffc7bde1d0 R08: 0000000000000000 R09:
0000000000000000
 R10: 0000000000000000 R11: 0000000000000206 R12:
0000000000000000
 R13: 00007fffc7bddf10 R14: 0000000000000000 R15:
00007fffc7bde2b8
  </TASK>
 CR2: 0000000000000300
 ---[ end trace 0000000000000000 ]---

Fixes: bc9a2b3e686e ("vdpa/mlx5: Support interrupt bypassing")
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
---
 drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 9138ef2fb2c8..e2e7ebd71798 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -3556,7 +3556,8 @@ static void mlx5v_shutdown(struct
auxiliary_device
*auxdev)
        mgtdev = auxiliary_get_drvdata(auxdev);
        ndev = mgtdev->ndev;

-       free_irqs(ndev);
+       if (ndev)
+               free_irqs(ndev);
 }
something I don't get:
irqs are allocated in mlx5_vdpa_dev_add
why are they not freed in mlx5_vdpa_dev_del?
That is a good point. I will try to find out. I also don't get why
free_irq
is
called in the vdpa dev .free op instead of mlx5_vdpa_dev_del.
Maybe I
can
change
that in a different refactoring.
as it is I have no idea whether e.g. ndev can change
between these two call sites. that would make the check
pointless.
quoted
quoted
this is what's creating all this mess.
Not quite: mlx5_vdpa_dev_del (which is a .dev_del of for struct
vdpa_mgmtdev_ops) doesn't get called on shutdown. At least that's
what
I
see. Or
am I missing something?
and why do we care whether irqs are freed on shutdown?
Had to ask around a bit to find out the answer: there can be issues
with
kexec
IRQ allocation on some platforms. It is documented here [0] for
mlx5_core.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlx5/core/main.c#n2129

Thanks,
Dragos
It's quite weird.
         * Some platforms requiring freeing the IRQ's in the shutdown
         * flow. If they aren't freed they can't be allocated after
         * kexec. There is no need to cleanup the mlx5_core software
         * contexts.

but most drivers don't have a shutdown callback how do they work then?
do you know which platforms these are?
I don't. x86_64 is not one of them though. I will do some more digging ...
Turns out that this fix (releasing the irqs on .shutdown on mlx5_core) was
required for PPC arch but only for certain mainframe systems. That's all the
info I could find.
quoted
quoted
There used to be bzs that requires virtio drivers to add a shutdown to
fix kexec:

https://bugzilla.redhat.com/show_bug.cgi?id=2108406
I don't have access to this. What is it about?
This bug might be more accurate:

https://bugzilla.redhat.com/show_bug.cgi?id=1820521

It's about the kexec guys (cced relevant people) wanting to add a
shutdown method for virito to fix potential kexec issues.

Thanks
quoted
Thanks,
Dragos
quoted
Thanks
quoted
I don't really know much about why shutdown callback is even necessary.
I guess this is to detect shutdown and do a faster cleanup than
the slow, graceful removal, just cleaning hardware resources?
.shutdown could be removed in mlx5_vdpa. But I notice that mlx5_core's .shutdown
kicks in from pci_device_shutdown to clean the irqs. So the irqs will still be
freed but as a side effect. Which is not good.

Thanks,
Dragos
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help