RE: [PATCH v3 2/6] PCI: hv: Fix a race condition in hv_irq_unmask() that can... | netdev

RE: [PATCH v3 2/6] PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic

From: Dexuan Cui <decui@microsoft.com>
Date: 2023-06-15 04:27:56
Also in: linux-hyperv, linux-pci, linux-rdma, lkml, stable

From: Lorenzo Pieralisi <lpieralisi@kernel.org>
Sent: Thursday, May 25, 2023 3:16 AM

quoted

--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c

@@ -643,6 +643,11 @@ static void hv_arch_irq_unmask(struct irq_data
 *data)
 	pbus = pdev->bus;
 	hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
 	int_desc = data->chip_data;
+	if (!int_desc) {
+		dev_warn(&hbus->hdev->device, "%s() can not unmask irq %u\n",
+			 __func__, data->irq);
+		return;
+	}

That's a check that should be there regardless ?

Yes. Normally data->chip_data is set at the end of hv_compose_msi_msg(),
and it should not be NULL. However, in rare circumstances, we might see a
failure in hv_compose_msi_msg(), e.g. the hypervisor/host might return an
error in comp.comp_pkt.completion_status (at least this is possible in theory).

In case we see a failure in hv_compose_msi_msg(), data->chip_data stays
with its default value of NULL; because the return type of
hv_compose_msi_msg() is "void", we can not return an error to the upper
layer; later, when the upper layer calls request_irq() -> ... -> hv_irq_unmask(),
hv_arch_irq_unmask() crashes because data->chip_data is NULL -- with this
check, we're able to error out gracefully, and the user can better understand
what goes wrong.

quoted

 	spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);

@@ -1911,12 +1916,6 @@ static void hv_compose_msi_msg(struct

irq_data *data, struct msi_msg *msg)

quoted

 		hv_pci_onchannelcallback(hbus);
 		spin_unlock_irqrestore(&channel->sched_lock, flags);

-		if (hpdev->state == hv_pcichild_ejecting) {
-			dev_err_once(&hbus->hdev->device,
-				     "the device is being ejected\n");
-			goto enable_tasklet;
-		}
-
 		udelay(100);
 	}

I don't understand why this code is in hv_compose_msi_msg() in the first
place (and why only in that function ?) to me this looks like you are
adding plasters in the code that can turn out to be problematic while
ejecting a device, this does not seem robust at all - that's my opinion.

The code was incorrectly added by
de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in hv_compose_msi_msg()")

de0aa7b2f97d says

"
2. If the host is ejecting the VF device before we reach
    hv_compose_msi_msg(), in a UP VM, we can hang in hv_compose_msi_msg()
    forever, because at this time the host doesn't respond to the
    CREATE_INTERRUPT request.
"

de0aa7b2f97d implies that the host doesn't respond to the guest's
CREATE_INTERRUPT request once the guest receives the PCI_EJECT message -- this
is incorrect: after the guest receives the PCI_EJECT message, actually the host
still responds to the guest's request, as long as the guest sends the request within
1 minute AND the guest doesn't send a PCI_EJECTION_COMPLETE message to
the host in hv_eject_device_work(). The real issue is that currently the guest
can send PCI_EJECTION_COMPLETE to the host before the guest finishes the
device-probing/removing handling -- once the guest sends PCI_EJECTION_COMPLETE,
the host unassigns the PCI device from the guest and ignores any request
from the guest.

So here the check "hpdev->state == hv_pcichild_ejecting" is incorrect. We
should remove the check since it can cause a panic (see the commit messsage
for the detailed explanation)

The "premature PCI_EJECTION_COMPLETE" issue is resolved by:
[PATCH v3 5/6] PCI: hv: Add a per-bus mutex state_lock

Feel free to merge this code, I can't ACK it, sorry.

Lorenzo

Thanks for sharing the thougths!

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help