Re: [PATCH] pci-driver: Add driver load messages
From: Prarit Bhargava <hidden>
Date: 2021-03-04 14:45:08
Also in:
linux-pci
On 2/18/21 2:06 PM, Bjorn Helgaas wrote:
On Thu, Feb 18, 2021 at 01:36:35PM -0500, Prarit Bhargava wrote:quoted
On 1/26/21 10:12 AM, Bjorn Helgaas wrote:quoted
On Tue, Jan 26, 2021 at 09:05:23AM -0500, Prarit Bhargava wrote:quoted
On 1/26/21 8:53 AM, Leon Romanovsky wrote:quoted
On Tue, Jan 26, 2021 at 08:42:12AM -0500, Prarit Bhargava wrote:quoted
On 1/26/21 8:14 AM, Leon Romanovsky wrote:quoted
On Tue, Jan 26, 2021 at 07:54:46AM -0500, Prarit Bhargava wrote:quoted
Leon Romanovsky [off-list ref] wrote:quoted
On Mon, Jan 25, 2021 at 02:41:38PM -0500, Prarit Bhargava wrote:quoted
There are two situations where driver load messages are helpful. 1) Some drivers silently load on devices and debugging driver or system failures in these cases is difficult. While some drivers (networking for example) may not completely initialize when the PCI driver probe() function has returned, it is still useful to have some idea of driver completion.Sorry, probably it is me, but I don't understand this use case. Are you adding global to whole kernel command line boot argument to debug what and when? During boot: If device success, you will see it in /sys/bus/pci/[drivers|devices]/*. If device fails, you should get an error from that device (fix the device to return an error), or something immediately won't work and you won't see it in sysfs.What if there is a panic during boot? There's no way to get to sysfs. That's the case where this is helpful.How? If you have kernel panic, it means you have much more worse problem than not-supported device. If kernel panic was caused by the driver, you will see call trace related to it. If kernel panic was caused by something else, supported/not supported won't help here.I still have no idea *WHICH* device it was that the panic occurred on.The kernel panic is printed from the driver. There is one driver loaded for all same PCI devices which are probed without relation to their number.> If you have host with ten same cards, you will see one driver and this is where the problem and not in supported/not-supported device.That's true, but you can also have different cards loading the same driver. See, for example, any PCI_IDs list in a driver. For example, 10:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] (rev 02) 20:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02) Both load the megaraid driver and have different profiles within the driver. I have no idea which one actually panicked until removing one card. It's MUCH worse when debugging new hardware and getting a panic from, for example, the uncore code which binds to a PCI mapped device. One device might work and the next one doesn't. And then you can multiply that by seeing *many* panics at once and trying to determine if the problem was on one specific socket, die, or core.Would a dev_panic() interface that identified the device and driver help with this?^^ the more I look at this problem, the more a dev_panic() that would output a device specific message at panic time is what I really need.
Bjorn,
I went down this road a bit and had a realization. The issue isn't with
printing something at panic time, but the *data* that is output. Each PCI
device is associated with a struct device. That device struct's name is output
for dev_dbg(), etc., commands. The PCI subsystem sets the device struct name at
drivers/pci/probe.c: 1799
dev_set_name(&dev->dev, "%04x:%02x:%02x.%d", pci_domain_nr(dev->bus),
dev->bus->number, PCI_SLOT(dev->devfn),
PCI_FUNC(dev->devfn));
My problem really is that the above information is insufficient when I (or a
user) need to debug a system. The complexities of debugging multiple broken
driver loads would be much easier if I didn't have to constantly add this output
manually :).
Would you be okay with adding a *debug* parameter to expand the device name to
include the vendor & device ID pair? FWIW, I'm somewhat against
yet-another-kernel-option but that's really the information I need. I could
then add dev_dbg() statements in the local_pci_probe() function.
Thoughts?
P.