Thread (14 messages) 14 messages, 3 authors, 2021-03-05

Re: [PATCH] pci-driver: Add driver load messages

From: Prarit Bhargava <hidden>
Date: 2021-03-04 14:45:08
Also in: linux-pci


On 2/18/21 2:06 PM, Bjorn Helgaas wrote:
On Thu, Feb 18, 2021 at 01:36:35PM -0500, Prarit Bhargava wrote:
quoted
On 1/26/21 10:12 AM, Bjorn Helgaas wrote:
quoted
On Tue, Jan 26, 2021 at 09:05:23AM -0500, Prarit Bhargava wrote:
quoted
On 1/26/21 8:53 AM, Leon Romanovsky wrote:
quoted
On Tue, Jan 26, 2021 at 08:42:12AM -0500, Prarit Bhargava wrote:
quoted
On 1/26/21 8:14 AM, Leon Romanovsky wrote:
quoted
On Tue, Jan 26, 2021 at 07:54:46AM -0500, Prarit Bhargava wrote:
quoted
  Leon Romanovsky [off-list ref] wrote:
quoted
On Mon, Jan 25, 2021 at 02:41:38PM -0500, Prarit Bhargava wrote:
quoted
There are two situations where driver load messages are helpful.

1) Some drivers silently load on devices and debugging driver or system
failures in these cases is difficult.  While some drivers (networking
for example) may not completely initialize when the PCI driver probe() function
has returned, it is still useful to have some idea of driver completion.
Sorry, probably it is me, but I don't understand this use case.
Are you adding global to whole kernel command line boot argument to debug
what and when?

During boot:
If device success, you will see it in /sys/bus/pci/[drivers|devices]/*.
If device fails, you should get an error from that device (fix the
device to return an error), or something immediately won't work and
you won't see it in sysfs.
What if there is a panic during boot?  There's no way to get to sysfs.
That's the case where this is helpful.
How? If you have kernel panic, it means you have much more worse problem
than not-supported device. If kernel panic was caused by the driver, you
will see call trace related to it. If kernel panic was caused by
something else, supported/not supported won't help here.
I still have no idea *WHICH* device it was that the panic occurred on.
The kernel panic is printed from the driver. There is one driver loaded
for all same PCI devices which are probed without relation to their
number.>
If you have host with ten same cards, you will see one driver and this
is where the problem and not in supported/not-supported device.
That's true, but you can also have different cards loading the same driver.
See, for example, any PCI_IDs list in a driver.

For example,

10:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] (rev 02)
20:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02)

Both load the megaraid driver and have different profiles within the
driver.  I have no idea which one actually panicked until removing
one card.

It's MUCH worse when debugging new hardware and getting a panic
from, for example, the uncore code which binds to a PCI mapped
device.  One device might work and the next one doesn't.  And
then you can multiply that by seeing *many* panics at once and
trying to determine if the problem was on one specific socket,
die, or core.
Would a dev_panic() interface that identified the device and
driver help with this?
^^ the more I look at this problem, the more a dev_panic() that
would output a device specific message at panic time is what I
really need.
Bjorn,

I went down this road a bit and had a realization.  The issue isn't with
printing something at panic time, but the *data* that is output.  Each PCI
device is associated with a struct device.  That device struct's name is output
for dev_dbg(), etc., commands.  The PCI subsystem sets the device struct name at
drivers/pci/probe.c: 1799

	        dev_set_name(&dev->dev, "%04x:%02x:%02x.%d", pci_domain_nr(dev->bus),
                     dev->bus->number, PCI_SLOT(dev->devfn),
                     PCI_FUNC(dev->devfn));

My problem really is that the above information is insufficient when I (or a
user) need to debug a system.  The complexities of debugging multiple broken
driver loads would be much easier if I didn't have to constantly add this output
manually :).

Would you be okay with adding a *debug* parameter to expand the device name to
include the vendor & device ID pair?  FWIW, I'm somewhat against
yet-another-kernel-option but that's really the information I need.  I could
then add dev_dbg() statements in the local_pci_probe() function.

Thoughts?

P.


Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help