Thread: 14 messages, 3 authors, 2021-03-05

Re: [PATCH] pci-driver: Add driver load messages

From: Prarit Bhargava <hidden>
Date: 2021-02-18 19:19:06
Also in: linux-pci


On 1/26/21 10:12 AM, Bjorn Helgaas wrote:
Hi Prarit,

On Tue, Jan 26, 2021 at 09:05:23AM -0500, Prarit Bhargava wrote:
On 1/26/21 8:53 AM, Leon Romanovsky wrote:
On Tue, Jan 26, 2021 at 08:42:12AM -0500, Prarit Bhargava wrote:
On 1/26/21 8:14 AM, Leon Romanovsky wrote:
On Tue, Jan 26, 2021 at 07:54:46AM -0500, Prarit Bhargava wrote:
  Leon Romanovsky [off-list ref] wrote:
On Mon, Jan 25, 2021 at 02:41:38PM -0500, Prarit Bhargava wrote:
There are two situations where driver load messages are helpful.

1) Some drivers silently load on devices and debugging driver or system
failures in these cases is difficult.  While some drivers (networking
for example) may not completely initialize when the PCI driver probe() function
has returned, it is still useful to have some idea of driver completion.
Sorry, probably it is me, but I don't understand this use case.
Are you adding a global, kernel-wide command-line boot argument? To debug
what, and when?

During boot:
If the device probes successfully, you will see it in /sys/bus/pci/[drivers|devices]/*.
If device fails, you should get an error from that device (fix the
device to return an error), or something immediately won't work and
you won't see it in sysfs.
What if there is a panic during boot?  There's no way to get to sysfs.
That's the case where this is helpful.
How? If you have a kernel panic, it means you have a much worse problem
than not-supported device. If kernel panic was caused by the driver, you
will see call trace related to it. If kernel panic was caused by
something else, supported/not supported won't help here.
I still have no idea *WHICH* device it was that the panic occurred on.
The kernel panic is printed from the driver. One driver is loaded
for all identical PCI devices, and they are probed regardless of how
many there are.
If you have a host with ten identical cards, you will see one driver, and
that is where the problem is, not in whether a device is supported.
That's true, but you can also have different cards loading the same driver.
See, for example, any PCI_IDs list in a driver.

For example,

10:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3008 [Fury] (rev 02)
20:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02)

Both load the megaraid driver and have different profiles within the
driver.  I have no idea which one actually panicked until removing
one card.
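
To illustrate how two different cards can end up in the same driver, here is a
minimal userspace sketch of an ID table, loosely modelled on the idea of a
driver's PCI ID list. Everything in it is hypothetical: the `example_*` names,
the profile strings, and the numeric IDs are placeholders and are not the real
MegaRAID values or any kernel API.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder IDs for illustration only; NOT real vendor/device IDs. */
#define EXAMPLE_VENDOR 0x1234
#define EXAMPLE_DEV_A  0x0001
#define EXAMPLE_DEV_B  0x0002

/* One table entry per supported device; all entries belong to one driver. */
struct example_id {
    uint16_t vendor;
    uint16_t device;
    const char *profile; /* per-device behaviour inside the single driver */
};

static const struct example_id example_ids[] = {
    { EXAMPLE_VENDOR, EXAMPLE_DEV_A, "profile-a" },
    { EXAMPLE_VENDOR, EXAMPLE_DEV_B, "profile-b" },
};

/* Return the matching table entry, or NULL if the driver does not claim it. */
static const struct example_id *example_match(uint16_t vendor, uint16_t device)
{
    size_t n = sizeof(example_ids) / sizeof(example_ids[0]);

    for (size_t i = 0; i < n; i++) {
        if (example_ids[i].vendor == vendor &&
            example_ids[i].device == device)
            return &example_ids[i];
    }
    return NULL;
}
```

Both placeholder devices match the same table, so a panic message that names
only the driver cannot say which of the two entries was being probed.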

It's MUCH worse when debugging new hardware and getting a panic
from, for example, the uncore code which binds to a PCI mapped
device.  One device might work and the next one doesn't.  And then
you can multiply that by seeing *many* panics at once and trying to
determine if the problem was on one specific socket, die, or core.
Would a dev_panic() interface that identified the device and driver
help with this?
^^ the more I look at this problem, the more a dev_panic() that would output a
device specific message at panic time is what I really need.
For driver_load_messages, it doesn't seem necessarily PCI-specific.
If we want a message like that, maybe it could be in
driver_probe_device() or similar?  There are already a few pr_debug()
calls in that path.  There are some enabled by initcall_debug that
include the return value from the probe; would those be close to what
you're looking for?
I took a look at those, and unfortunately they do not meet my requirements.
Ultimately, at panic time, I need to know that a driver was loaded on a device
at a specific location in the PCI space.

The driver_probe_device() pr_debug calls tell me the location and the driver,
but not anything to uniquely identify the device (i.e., the PCI vendor and device
IDs).
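
As a rough illustration of the information described above (driver, location
in PCI space, and vendor/device IDs), here is a minimal userspace sketch of
what a device-specific panic message could contain. No dev_panic() interface is
defined in this thread; the struct, the function name, and the message layout
below are all assumptions made for the sake of the example.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the fields such a message would need;
 * this is not a kernel structure. */
struct example_pci_dev {
    uint16_t domain;
    uint8_t bus;
    uint8_t slot;
    uint8_t func;
    uint16_t vendor;
    uint16_t device;
    const char *driver;
};

/* Format "<driver> <domain:bus:slot.func> [vendor:device]: <msg>" into buf.
 * Returns the snprintf() result (length that would have been written). */
static int example_format_dev_panic(char *buf, size_t len,
                                    const struct example_pci_dev *d,
                                    const char *msg)
{
    return snprintf(buf, len, "%s %04x:%02x:%02x.%x [%04x:%04x]: %s",
                    d->driver,
                    (unsigned int)d->domain, (unsigned int)d->bus,
                    (unsigned int)d->slot, (unsigned int)d->func,
                    (unsigned int)d->vendor, (unsigned int)d->device,
                    msg);
}
```

With a line like this emitted at panic time, two cards bound to the same driver
would be distinguishable by both their location and their IDs.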

It sounds like you've had some thoughts about a dev_panic() implementation.
Care to share them with me?  I'm more than willing to implement it but just want
to get your more experienced view of what is needed.

P.
Bjorn
  