Re: [PATCH net-next 0/5][pull request] add v2 FW logging for ice driver

From: Jacob Keller <jacob.e.keller@intel.com>
Date: 2023-02-14 22:39:29


On 2/14/2023 8:14 AM, Paul M Stillwell Jr wrote:

On 2/13/2023 4:40 PM, Jakub Kicinski wrote:

quoted

On Mon, 13 Feb 2023 15:46:53 -0800 Paul M Stillwell Jr wrote:

quoted

On 2/10/2023 8:23 PM, Jakub Kicinski wrote:

quoted

Can you describe how this is used a little bit?
The FW log is captured at some level always (e.g. warns)
or unless user enables _nothing_ will come out?

My understanding is that the FW is constantly logging data into internal
buffers. When the user indicates what data they want and what level they
want then the data is filtered and output via either the UART or the
Admin queues. These patches retrieve the FW logs via the admin queue
commands.

What's the trigger to perform the collection?

If it's some error condition / assert in FW then maybe it's worth
wrapping it up (or at least some portion of the functionality) into
devlink health?

The trigger is the user asking to collect the FW logs. There isn't 
anything within the FW that triggers the logging; generally there is 
some issue on the user side and we think there may be some issue in the 
FW or that FW can provide more info on what is going on so we request FW 
logs. As an example, sometimes users report issues with link flap and we 
request FW logs to see what the FW link management code thinks is 
happening. In this example there is no "error" per se, but the user is 
seeing some undesired behavior and we are looking for more information 
on what could be going on.

quoted

AFAIU the purpose of devlink health is exactly to bubble up to the host
asserts / errors / crashes in the FW, with associated "dump".

Maybe it is, but when I look at devlink health it doesn't seem like it 
is designed for something like this. It looks like (based on my reading 
of the documentation) that it responds to errors from the device; that's 
not really what is happening in our case. The user is seeing some 
behavior that they don't like and we are asking the FW to shed some 
light on what the FW thinks is happening.

Link flap is an excellent example of this. The FW is doing what it 
believes to be the correct thing, but due to some change on the link 
partner that the FW doesn't handle correctly then there is some issue. 
This is a classic bug, the code thinks it's doing the correct thing and 
in reality it is not.

In the above example nothing on the device is reporting an error so I 
don't see how the health reporter would get triggered.

Also, devlink health seems like it is geared towards a model of the 
device has an error, the error gets reported to the driver, the driver 
gets some info to report to the user, and the driver moves on. The FW 
logging is different from that in that we want to see data across a long 
period of time generally because we can't always pinpoint the time that 
the thing we want to see happened.

quoted

The output from the FW is a binary blob that a user would send back to
Intel to be decoded. This is only used for troubleshooting issues where
a user is working with someone from Intel on a specific problem.

I believe that's in line with devlink health. The devlink health log
is "formatted" but I really doubt that any user can get far in debugging
without vendor support.

I agree, I just don't see what the trigger is in our case for FW logging.

Here's the thoughts I had for devlink health:

1) support health reporters storing more than a single event. Currently
all health reporters respond to a single event and then do not allow
storing new captures until the current one is processed. This breaks for
our firmware logging because we get separate events from firmware for
each buffer of messages. We could make this configurable such that we
limit the total maximum to prevent kernel memory overrun. (and some
policy for how to discard events when the buffer is full?)

2a) add some knobs to enable/disable a health reporter

2b) add some firmware logging specific knobs as a "build on top of
health reporters" or by creating a separate firmware logging bit that
ties into a reporter. These knows would be how to set level, etc.

3) for ice, once the health reporter is enabled we request the firmware
to send us logging, then we get our admin queue message and simply copy
this into the health reporter as a new event

4) user space is in charge of monitoring health reports and can decide
how to copy events out to disk and when to delete the health reports
from the kernel.

Basically: extend health reporters to allow multiple captures and add a
related module to configure firmware logging via a health reporter,
where the "event" is just "I have a new blob to store".

How does this sound?

For the specifics of 2b) I think we can probably agree that levels is
fairly generic (i.e. the specifics of what each level are is vendor
specific but the fact that there are numbers and that higher or lower
numbers means more severe is fairly standard)

I know the ice firmware has many such modules we can enable or disable
and we would ideally be able to set which modules are active or not.
However all messages come through in the same blobs so we can't separate
them and report them to individual health reporter events. I think we
could have modules as a separate option for toggling which ones are on
or off. I would expect other vendors to have something similar or have
no modules at all and just an on/off switch?

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help