Thread (17 messages) 17 messages, 4 authors, 2023-02-15

Re: [PATCH net-next 0/5][pull request] add v2 FW logging for ice driver

From: Jakub Kicinski <kuba@kernel.org>
Date: 2023-02-14 23:19:18

On Tue, 14 Feb 2023 14:39:18 -0800 Jacob Keller wrote:
On 2/14/2023 8:14 AM, Paul M Stillwell Jr wrote:
quoted
quoted
I believe that's in line with devlink health. The devlink health log
is "formatted" but I really doubt that any user can get far in debugging
without vendor support.
 
I agree, I just don't see what the trigger is in our case for FW logging.
  
Here's the thoughts I had for devlink health:

1) support health reporters storing more than a single event. Currently
all health reporters respond to a single event and then do not allow
storing new captures until the current one is processed. This breaks for
our firmware logging because we get separate events from firmware for
each buffer of messages. We could make this configurable such that we
limit the total maximum to prevent kernel memory overrun. (and some
policy for how to discard events when the buffer is full?)
I think the idea is that the system keeps a continuous ring buffer of
logs and dumps it whenever bad events happens. That's in case of logs,
obviously you can expose other types of state with health.
2a) add some knobs to enable/disable a health reporter
For ad-hoc collection of the ring buffer there are dump and diagnose
callbacks which can be triggered at any time.
2b) add some firmware logging specific knobs as a "build on top of
health reporters" or by creating a separate firmware logging bit that
ties into a reporter. These knows would be how to set level, etc.
Right, the level setting is the part that I'm the least sure of.
That sounds like something more fitting to ethtool dumps.
3) for ice, once the health reporter is enabled we request the firmware
to send us logging, then we get our admin queue message and simply copy
this into the health reporter as a new event

4) user space is in charge of monitoring health reports and can decide
how to copy events out to disk and when to delete the health reports
from the kernel.
That's also out of what's expected with health reporters. User should
not have to run vendor tools with devlink health. Decoding of the dump
may require vendor tools but checking if system is healthy or something
crashed should happen without any user space involvement.
Basically: extend health reporters to allow multiple captures and add a
related module to configure firmware logging via a health reporter,
where the "event" is just "I have a new blob to store".

How does this sound?

For the specifics of 2b) I think we can probably agree that levels is
fairly generic (i.e. the specifics of what each level are is vendor
specific but the fact that there are numbers and that higher or lower
numbers means more severe is fairly standard)

I know the ice firmware has many such modules we can enable or disable
and we would ideally be able to set which modules are active or not.
However all messages come through in the same blobs so we can't separate
them and report them to individual health reporter events. I think we
could have modules as a separate option for toggling which ones are on
or off. I would expect other vendors to have something similar or have
no modules at all and just an on/off switch?
I bet all vendors at this point have separate modules in the FW.
It's been the case for a while, that's why we have multiple versions
supported in devlink dev info.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help