Thread (31 messages) 31 messages, 6 authors, 2024-06-11

RE: [PATCH net-next] virtio_net: Add TX stop and wake counters

From: Dan Jurgens <hidden>
Date: 2024-06-07 12:37:02
Also in: virtualization

From: Jason Xing <redacted>
Sent: Friday, June 7, 2024 4:16 AM
To: Dan Jurgens <redacted>
Cc: Jakub Kicinski <kuba@kernel.org>; Michael S. Tsirkin <mst@redhat.com>;
netdev@vger.kernel.org; jasowang@redhat.com;
xuanzhuo@linux.alibaba.com; virtualization@lists.linux.dev;
davem@davemloft.net; edumazet@google.com; abeni@redhat.com; Parav
Pandit [off-list ref]
Subject: Re: [PATCH net-next] virtio_net: Add TX stop and wake counters

On Sat, Feb 3, 2024 at 12:46 AM Daniel Jurgens [off-list ref] wrote:
quoted
quoted
From: Jakub Kicinski <kuba@kernel.org>
Sent: Friday, February 2, 2024 10:01 AM
Subject: Re: [PATCH net-next] virtio_net: Add TX stop and wake
counters

On Fri, 2 Feb 2024 14:52:59 +0800 Jason Xing wrote:
quoted
quoted
Can you say more? I'm curious what's your use case.
I'm not working at Nvidia, so my point of view may differ from theirs.
From what I can tell is that those two counters help me narrow
down the range if I have to diagnose/debug some issues.
right, i'm asking to collect useful debugging tricks, nothing
against the patch itself :)
quoted
1) I sometimes notice that if some irq is held too long (say, one
simple case: output of printk printed to the console), those two
counters can reflect the issue.
2) Similarly in virtio net, recently I traced such counters the
current kernel does not have and it turned out that one of the
output queues in the backend behaves badly.
...

Stop/wake queue counters may not show directly the root cause of
the issue, but help us 'guess' to some extent.
I'm surprised you say you can detect stall-related issues with this.
I guess virtio doesn't have BQL support, which makes it special.
Normal HW drivers with BQL almost never stop the queue by themselves.
I mean - if they do, and BQL is active, then the system is probably
misconfigured (queue is too short). This is what we use at Meta to
detect stalls in drivers with BQL:

https://lore.kernel.org/all/20240131102150.728960-3-leitao@debian.or
g/

Daniel, I think this may be a good enough excuse to add per-queue
stats to the netdev genl family, if you're up for that. LMK if you
want more info, otherwise I guess ethtool -S is fine for now.
For now, I would like to go with ethtool -S. But I can take on the netdev
level as a background task. Are you suggesting adding them to
rtnl_link_stats64?

Hello Daniel, Jakub,

Sorry to revive this thread. I wonder why not use this patch like mlnx driver
does instead of adding statistics into the yaml file? Are we gradually using or
adding more fields into the yaml file to replace the 'ethtool -S' command?
It's trivial to have the stats in ethtool as well. But I noticed the stats series intentionally removed some stats from ethtool. So I didn't put it both places.
Thanks,
Jason
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help