Thread (13 messages) 13 messages, 3 authors, 2020-10-08

Re: [RFC 0/3] staging: qlge: Re-writing the debugging features

From: Coiby Xu <hidden>
Date: 2020-10-08 12:52:24

On Thu, Aug 27, 2020 at 05:54:43PM +0800, Shung-Hsi Yu wrote:
On Wed, Aug 26, 2020 at 03:52:06PM +0800, Shung-Hsi Yu wrote:
quoted
On Sat, Aug 15, 2020 at 12:05:58AM +0800, Coiby Xu wrote:
quoted
This patch set aims to avoid dumping registers, data structures and
coredump to dmesg and also to reduce the code size of the qlge driver.

As pointed out by Benjamin [1],
quoted
At 2000 lines, qlge_dbg.c alone is larger than some entire ethernet
drivers. Most of what it does is dump kernel data structures or pci
memory mapped registers to dmesg. There are better facilities for that.
My thinking is not simply to delete qlge_dbg.c but to replace it, making
sure that most of the same information is still available. For data
structures, crash or drgn can be used; possibly with a script for the
latter which formats the data. For pci registers, they should be
included in the ethtool register dump and a patch added to ethtool to
pretty print them. That's what other drivers like e1000e do. For the
"coredump", devlink health can be used.
So the debugging features are re-written following Benjamin's advice,
   - use ethtool to dump registers
   - dump kernel data structures in drgn
   - use devlink health to do coredump

The get_regs ethtool_ops has already implemented. What lacks is a patch
for the userland ethtool to do the pretty-printing. I haven't yet provided
a patch to the userland ethtool because I'm aware ethtool is moving towards
the netlink interface [2]. I'm curious if a generalized mechanism of
pretty-printing will be implemented thus making pretty-printing for a
specific driver unnecessary. As of this writing, `-d|--register-dump`
hasn't been implemented for the netlink interface.


To dump kernel data structures, the following Python script can be used
in drgn,

    def align(x, a):
        """the alignment a should be a power of 2
        """
        mask = a - 1
        return (x+ mask) & ~mask

    def struct_size(struct_type):
        struct_str = "struct {}".format(struct_type)
        return sizeof(Object(prog, struct_str, address=0x0))

    def netdev_priv(netdevice):
        NETDEV_ALIGN = 32
        return netdevice.value_() + align(struct_size("net_device"), NETDEV_ALIGN)

    name = 'xxx'
    qlge_device = None
    netdevices = prog['init_net'].dev_base_head.address_of_()
    for netdevice in list_for_each_entry("struct net_device", netdevices, "dev_list"):
        if netdevice.name.string_().decode('ascii') == name:
            print(netdevice.name)

    ql_adapter = Object(prog, "struct ql_adapter", address=netdev_priv(qlge_device))
The struct ql_adapter will be printed in drgn as follows,
    >>> ql_adapter
    (struct ql_adapter){
            .ricb = (struct ricb){
                    .base_cq = (u8)0,
                    .flags = (u8)120,
                    .mask = (__le16)26637,
                    .hash_cq_id = (u8 [1024]){ 172, 142, 255, 255 },
                    .ipv6_hash_key = (__le32 [10]){},
                    .ipv4_hash_key = (__le32 [4]){},
            },
            .flags = (unsigned long)0,
            .wol = (u32)0,
            .nic_stats = (struct nic_stats){
                    .tx_pkts = (u64)0,
                    .tx_bytes = (u64)0,
                    .tx_mcast_pkts = (u64)0,
                    .tx_bcast_pkts = (u64)0,
                    .tx_ucast_pkts = (u64)0,
                    .tx_ctl_pkts = (u64)0,
                    .tx_pause_pkts = (u64)0,
                    ...
            },
            .active_vlans = (unsigned long [64]){
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 52780853100545, 18446744073709551615,
                    18446619461681283072, 0, 42949673024, 2147483647,
            },
            .rx_ring = (struct rx_ring [17]){
                    {
                            .cqicb = (struct cqicb){
                                    .msix_vect = (u8)0,
                                    .reserved1 = (u8)0,
                                    .reserved2 = (u8)0,
                                    .flags = (u8)0,
                                    .len = (__le16)0,
                                    .rid = (__le16)0,
                                    ...
                            },
                            .cq_base = (void *)0x0,
                            .cq_base_dma = (dma_addr_t)0,
                    }
                    ...
            }
    }


And the coredump obtained via devlink in json format looks like,

    $ devlink health dump show DEVICE reporter coredump -p -j
    {
        "Core Registers": {
            "segment": 1,
            "values": [ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ]
        },
        "Test Logic Regs": {
            "segment": 2,
            "values": [ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ]
        },
        "RMII Registers": {
            "segment": 3,
            "values": [ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ]
        },
        ...
        "Sem Registers": {
            "segment": 50,
            "values": [ 0,0,0,0 ]
        }
    }

Since I don't have a QLGE device and neither could I find a software
simulator, I put some functions into e1000 to get the above result.
I'm not familiar with qlge, but I do happen to have one accessible. You can
add me to CC for future version of this patch.

Testing with the refreshed patch set I got from Benjamin (apply over commit
fc80c51fd4b2) with debugging options KASAN, UBSAN, DEBUG_ATOMIC_SLEEP,
PROVE_LOCKING and DEBUG_KMEMLEAK enabled, I can verify coredump through
devlink works.
Benjamin reminded me to check for errors in the remove callpaths as well,
and some error showed up in the remove callpath indeed.

 $ modprobe qlge
 [ 1314.256249] qlge: module is from the staging directory, the quality is unknown, you have been warned.
 [ 1320.184708] qlge 0000:00:04.0: QLogic 10 Gigabit PCI-E Ethernet Driver
 [ 1320.185619] qlge 0000:00:04.0: Driver name: qlge, Version: 1.00.00.35.
 [ 1320.196987] qlge 0000:00:04.0 eth0: Link is down.
 [ 1320.197830] qlge 0000:00:04.0 eth0: Clearing MAC address
 [ 1320.199184] qlge 0000:00:04.0 eth0: Function #0, Port 0, NIC Roll 0, NIC Rev = 1, XG Roll = 0, XG Rev = 1.
 [ 1320.200793] qlge 0000:00:04.0 eth0: MAC address 00:c0:dd:14:54:b8
 [ 1320.276835] qlge 0000:00:04.0 ens4: renamed from eth0
 [ 1326.920224] PCI Interrupt Link [LNKB] enabled at IRQ 10
 [ 1326.943443] qlge 0000:00:05.0 eth0: Link is down.
 [ 1326.944188] qlge 0000:00:05.0 eth0: Clearing MAC address
 [ 1326.944975] qlge 0000:00:05.0 eth0: Function #1, Port 1, NIC Roll 0, NIC Rev = 1, XG Roll = 0, XG Rev = 1.
 [ 1326.946396] qlge 0000:00:05.0 eth0: MAC address 00:c0:dd:14:54:ba
 [ 1326.997071] qlge 0000:00:05.0 ens5: renamed from eth0
 $ rmmod qlge

Simple load and unload works fine.

 $ modprobe qlge
 [ 1348.642124] qlge: module is from the staging directory, the quality is unknown, you have been warned.
 [ 1355.090448] qlge 0000:00:04.0: QLogic 10 Gigabit PCI-E Ethernet Driver
 [ 1355.091439] qlge 0000:00:04.0: Driver name: qlge, Version: 1.00.00.35.
 [ 1355.098890] qlge 0000:00:04.0 eth0: Link is down.
 [ 1355.099647] qlge 0000:00:04.0 eth0: Clearing MAC address
 [ 1355.100708] qlge 0000:00:04.0 eth0: Function #0, Port 0, NIC Roll 0, NIC Rev = 1, XG Roll = 0, XG Rev = 1.
 [ 1355.102301] qlge 0000:00:04.0 eth0: MAC address 00:c0:dd:14:54:b8
 [ 1355.170258] qlge 0000:00:04.0 ens4: renamed from eth0
 [ 1361.809918] qlge 0000:00:05.0 eth0: Link is down.
 [ 1361.810674] qlge 0000:00:05.0 eth0: Clearing MAC address
 [ 1361.811431] qlge 0000:00:05.0 eth0: Function #1, Port 1, NIC Roll 0, NIC Rev = 1, XG Roll = 0, XG Rev = 1.
 [ 1361.812594] qlge 0000:00:05.0 eth0: MAC address 00:c0:dd:14:54:ba
 [ 1361.874453] qlge 0000:00:05.0 ens5: renamed from eth0
 $ ip link set ens4 up
 [ 1370.994196] qlge 0000:00:04.0 ens4: MSI-X Enabled, got 2 vectors.
 [ 1371.199630] qlge 0000:00:04.0 ens4: Clearing MAC address
 [ 1371.299084] qlge 0000:00:04.0 ens4: Passed Get Port Configuration.
 $ devlink health dump show pci/0000:00:04.0 reporter coredump -p -j >log.jsonho
 $ rmmod qlge
 [ 1413.537710] qlge 0000:00:04.0 ens4: Link is down.
 [ 1413.538570] qlge 0000:00:04.0 ens4: Clearing MAC address
 [ 1413.539819] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1413.541068] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1413.649713] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1413.761705] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1413.873716] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1413.985746] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1414.097743] qlge 0000:00:04.0 ens4: Command not supported by firmware.

Unloading the module after coredump through devlink shows the above error.
At first glance this look quite similar to the errors I get when trying to
coredump when the interface is down, causing the device to go into an
invalid state as well.

 $ modprobe qlge
 [ 1606.229089] qlge: module is from the staging directory, the quality is unknown, you have been warned.
 [ 1612.423342] qlge 0000:00:04.0: QLogic 10 Gigabit PCI-E Ethernet Driver
 [ 1612.424551] qlge 0000:00:04.0: Driver name: qlge, Version: 1.00.00.35.
 [ 1612.432718] qlge 0000:00:04.0 eth0: Link is down.
 [ 1612.433627] qlge 0000:00:04.0 eth0: Clearing MAC address
 [ 1612.434621] qlge 0000:00:04.0 eth0: Function #0, Port 0, NIC Roll 0, NIC Rev = 1, XG Roll = 0, XG Rev = 1.
 [ 1612.436582] qlge 0000:00:04.0 eth0: MAC address 00:c0:dd:14:54:b8
 [ 1612.494647] qlge 0000:00:04.0 ens4: renamed from eth0
 [ 1619.078325] qlge 0000:00:05.0 eth0: Link is down.
 [ 1619.079119] qlge 0000:00:05.0 eth0: Clearing MAC address
 [ 1619.080165] qlge 0000:00:05.0 eth0: Function #1, Port 1, NIC Roll 0, NIC Rev = 1, XG Roll = 0, XG Rev = 1.
 [ 1619.081938] qlge 0000:00:05.0 eth0: MAC address 00:c0:dd:14:54:ba
 [ 1619.154477] qlge 0000:00:05.0 ens5: renamed from eth0
 $ ip link set ens4 up
 [ 1629.717532] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1629.720692] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1629.827878] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1629.939891] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1630.051855] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1630.163885] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1630.275872] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1630.387893] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1630.412685] qlge 0000:00:04.0 ens4: MSI-X Enabled, got 2 vectors.
 [ 1630.420868] qlge 0000:00:04.0 ens4: Failed about firmware command
 [ 1630.425110] qlge 0000:00:04.0 ens4: Failed to start port.
 [ 1630.427966] qlge 0000:00:04.0 ens4: Clearing MAC address
 [ 1630.487603] qlge 0000:00:04.0 ens4: Failed Get Port Configuration.

I'll report when I've more findings.
Thank you for the testing! I will assume these errors are caused by
coredump when the interface is down. Please let me know it still occurs
in v1.
<snip>
--
Best regards,
Coiby
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help