Re: BNX2: Kernel crashes with 2.6.31 and 2.6.31.9

From: Bruno Prémont <bonbons@linux-vserver.org>
Date: 2009-12-29 09:33:24
Also in: lkml

Hi Benjamin,

On Tue, 29 Dec 2009 01:05:40 "Benjamin Li" [off-list ref] wrote:

Hi Bruno,

It looks like the NULL dereference is happening at a0fc.

a0f8:       48 8b 42 70             mov    0x70(%rdx),%rax
a0fc:       0f b7 10                movzwl (%rax),%edx
a0ff:       31 c0                   xor    %eax,%eax

Thanks for confirming my guess.

The offset of 0x70 (112) is the hw_tx_cons_ptr field in the bnx2_napi
structure (see the bnx2_napi structure dump below).  These instructions
come from the routine bnx2_get_hw_tx_cons(), which looks like it was
inlined by the compiler.  More specifically, it looks like the
dereference of hw_tx_cons_ptr failed.

cons = *bnapi->hw_tx_cons_ptr;

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/net/bnx2.c;h=06b901152d4487fa04164437cc179661b44657fe;hb=74fca6a42863ffacaf7ba6f1936a9f228950f657#l2761
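
As a rough illustration of what the three instructions at a0f8..a10c
appear to do, here is a user-space sketch in C. It is not the driver
source: model_get_hw_tx_cons() and MODEL_MAX_TX_DESC_CNT are
hypothetical names, the 0xff mask is inferred only from the
"cmp $0xff,%dl" in the disassembly, and the NULL check is added purely
to show the failing case (the trace suggests the inlined code has no
such check).

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed ring mask: the disassembly compares the low byte with 0xff. */
#define MODEL_MAX_TX_DESC_CNT 0xffu

/*
 * Hypothetical user-space model of the inlined consumer-index read.
 * Returns 0 and stores the index in *cons on success, or -1 when the
 * pointer is NULL (the case that would fault at a0fc in the trace).
 */
static int model_get_hw_tx_cons(const volatile uint16_t *hw_tx_cons_ptr,
                                uint16_t *cons)
{
    uint16_t c;

    if (hw_tx_cons_ptr == NULL)
        return -1;

    c = *hw_tx_cons_ptr;          /* corresponds to movzwl (%rax),%edx */
    if ((c & MODEL_MAX_TX_DESC_CNT) == MODEL_MAX_TX_DESC_CNT)
        c++;                      /* corresponds to cmp $0xff / sete / add */
    *cons = c;
    return 0;
}
```

Calling the model with a pointer to 0x01ff yields 0x0200, while a NULL
pointer takes the error path instead of faulting.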

To be sure this is the case, could you send the .config file you are
using?  Alternatively, if you could send me the bnx2 kernel module
built with the CFLAG '-g', we could verify definitively where in the
code it is crashing.

See the attached .config. If needed I can recompile the module with
'-g', but the original instance does not contain debugging info.

Did you see anything suspicious in the system kernel logs?  If you
could isolate the logs from when the machine booted to when it crashed
and send them to us, it would be very helpful.

Unfortunately there is nothing suspicious in there; all I have is the
attached dmesg (with IP addresses and MAC addresses replaced by '*'s).

I've not appended the crash dump gathered via netconsole which didn't
make it to the affected system's disk (see previous mail for it).


Regards,
Bruno

Thanks again for your time.

-Ben


<--snip snip structure dump from pahole-->
struct bnx2_napi {
        struct napi_struct         napi;                 /*     0    96 */
        /* --- cacheline 1 boundary (64 bytes) was 32 bytes ago --- */
        struct bnx2 *              bp;                   /*    96     8 */
        union {
                struct status_block * msi;               /*           8 */
                struct status_block_msix * msix;         /*           8 */
        } status_blk;                                    /*   104     8 */
        u16 *                      hw_tx_cons_ptr;       /*   112     8 */
        u16 *                      hw_rx_cons_ptr;       /*   120     8 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        u32                        last_status_idx;      /*   128     4 */
        u32                        int_num;              /*   132     4 */
        struct bnx2_rx_ring_info   rx_ring;              /*   136   360 */
        /* --- cacheline 7 boundary (448 bytes) was 48 bytes ago --- */
        struct bnx2_tx_ring_info   tx_ring;              /*   496    48 */
        /* --- cacheline 8 boundary (512 bytes) was 32 bytes ago --- */

        /* size: 576, cachelines: 9 */
        /* padding: 32 */
};
<--snip snip-->
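
The pahole offsets can be cross-checked against the 0x70 displacement
in the faulting instruction with a small user-space sketch. The struct
below is a hypothetical stand-in, not the kernel definition: it models
napi as an opaque 96-byte blob and each 8-byte pointer as a uint64_t,
using only the sizes shown in the dump. model_field_at() is likewise an
illustrative helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct bnx2_napi, sized from the pahole dump. */
struct model_bnx2_napi {
    unsigned char napi[96];        /* struct napi_struct, 96 bytes   */
    uint64_t      bp;              /* struct bnx2 *, offset 96       */
    uint64_t      status_blk;      /* union of pointers, offset 104  */
    uint64_t      hw_tx_cons_ptr;  /* u16 *, offset 112              */
    uint64_t      hw_rx_cons_ptr;  /* u16 *, offset 120              */
};

/* Map a byte offset to the name of the modeled field that starts there. */
static const char *model_field_at(size_t off)
{
    if (off == offsetof(struct model_bnx2_napi, bp))
        return "bp";
    if (off == offsetof(struct model_bnx2_napi, status_blk))
        return "status_blk";
    if (off == offsetof(struct model_bnx2_napi, hw_tx_cons_ptr))
        return "hw_tx_cons_ptr";
    if (off == offsetof(struct model_bnx2_napi, hw_rx_cons_ptr))
        return "hw_rx_cons_ptr";
    return "unknown";
}
```

With this model, offset 0x70 (112) resolves to hw_tx_cons_ptr and
offset 0x60 (96) resolves to bp, which matches the dump above.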

On Mon, 2009-12-28 at 23:49 -0800, Bruno Prémont wrote:


On a system that has been running 2.6.31 since last September I got
two crashes at night this December (cause unknown). Yesterday, after
the second crash, I updated the kernel to 2.6.31.9 and enabled
netconsole in the hope of getting some information about the cause of
the crash.

Today the system crashed once again and all I got is the following
incomplete trace on the receiving side of netconsole:

[24701.841185] BUG: unable to handle kernel NULL pointer dereference at (null)
[24701.841188] IP: [<ffffffffa00610fc>] bnx2_poll_work+0x2c/0x12d0 [bnx2]
[24701.841197] PGD 16509067 PUD 4e776067 PMD 0
[24701.841199] Oops: 0000 [#1] SMP
[24701.841202] last sysfs file: /sys/kernel/uevent_seqnum
[24701.841204] CPU 0
[24701.841205] Modules linked in: ipmi_devintf squashfs ext2 zlib_inflate netconsole configfs loop dm_round_robin scsi_dh_rdac dm_multipath scsi_dh dm_mod sg sr_mod cdrom ata_piix ipmi_si ipmi_msghandler qla2xxx ahci bnx2 hpwdt uhci_hcd ehci_hcd libata
[24701.841218] Pid: 11273, comm: php-cgi Not tainted 2.6.31.9-x86_64 #1 ProLiant DL360 G5
[24701.841220] RIP: 0010:[<ffffffffa00610fc>]  [<ffffffffa00610fc>] bnx2_poll_work+0x2c/0x12d0 [bnx2]


Running objdump on the bnx2.ko module I get the following:
000000000000a0d0 <bnx2_poll_work>:
    a0d0:       41 57                   push   %r15
    a0d2:       41 56                   push   %r14
    a0d4:       41 55                   push   %r13
    a0d6:       41 54                   push   %r12
    a0d8:       55                      push   %rbp
    a0d9:       53                      push   %rbx
    a0da:       48 81 ec 28 01 00 00    sub    $0x128,%rsp
    a0e1:       48 89 7c 24 18          mov    %rdi,0x18(%rsp)
    a0e6:       48 89 74 24 10          mov    %rsi,0x10(%rsp)
    a0eb:       89 54 24 0c             mov    %edx,0xc(%rsp)
    a0ef:       89 4c 24 08             mov    %ecx,0x8(%rsp)
    a0f3:       48 8b 54 24 10          mov    0x10(%rsp),%rdx
    a0f8:       48 8b 42 70             mov    0x70(%rdx),%rax
    a0fc:       0f b7 10                movzwl (%rax),%edx
    a0ff:       31 c0                   xor    %eax,%eax
    a101:       48 8b 4c 24 10          mov    0x10(%rsp),%rcx
    a106:       80 fa ff                cmp    $0xff,%dl
    a109:       0f 94 c0                sete   %al
    a10c:       01 c2                   add    %eax,%edx
    a10e:       66 39 91 1a 02 00 00    cmp    %dx,0x21a(%rcx)
    a115:       0f 84 78 01 00 00       je     a293 <bnx2_poll_work+0x1c3>
    a11b:       48 8b 57 08             mov    0x8(%rdi),%rdx
    a11f:       48 89 f8                mov    %rdi,%rax
    a122:       48 8b 9a 00 03 00 00    mov    0x300(%rdx),%rbx
    a129:       48 83 c0 40             add    $0x40,%rax
    a12d:       48 29 c1                sub    %rax,%rcx
    a130:       48 89 c8                mov    %rcx,%rax
    a133:       48 c1 f8 06             sar    $0x6,%rax
    a137:       69 c0 39 8e e3 38       imul   $0x38e38e39,%eax,%eax
    a13d:       48 c1 e0 07             shl    $0x7,%rax
    a141:       48 01 d8                add    %rbx,%rax
    a144:       48 89 44 24 20          mov    %rax,0x20(%rsp)
    a149:       48 8b 7c 24 10          mov    0x10(%rsp),%rdi
    a14e:       48 8b 47 70             mov    0x70(%rdi),%rax
    a152:       44 0f b7 30             movzwl (%rax),%r14d
    a156:       31 c0                   xor    %eax,%eax
    a158:       0f b7 9f 18 02 00 00    movzwl 0x218(%rdi),%ebx
    a15f:       41 80 fe ff             cmp    $0xff,%r14b
    a163:       0f 94 c0                sete   %al
    a166:       45 31 ff                xor    %r15d,%r15d
    a169:       41 01 c6                add    %eax,%r14d
    a16c:       66 44 39 f3             cmp    %r14w,%bx
    a170:       0f 84 ee 00 00 00       je     a264 <bnx2_poll_work+0x194>
    a176:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
    a17d:       00 00 00
    a180:       0f b6 cb                movzbl %bl,%ecx
    a183:       48 8b 44 24 10          mov    0x10(%rsp),%rax
    a188:       44 0f b7 e1             movzwl %cx,%r12d
    a18c:       49 c1 e4 04             shl    $0x4,%r12
    a190:       4c 03 a0 10 02 00 00    add    0x210(%rax),%r12
    a197:       4d 8b 2c 24             mov    (%r12),%r13
    a19b:       66 41 83 7c 24 08 00    cmpw   $0x0,0x8(%r12)
    a1a2:       41 0f 18 8d bc 00 00    prefetcht0 0xbc(%r13)
    a1a9:       00 
                ...
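
To tie the oops to the listing above: the trace reports
bnx2_poll_work+0x2c/0x12d0, and objdump places bnx2_poll_work at a0d0
in bnx2.ko, so the faulting instruction should sit at a0d0 + 0x2c =
a0fc. A small sketch of that arithmetic in C follows; struct
oops_symbol, parse_oops_symbol() and module_address() are hypothetical
helper names made up for illustration, not part of any kernel tooling.

```c
#include <stdio.h>

/* Parsed form of a "symbol+0xOFFSET/0xSIZE" token from an oops line. */
struct oops_symbol {
    char          name[64];
    unsigned long offset;   /* bytes into the function */
    unsigned long size;     /* total function size in bytes */
};

/* Parse e.g. "bnx2_poll_work+0x2c/0x12d0"; returns 0 on success, -1 otherwise. */
static int parse_oops_symbol(const char *text, struct oops_symbol *out)
{
    int n = sscanf(text, "%63[^+]+%lx/%lx",
                   out->name, &out->offset, &out->size);
    return n == 3 ? 0 : -1;
}

/* Address inside the module image, given the symbol's start from objdump. */
static unsigned long module_address(unsigned long symbol_start,
                                    const struct oops_symbol *sym)
{
    return symbol_start + sym->offset;
}
```

Feeding it the token from the trace together with the a0d0 start
address gives a0fc, the movzwl (%rax),%edx line in the listing.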


Kernel is compiled on Gentoo (64bit):
  Linux version 2.6.31.9-x86_64 () (gcc version 4.3.4 (Gentoo 4.3.4 p1.0, pie-10.1.5) ) #1 SMP Mon Dec 28 15:49:16 CET 2009

The affected server (HP DL360 G5) is running OpenSuSE-11.1, 32bit
userspace

Any idea if there is a recent patch that could fix this issue? At
the time of the crash the server was not particularly loaded and had
around 200 packets/s of network traffic.

Regards,
Bruno

Attachments

dmesg [text/plain] 50108 bytes
.config [text/plain] 51367 bytes
