Thread: 24 messages, 5 authors, 2010-03-11

Re: BNX2: Kernel crashes with 2.6.31 and 2.6.31.9

From: Bruno Prémont <bonbons@linux-vserver.org>
Date: 2010-02-23 12:15:43
Also in: lkml

Hi Benjamin,

On Fri, 19 February 2010 "Benjamin Li" [off-list ref] wrote:
quoted
From your logs it looks like the device came up using MSI, but the
MSI-X poll routine was being called:

[    9.836673] bnx2: eth0: using MSI
...

[  134.643459]  [<ffffffffa004019e>] bnx2_poll_msix+0x3e/0xd0 [bnx2]
[  134.643465]  [<ffffffff8135bcd1>] netpoll_poll+0xe1/0x3c0

which is incorrect.  If we are in MSI mode, the bnx2_poll() routine
should be used.

I think what is going on here is that during the bnx2 driver
initialization the current bnx2 driver adds all possible NAPI
structures that map to all the hardware vectors (BNX2_MAX_MSIX_VEC=9)
to the NAPI list in the net_device structure regardless of whether
they are used or not (see drivers/net/bnx2.c:bnx2_init_napi()).  This
can cause uninitialized NAPI structures to be placed on the napi_list.
Because this device is in MSI mode, only 1 vector is initialized.
Now, the problem is triggered when net/core/netpoll.c:poll_napi() is
called. This is because this routine will run through the entire
napi_list calling all the poll routines.  In your particular case, it
is calling the poll routine on an uninitialized vector causing the
kernel panic.
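
To make the mechanism above concrete, here is a minimal user-space sketch.
It is not code from the bnx2 driver or from netpoll: every toy_* name is
hypothetical, and only the constant 9 is taken from BNX2_MAX_MSIX_VEC as
quoted above. A NULL ring pointer stands in for a vector that was never
initialized, and the walker counts such entries instead of dereferencing
them. Registering only the vectors in use is shown as one plausible remedy;
that is an assumption, since the patch itself is not reproduced in this
message.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_VEC 9  /* same value as BNX2_MAX_MSIX_VEC in the message */

/* Hypothetical stand-in for a per-vector NAPI context. */
struct toy_napi {
    int *ring;  /* NULL models a vector that was never initialized */
};

/* Hypothetical stand-in for the device and its napi_list. */
struct toy_dev {
    struct toy_napi vec[TOY_MAX_VEC];
    struct toy_napi *napi_list[TOY_MAX_VEC];
    size_t napi_count;
};

/* Initialize only the first `used` vectors, as in MSI mode with used == 1. */
static void toy_setup(struct toy_dev *d, int *rings, size_t used)
{
    assert(used <= TOY_MAX_VEC);
    for (size_t i = 0; i < TOY_MAX_VEC; i++)
        d->vec[i].ring = (i < used) ? &rings[i] : NULL;
    d->napi_count = 0;
}

/*
 * Put the first `count` vectors on the list.
 * count == TOY_MAX_VEC models the behaviour described in the message;
 * count == number of vectors in use models the assumed remedy.
 */
static void toy_register(struct toy_dev *d, size_t count)
{
    assert(count <= TOY_MAX_VEC);
    for (size_t i = 0; i < count; i++)
        d->napi_list[i] = &d->vec[i];
    d->napi_count = count;
}

/*
 * Walk the whole list the way a poll-everything loop would, and return how
 * many entries are uninitialized. In the real kernel each such entry would
 * be a poll call on garbage state rather than a counted event.
 */
static size_t toy_poll_all(const struct toy_dev *d)
{
    size_t unsafe = 0;
    for (size_t i = 0; i < d->napi_count; i++)
        if (d->napi_list[i]->ring == NULL)
            unsafe++;
    return unsafe;
}
```

With one initialized vector, registering all nine entries makes the walk hit
eight uninitialized ones, while registering a single entry makes it hit none.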

Please try the patch below to see if it solves your problem.  Note,
this has only been compile tested and tested against basic traffic
runs. Unfortunately, I could not reproduce the kernel panic with the
instructions below to verify the patch.

Thanks again for all your help in helping us track this down.

I applied the patch today and tried to reproduce with my showcases.

Seems that it's harder to trigger now but I still end up being able to
crash the box. Don't know if it's the same cause or not (could also
be the tcp-retransmit ghost)...

This time I had to run a few parallel scp's (8Mb/s each) to the box and
'echo t > /proc/sysrq-trigger' multiple times via ssh session for it to
happen. It didn't trigger with my netbomb (though I will try some more
and see).

I don't know if it's the same reason or not (hopefully something
reached disk, as the serial console is dead and pings are no longer
answered).
It's probably some printk/bug/warn that triggers in the network stack
and deadlocks with netconsole.

Regards,
Bruno