Re: BNX2: Kernel crashes with 2.6.31 and 2.6.31.9
From: Bruno Prémont <bonbons@linux-vserver.org>
Date: 2010-02-23 12:15:43
Also in:
lkml
Hi Benjamin,

On Fri, 19 February 2010 "Benjamin Li" [off-list ref] wrote:
> From your logs it looks like the device came up using MSI, but the
> MSI-X poll routine was being called:
>
> [ 9.836673] bnx2: eth0: using MSI
> ...
> [ 134.643459] [<ffffffffa004019e>] bnx2_poll_msix+0x3e/0xd0 [bnx2]
> [ 134.643465] [<ffffffff8135bcd1>] netpoll_poll+0xe1/0x3c0
>
> which is incorrect. If we are in MSI mode, the bnx2_poll() routine
> should be used.
>
> I think what is going on here is that during the bnx2x driver
> initialization the current bnx2 driver adds all possible NAPI
> structures that map to all the hardware vectors (BNX2_MAX_MSIX_VEC=9)
> to the NAPI list in the net_device structure, regardless of whether
> they are used or not (seen in drivers/net/bnx2.c:bnx2_init_napi()).
> This can cause uninitialized NAPI structures to be placed on the
> napi_list. Because this device is in MSI mode, only 1 vector is
> initialized.
>
> Now, the problem is triggered when net/core/netpoll.c:poll_napi() is
> called. This is because this routine will run through the entire
> napi_list, calling all the poll routines. In your particular case, it
> is calling the poll routine on an uninitialized vector, causing the
> kernel panic.
>
> Please try the patch below to see if it solves your problem. Note,
> this has only been compile tested and tested against basic traffic
> runs. Unfortunately, I could not reproduce the kernel panic with the
> instructions below to verify the patch.
>
> Thanks again for all your help in helping us track this down.
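For readers following the thread, the failure mode described in the quoted analysis can be illustrated with a small user-space sketch. This is not driver code: fake_napi, fake_ring, fake_dev, register_napi(), poll_all() and count_bad_polls() are all hypothetical names made up for the illustration, and a NULL ring pointer merely stands in for "uninitialized vector". Only the vector count of 9 is taken from the quoted message. Registering only the initialized vectors is one plausible remedy assumed here; the actual patch is not shown in this message.

```c
#include <stddef.h>

#define MAX_VEC 9  /* mirrors BNX2_MAX_MSIX_VEC=9 from the quoted analysis */

/* Hypothetical stand-ins; none of these are kernel types. */
struct fake_ring { int pending; };

struct fake_napi {
    int (*poll)(struct fake_napi *);
    struct fake_ring *ring;            /* NULL models an uninitialized vector */
};

struct fake_dev {
    struct fake_napi *napi_list[MAX_VEC];
    int napi_count;
};

/* Returns -1 where a real poll routine would touch uninitialized state. */
static int fake_poll(struct fake_napi *n)
{
    if (n->ring == NULL)
        return -1;
    return n->ring->pending;
}

/* Put the first 'count' NAPI contexts on the device list. */
static void register_napi(struct fake_dev *dev, struct fake_napi *all, int count)
{
    dev->napi_count = 0;
    for (int i = 0; i < count; i++) {
        all[i].poll = fake_poll;
        dev->napi_list[dev->napi_count++] = &all[i];
    }
}

/* Like a netpoll-style walk: call every poll routine on the list. */
static int poll_all(const struct fake_dev *dev)
{
    int bad = 0;
    for (int i = 0; i < dev->napi_count; i++)
        if (dev->napi_list[i]->poll(dev->napi_list[i]) < 0)
            bad++;
    return bad;
}

/* Count polls that hit an uninitialized vector for a given setup. */
int count_bad_polls(int registered, int initialized)
{
    static struct fake_ring rings[MAX_VEC];
    struct fake_napi all[MAX_VEC] = {0};
    struct fake_dev dev;

    for (int i = 0; i < initialized; i++)
        all[i].ring = &rings[i];
    register_napi(&dev, all, registered);
    return poll_all(&dev);
}
```

Calling count_bad_polls(MAX_VEC, 1) models the MSI case from the logs (nine contexts on the list, one initialized) and reports eight polls on uninitialized vectors; count_bad_polls(1, 1) models a list holding only the vector in use and reports none.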
I applied the patch today and tried to reproduce with my showcases. It
seems to be harder to trigger now, but I am still able to crash the box.
I don't know whether it's the same cause or not (it could also be the
tcp-retransmit ghost)...

This time I had to run a few parallel scp's (8Mb/s each) to the box and
run 'echo t > /proc/sysrq-trigger' multiple times via an ssh session for
it to happen. It didn't trigger with my netbomb, though (I will try some
more and see).

I can't yet tell whether the reason is the same; hopefully something
reached disk, as the serial console is dead and pings are no longer
answered. It's probably some printk/bug/warn that triggers in the
network stack and deadlocks with netconsole.

Regards,
Bruno