Re: BUG: 4.14.11 unable to handle kernel NULL pointer dereference in xfrm_lookup
From: Tobias Hommel <hidden>
Date: 2018-01-05 21:55:25
On Sat, Jan 06, 2018 at 12:27:11AM +0300, Ozgur wrote:
> 06.01.2018, 00:20, "Tobias Hommel" [off-list ref]:
> > Hi,
>
> Hi Tobias,
>
> > I'm running into a NULL pointer dereference after updating from Linux 4.1.6
> > to 4.14.11 (see kernel log below). I initially tried 4.14.3, which did not
> > work either. Does anyone have an idea what is happening here?
> >
> > The affected machine has 2 active ethernet interfaces (igb driver) and acts
> > as a VPN gateway running strongswan. Several hundred IPsec roadwarriors
> > connect to eth1. eth0 connects to an infrastructure running an HTTP server.
> > During my tests these roadwarriors connect to the gateway, sometimes
> > download a large file from the HTTP server, disconnect, and after a random
> > delay repeat these steps.
> >
> > Some observations I made:
> > * SMP affinity for IRQs of the NICs' Rx/Tx queues (/proc/irq/$IRQ/smp_affinity)
> >   * all affinities set to default ff is broken
> >   * setting affinity for all queues of both interfaces to the same CPU
> >     seems to work fine (running stable for more than 1 day now)
> >   * setting affinity of eth0 queues to CPU 1 and affinity of eth1 queues to
> >     CPU 2 is broken and seems to always trigger the bug on CPU 1
> > * the top 6 entries of the call trace are the same every time the system
> >   crashes, the other entries differ sometimes
> >
> > The bug is 100% reproducible on the Intel Atom machine from the log below
> > and also on an HP ProLiant Gen6 (also igb driver).
> >
> > I can, of course, provide further information (CPU, NIC, kernel config,
> > more traces, etc.) if required. If helpful, I could also run tests on an
> > HP ProLiant Gen9, which has different NICs (tg3).
> >
> > [ 7998.489094] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> > [ 7998.496993] IP: xfrm_lookup+0x2a/0x7e0
> > [ 7998.500759] PGD 0 P4D 0
> > [ 7998.503316] Oops: 0000 [#1] SMP PTI
> > [ 7998.506835] Modules linked in:
> > [ 7998.509929] CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 4.14.11 #3
> > [ 7998.516244] Hardware name: To be filled by O.E.M. CAR-2051/CAR, BIOS 1.01 07/11/2016
> > [ 7998.524039] task: ffff8826bb118000 task.stack: ffff947ac00f0000
> > [ 7998.530004] RIP: 0010:xfrm_lookup+0x2a/0x7e0
> > [ 7998.534298] RSP: 0018:ffff947ac00f3b60 EFLAGS: 00010246
> > [ 7998.539550] RAX: 0000000000000000 RBX: ffffffff93074040 RCX: 0000000000000000
> > [ 7998.546709] RDX: ffff947ac00f3bd8 RSI: 0000000000000000 RDI: ffffffff93074040
> > [ 7998.553868] RBP: ffffffff93074040 R08: 0000000000000002 R09: 0000000000000001
> > [ 7998.561026] R10: 0000000000000032 R11: 0000000000000000 R12: ffff947ac00f3bd8
> > [ 7998.568212] R13: 0000000000000000 R14: 0000000000000002 R15: ffff8826b69a8078
> > [ 7998.575395] FS: 0000000000000000(0000) GS:ffff8826bfc80000(0000) knlGS:0000000000000000
> > [ 7998.583550] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 7998.589324] CR2: 0000000000000020 CR3: 00000001781da000 CR4: 00000000001006e0
> > [ 7998.596482] Call Trace:
> > [ 7998.598959] __xfrm_route_forward+0xa4/0x110
> > [ 7998.603263] ip_forward+0x3e0/0x450
> > [ 7998.606778] ? ip_rcv_finish+0x61/0x3a0
> > [ 7998.610645] ip_rcv+0x2c4/0x390
> > [ 7998.613818] ? inet_del_offload+0x30/0x30
> > [ 7998.617857] __netif_receive_skb_core+0x751/0xb00
> > [ 7998.622562] ? skb_send_sock+0x40/0x40
> > [ 7998.626356] ? netif_receive_skb_internal+0x47/0xf0
> > [ 7998.631252] netif_receive_skb_internal+0x47/0xf0
> > [ 7998.635987] napi_gro_receive+0x70/0x90
> > [ 7998.639835] gro_cell_poll+0x53/0x90
> > [ 7998.643439] net_rx_action+0x1fc/0x310
> > [ 7998.647210] ? rebalance_domains+0x101/0x2b0
> > [ 7998.651500] __do_softirq+0xd5/0x1cf
> > [ 7998.655105] run_ksoftirqd+0x14/0x30
> > [ 7998.658712] smpboot_thread_fn+0xf9/0x150
> > [ 7998.662723] kthread+0xef/0x130
> > [ 7998.665893] ? sort_range+0x20/0x20
> > [ 7998.669404] ? kthread_park+0x60/0x60
> > [ 7998.673098] ret_from_fork+0x1f/0x30
> > [ 7998.676674] Code: 00 41 57 41 56 45 89 c6 41 55 41 54 49 89 f5 55 53 49 89 d4 48 89 fb 48 83 ec 40 65 48 8b 04 25 28 00 00 00 48 89 44 24 38 31 c0 <48> 8b 46 20 48 85 c9 44 0f b7 38 c7 44 24 0c 00 00 00 00 0f 84
> > [ 7998.695681] RIP: xfrm_lookup+0x2a/0x7e0 RSP: ffff947ac00f3b60
> > [ 7998.701479] CR2: 0000000000000020
> > [ 7998.704799] ---[ end trace 0544b1946919baad ]---
> > [ 7998.709442] Kernel panic - not syncing: Fatal exception in interrupt
> > [ 7998.715918] Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>
> This error does not look like it comes from the latest kernel version; I
> think the problem is in the NIC driver. Which Ethernet card model are you
> using?

This is what lspci shows for both NICs:

# lspci -nns 00:14.0
00:14.0 Ethernet controller [0200]: Intel Corporation Ethernet Connection I354 [8086:1f41] (rev 03)

I currently have no access to the other hardware where this is happening, but
I could get further information after the weekend.

> And which driver version do you use?

# ethtool -i eth0 # same for eth1
driver: igb
version: 5.4.0-k
firmware-version: 0.0.0
expansion-rom-version:
bus-info: 0000:00:14.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

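For anyone who wants to reproduce the affinity setups from the original report, here is a minimal sketch. It assumes the queue IRQs show up in /proc/interrupts with the interface name in the last column (igb typically names them like eth0-TxRx-0) and that CPUs are numbered from zero. The function names are made up for illustration and are not part of any existing tool.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, CPU numbering assumed zero-based.

# Print the hex affinity mask with only the given CPU set (CPU 1 -> 2).
cpu_to_mask() {
    printf '%x\n' $((1 << $1))
}

# Print the IRQ numbers whose last /proc/interrupts column is "<iface>"
# or starts with "<iface>-". $2 optionally overrides the file to read,
# which allows a dry run against a saved copy.
irqs_for_iface() {
    awk -v ifc="$1" '$NF ~ ("^" ifc "(-|$)") { sub(":", "", $1); print $1 }' \
        "${2:-/proc/interrupts}"
}

# Pin every IRQ of an interface to a single CPU (needs root).
pin_iface_to_cpu() {
    mask=$(cpu_to_mask "$2")
    for irq in $(irqs_for_iface "$1"); do
        echo "$mask" > "/proc/irq/$irq/smp_affinity"
    done
}
```

With these helpers, `pin_iface_to_cpu eth0 1; pin_iface_to_cpu eth1 2` would correspond to the split setup that crashes, and pinning both interfaces to the same CPU to the setup that has been stable so far.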
> > Best regards,
> > Tobias Hommel
>
> Ozgur
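
A note on reading the trace quoted above: the `Code:` line marks the faulting instruction as `<48> 8b 46 20`, which disassembles to `mov 0x20(%rsi),%rax`. RSI is 0 in the register dump and CR2 is 0x20, so the fault is consistent with a NULL pointer in RSI, which holds the second function argument in the x86-64 calling convention. To map the offset back to a source line, something like the following sketch may help. It assumes the oops was saved to a file and that a vmlinux with debug info from the same build is available; `oops_rip_symbol` is a hypothetical helper, while `scripts/faddr2line` and `scripts/decodecode` ship with the kernel source tree.

```shell
#!/bin/sh
# Sketch only: oops_rip_symbol is a made-up helper name.

# Print the "symbol+offset/size" of the first RIP line in a saved oops log.
# Accepts both "RIP: 0010:sym+0x.../0x..." and "RIP: sym+0x.../0x...".
oops_rip_symbol() {
    sed -n 's|.*RIP: \([0-9a-f]\{4\}:\)\{0,1\}\([A-Za-z_][A-Za-z0-9_.]*+0x[0-9a-f]*/0x[0-9a-f]*\).*|\2|p' "$1" \
        | head -n 1
}

# Example, run from the top of the kernel source tree (not executed here):
#   sym=$(oops_rip_symbol oops.txt)
#   ./scripts/faddr2line vmlinux "$sym"
#   ./scripts/decodecode < oops.txt
```

For the log in this thread the helper prints `xfrm_lookup+0x2a/0x7e0`, which is the form faddr2line expects.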