Re: BUG: 4.14.11 unable to handle kernel NULL pointer dereference in xfrm_lookup
From: Tobias Hommel <hidden>
Date: 2018-01-19 14:45:50
On Wed, Jan 10, 2018 at 10:03:05AM +0100, Tobias Hommel wrote:
On Wed, Jan 10, 2018 at 08:30:38AM +0100, Steffen Klassert wrote:
On Tue, Jan 09, 2018 at 03:49:21PM +0100, Tobias Hommel wrote:
I copied the config from my 4.14.12 sources to a fresh 4.13.16 source tree, ran `make olddefconfig` and built a new kernel. The kernel config is attached as kernel-4.13.16.config. The panic*.log files are kernel logs from different crashes of this 4.13.16 kernel, but all from the same scenario as before. I also enabled CONFIG_DEBUG_INFO, so if any disassemblies are required, I'd be happy to provide them.

So, the system still crashes, but the traces are completely different from those with 4.14.12. This time there are also WARNINGs and "refcnt: -1" messages sometimes before the actual panic, so not sure if there is maybe some other problem. Still, the crashes all seem to be related to ip routing somehow.

Strange, you must do something that other people don't do. Do you have some uncommon netfilter rules, namespaces, etc?

No, no namespaces yet. However, the box uses marks and routing based on marks. Firewall marks are a bit strange sometimes, so I'll try to clean up everything and see if it is possible to reproduce the bug without marks.
I tried to strip down the system configuration and was able to reproduce the problem with a minimal configuration:
* ipsets are not used anymore
* no firewall markings are used any longer
* iptables are "completely empty", i.e. all policies set to ACCEPT and there is no rule in any table
* no additional routing policies (ip rule) except the default ones
* only main routing table is used
* using a "minimal" kernel config:
  * run `make defconfig`
  * add basic things (ESP, IGB driver, some crypto algorithms)
  * add options required to boot up the system (TPM crypt, some device mapper options, overlayfs)

I attached the minimal config (minimal.config) and the defconfig for reference (minimal.defconfig).

The setup is really simple now, the gateway is forwarding HTTP connections between eth1 (IPSec tunnels) and eth0 without any firewall, NAT, whatsoever.

The only thing I can think of are the rather aggressive roadwarrior clients. There are 750 roadwarriors that are controlled by a script which starts and stops the IPSec connection. Sometimes the clients are also instructed to start an HTTP download. Sometimes the clients are also stopped the hard way (kill -9) so SAs are not removed on the gateway. The clients reconnect after a random interval (sometimes immediately) and sometimes also immediately start a new HTTP download.

Maybe something is wrong with strongswan removing old SAs and creating new ones for "the same client"? Maybe while the kernel is processing an HTTP packet from an old client connection, a new SA for the same client is set up and then the routing lookup fails (I only know that xfrm is involved in routing lookups, but I'm no expert here)?
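For anyone trying to reproduce this without the real clients, the churn described above can be modelled roughly as follows. This is only an illustrative Python sketch and not the actual control script: the function name, the event names, the probabilities and the time ranges are all assumptions, and only the figure of 750 roadwarriors comes from the description. It produces a schedule of events and does not run any IPSec or HTTP commands itself.

```python
import random

def churn_schedule(num_clients=750, rounds=3, seed=0):
    """Return (time, client, action) tuples sorted by time.

    Hypothetical model: every client connects, sometimes starts an HTTP
    download right away, is later stopped either cleanly or with a hard
    kill (which would leave a stale SA on the gateway), and reconnects
    after a random delay that may be zero.
    """
    rng = random.Random(seed)
    events = []
    for client in range(num_clients):
        t = rng.uniform(0.0, 10.0)  # assumed start-up jitter in seconds
        for _ in range(rounds):
            events.append((t, client, "connect"))
            if rng.random() < 0.5:  # assumed download probability
                events.append((t, client, "http_download"))
            t += rng.uniform(1.0, 30.0)  # assumed connection lifetime
            # assumed share of hard kills (kill -9) versus clean stops
            action = "hard_kill" if rng.random() < 0.3 else "clean_stop"
            events.append((t, client, action))
            # reconnect immediately or after a random pause
            t += rng.choice([0.0, rng.uniform(0.0, 20.0)])
    # list.sort is stable, so events with equal timestamps keep their order
    events.sort(key=lambda e: e[0])
    return events
```

A driver script could walk this list and translate each action into whatever command starts, stops or kills the client in a given test bed.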
Please try to build your kernels with CONFIG_ORC_UNWINDER (v4.14 and above) and CONFIG_KASAN. This can give some better debug information (depends on the compiler version).

I'll also try that. I'm currently using GCC 5.4.0.
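Before rebuilding, it can help to confirm that the suggested options really ended up in the generated .config. The following is a minimal Python sketch for that check; the function names are made up for illustration, and the default option names are simply the ones mentioned in this thread. The symbol names may differ between kernel versions, so pass the names your tree actually uses.

```python
def enabled_options(config_text):
    """Return the set of CONFIG_* symbols set to 'y' or 'm' in .config text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Skip blanks and comments such as "# CONFIG_FOO is not set"
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and value in ("y", "m"):
            enabled.add(name)
    return enabled

def missing_debug_options(config_text,
                          wanted=("CONFIG_ORC_UNWINDER", "CONFIG_KASAN")):
    """List the wanted options that are not enabled in the given .config text."""
    on = enabled_options(config_text)
    return [opt for opt in wanted if opt not in on]
```

Usage would be along the lines of `missing_debug_options(open(".config").read())`, where an empty list means both options are set.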
There are some things we can do now:
- Try v4.15-rc7, just to be sure that we don't search for something that is already fixed.

And that one, too. All this will probably take some time though. ;-) I'll keep you informed.
I tried 4.15-rc8 and have the same problem here (see attached
kernel-4.15-rc8.log). SMP affinity for IRQs has changed in 4.15 and something's
broken there ("do_IRQ: 0.41 No irq handler for vector") and although I could
not spread IRQs over all cores I was able to "pin" different IRQs to different
cores and reproduce the problem.
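For reference, pinning an IRQ to a single core means writing a hexadecimal CPU bitmask to /proc/irq/<n>/smp_affinity. The Python sketch below only computes which mask would go to which file; it does not write anything. The helper names and the IRQ numbers in the usage note are hypothetical, and the mask format assumes fewer than 32 cores (larger systems use comma-separated 32-bit groups, which this sketch does not handle).

```python
def cpu_mask(cpu):
    """Hex bitmask selecting exactly one CPU, e.g. CPU 3 -> '8'."""
    if not 0 <= cpu < 32:
        raise ValueError("this sketch only handles CPUs 0..31")
    return format(1 << cpu, "x")

def pin_plan(irqs, num_cpus):
    """Assign IRQs to CPUs round-robin.

    Returns (path, mask) pairs; writing each mask to its path as root
    would pin that IRQ to one core.
    """
    return [
        ("/proc/irq/%d/smp_affinity" % irq, cpu_mask(i % num_cpus))
        for i, irq in enumerate(irqs)
    ]
```

For example, `pin_plan([24, 25, 26], 2)` maps the made-up IRQs 24 and 26 to CPU 0 (mask 1) and IRQ 25 to CPU 1 (mask 2).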
KASAN also reports some "use-after-free" during startup in the page
poisoning code. So I disabled page poisoning "to get rid of this bug", but the
problem persists.
- Find a working kernel version and try to bisect.
- Minimize the configuration with which the bug happens, so that I can try to reproduce it here.
Attachments
- minimal.config [text/plain] 115780 bytes
- minimal.defconfig [text/plain] 115182 bytes
- kernel-4.15-rc8.log [text/plain] 52641 bytes