Re: KASAN debug kernel fails to boot at early stage when CONFIG_SMP=y is set (kernel 6.5-rc5, PowerMac G4 3,6)
From: Erhard Furtner <hidden>
Date: 2023-08-31 22:45:39
On Thu, 31 Aug 2023 05:32:46 +0000 Christophe Leroy [off-list ref] wrote:
> Ok, so there is some corrupted memory somewhere. Can you try what happens when you remove the call to kasan_init() at the start of setup_arch() in arch/powerpc/kernel/setup-common.c?
Ok, so I left the other patches in place, plus btext_map() instead of btext_unmap() at the end of MMU_init(), plus Michael's patch, and additionally commented out kasan_init() as stated above. The outcome is rather interesting! Now I deterministically get this output on the OF console at boot, regardless of whether it's a cold boot or a warm boot:

via-pmu: Server Mode is disabled
PMU driver v2 initialized for Core99, firmware: 0c
ioremap() called early from pmac_nvram_init+0x208/0x7c0. Use early_ioremap() instead
nvram: Checking bank 0...
nvram: gen0=3234, gen1=3235
nvram: Active bank is: 1
nvram: OF partition at 0x410
nvram: XP partition at 0x1020
nvram: NR partition at 0x1120
Top of RAM: 0x80000000, Total RAM: 0x80000000
Memory hole size: 0MB
Zone ranges:
  DMA      [mem 0x0000000000000000-0x000000002fffffff]
  Normal   empty
  HighMem  [mem 0x0000000030000000-0x000000007fffffff]
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x0000000000000000-0x000000007fffffff]
Initmem setup node 0 [mem 0x0000000000000000-0x000000007fffffff]
percpu: Embedded 14 pages/cpu s24608 r8192 d24544 u57344
pcpu-alloc: s24608 r8192 d24544 u57344 alloc=14*4096
pcpu-alloc: [0] 0
Kernel command line: ro root=/dev/sda5 nr_cpus=1 zswap.max_pool_percent=16 slub_debug=FZP page_poison=1 netconsole=6666@192.168.178.8/eth0,6666@192.168.178.3/70:85:C2:30:EC:01 init=/usr/lib/systemd/systemd
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes, linear)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
Built 1 zonelists, mobility grouping on. Total pages: 522560
mem auto-init: stack:all(pattern), heap alloc:off, heap free:off
stackdepot: allocating hash table via alloc_large_system_hash
stackdepot hash table entries: 1048576 (order: 10, 4194304 bytes, linear)
==================================================================
BUG: KASAN: stack-out-of-bounds in __kernel_poison_pages+0x6c/0xd0
Write of size 4896 at addr c17a7000 by task swapper/0

CPU: 0 PID: 0 Comm: swapper Tainted: G T 6.5.0-rc7-PMacG4-dirty #7
Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
Call Trace:
[c1717ce0] [c0f4ec40] dump_stack_lvl+0x60/0xa4 (unreliable)
[c1717d00] [c0368380] print_report+0x154/0x548
[c1717d50] [c036813c] kasan_report+0xd0/0x160
[c1717db0] [c0369bb4] kasan_check_range+0x1c8/0x308
[c1717dc0] [c036ae88] memset+0x34/0x90
[c1717de0] [c035b6e0] __kernel_poison_pages+0x6c/0xd0
[c1717e00] [c03355e4] __free_pages_ok+0x418/0x500
[c1717e60] [c14372c8] memblock_free_all+0x268/0x400
[c1717f20] [c14103fc] mem_init+0x8c/0x274
[c1717f60] [c1431cd0] mm_core_init+0x240/0x4e0
[c1717fc0] [c1404694] start_kernel+0x150/0x2d8
[c1717f00] [000035d0] 0x35d0

The buggy address belongs to the physical page:
page:(ptrval) refcount:0 mapcount:0 mapping:00000000 index:0x0 pfn:0x17a7
flags: 0x0(zone=0)
page_type: 0xffffffff()
raw: 00000000 eee15380 eee15380 00000000 00000000 00000000 ffffffff 00000000
raw: 00000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
c17a7d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c17a7d80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c17a7e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1
                                                    ^
c17a7e80: f1 f1 04 f2 04 f2 00 f3 f3 f3 00 00 00 00 00 00
c17a7f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
Disabling lock debugging due to kernel taint
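
For anyone reading the shadow dump in the report, here is a minimal sketch of how its rows can be decoded. It assumes generic KASAN with an 8-byte granule and 16 shadow bytes per row (so each row covers 128 bytes of memory, matching the 0x80 step between row addresses). The helper names are made up for illustration and are not kernel functions; only the shadow values that occur in this report are listed.

```python
# Sketch: decode "Memory state around the buggy address" rows of a generic
# KASAN report. Assumption: 8 bytes of memory per shadow byte.
GRANULE = 8

# Shadow values seen in this report (generic KASAN stack redzones).
MEANING = {
    0xF1: "stack left redzone",
    0xF2: "stack mid redzone",
    0xF3: "stack right redzone",
}


def parse_row(line):
    """Split 'c17a7e00: 00 00 ... f1 f1' into (row address, shadow bytes)."""
    addr_text, bytes_text = line.split(":", 1)
    return int(addr_text, 16), [int(b, 16) for b in bytes_text.split()]


def describe(shadow):
    """Return a human-readable meaning for one shadow byte."""
    if shadow == 0:
        return "all 8 bytes accessible"
    if shadow < GRANULE:
        return f"first {shadow} bytes accessible"
    return MEANING.get(shadow, "other poison value")


def first_nonzero_granule(row_addr, shadow_bytes):
    """Address of the first granule in the row whose shadow byte is not 0."""
    for index, shadow in enumerate(shadow_bytes):
        if shadow != 0:
            return row_addr + index * GRANULE
    return None


row_addr, shadow = parse_row(
    "c17a7e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1"
)
bad = first_nonzero_granule(row_addr, shadow)

# The reported write starts at c17a7000 and is 4896 bytes long.
write_start, write_size = 0xC17A7000, 4896
inside_write = write_start <= bad < write_start + write_size

print(hex(bad), describe(shadow[14]), inside_write)
```

Under these assumptions the first poisoned granule in the dump is at c17a7e70, marked as a stack left redzone, and it lies inside the reported write range c17a7000 + 4896.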
> I'd also be curious to know what happens when CONFIG_DEBUG_SPINLOCK is disabled.
Disabling CONFIG_DEBUG_SPINLOCK does not change the output above. ^^
> Another question which I'm not sure I asked already: Is it a new problem you have got with recent kernels, or is it just that you never tried such a config with older kernels?
I wanted to revisit https://bugzilla.kernel.org/show_bug.cgi?id=216041 and verify whether it was resolved. KASAN worked on my G4 around 2019-2021; I reported some bugs with it around that time and you fixed some of them. ;) For example kernel bugzilla #205099, #216190 and #205885. But it always seemed flaky on the G4 and had its problems. So I can't tell whether this specific issue was there back then or if it's new. At least bug #216190 was also about KASAN and SMP issues.
> Also, when you say you need to start with another SMP kernel first and then you don't have the problem anymore until the next cold reboot, do you mean you have some old kernel with KASAN that works, or is it a kernel without KASAN that you have to start first?
I first start a non-KASAN SMP kernel and afterwards reboot into a KASAN kernel. But now, with kasan_init() commented out at the start of setup_arch() in arch/powerpc/kernel/setup-common.c, this does not work anymore. I get the dmesg above every time, on both cold and warm boots.

Regards,
Erhard