[Bug 206669] Little-endian kernel crashing on POWER8 on heavy big-endian PowerKVM load
From: <hidden>
Date: 2020-02-26 09:31:21
https://bugzilla.kernel.org/show_bug.cgi?id=206669
--- Comment #3 from npiggin@gmail.com ---
bugzilla-daemon@bugzilla.kernel.org's on February 26, 2020 5:26 pm:
> https://bugzilla.kernel.org/show_bug.cgi?id=206669
>
> --- Comment #2 from John Paul Adrian Glaubitz (glaubitz@physik.fu-berlin.de) ---
> (In reply to npiggin from comment #1)
>> Thanks for the report, we need to get more data about the first BUG if we can. What function in your vmlinux contains address 0xc00000000017a778? (use nm or objdump etc)
>
> Seems to be t select_task_rq_fair:
>
> root@watson:/boot# nm vmlinux-5.4.0-0.bpo.3-powerpc64le |grep -C5 c00000000017a
> c000000000448550 T select_estimate_accuracy
> c000000000170d20 t select_fallback_rq
> c000000000e4c940 D select_idle_mask
> c000000000179f10 t select_idle_sibling
> c00000000018fd80 t select_task_rq_dl
> c00000000017a640 t select_task_rq_fair
> c000000000177f50 t select_task_rq_idle
> c00000000018c9e0 t select_task_rq_rt
> c00000000019c800 t select_task_rq_stop
> c000000000927710 t selem_alloc.isra.6
> c000000000926e50 t selem_link_map
> root@watson:/boot#
>
>> Is that the first message you get, No warnings or anything else earlier in the dmesg?
>
> Correct. You can see the login prompt of the host VM watson directly after booting up.
>
>> Also 0xc0000000002659a0 would be interesting.
>
> Looks like that's ring_buffer_record_off:
>
> root@watson:/boot# nm vmlinux-5.4.0-0.bpo.3-powerpc64le |grep -C5 c0000000002659
> c0000000002667e0 T ring_buffer_read_finish
> c00000000026b4b0 T ring_buffer_read_page
> c000000000265e10 T ring_buffer_read_prepare
> c000000000265ef0 T ring_buffer_read_prepare_sync
> c000000000269ae0 T ring_buffer_read_start
> c000000000265950 T ring_buffer_record_disable
> c000000000266070 T ring_buffer_record_disable_cpu
> c000000000265970 T ring_buffer_record_enable
> c0000000002660c0 T ring_buffer_record_enable_cpu
> c00000000026d470 T ring_buffer_record_is_on
> c00000000026d480 T ring_buffer_record_is_set_on
> c000000000265990 T ring_buffer_record_off
> c000000000265a10 T ring_buffer_record_on
> c000000000266da0 T ring_buffer_reset
> c000000000266a90 T ring_buffer_reset_cpu
> c000000000267cd0 T ring_buffer_resize
> c00000000026d400 T ring_buffer_set_clock
> root@watson:/boot#
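For reference, the two lookups above can be checked mechanically rather than by eye. The following is only a sketch: it takes a handful of symbol lines copied from the nm output quoted above, sorts them by address, and picks the closest symbol at or below the faulting address. The helper names (`parse_nm`, `resolve`, `describe`) are made up for illustration, and the result is only as good as the symbol list it is fed, so a real check should use the full `nm` output of the vmlinux.

```python
import bisect

# A subset of the symbol lines quoted above (address, type, name).
NM_OUTPUT = """\
c000000000170d20 t select_fallback_rq
c000000000179f10 t select_idle_sibling
c00000000017a640 t select_task_rq_fair
c000000000177f50 t select_task_rq_idle
c00000000018c9e0 t select_task_rq_rt
c000000000265950 T ring_buffer_record_disable
c000000000265970 T ring_buffer_record_enable
c000000000265990 T ring_buffer_record_off
c000000000265a10 T ring_buffer_record_on
c000000000265e10 T ring_buffer_read_prepare
"""


def parse_nm(text):
    """Return a list of (address, name) for text symbols, sorted by address."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        # Skip undefined symbols (two fields) and anything that is not code.
        if len(parts) != 3 or parts[1] not in ("t", "T"):
            continue
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return (name, offset) of the closest symbol at or below address."""
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return None
    start, name = symbols[index]
    return name, address - start


def describe(symbols, address):
    """Format a lookup result in the usual symbol+offset style."""
    hit = resolve(symbols, address)
    if hit is None:
        return "unknown"
    name, offset = hit
    return f"{name}+{offset:#x}"


if __name__ == "__main__":
    table = parse_nm(NM_OUTPUT)
    for addr in (0xC00000000017A778, 0xC0000000002659A0):
        print(f"{addr:#x} -> {describe(table, addr)}")
```

With the full symbol table, `nm -n vmlinux` (sorted by address) gives the same information directly.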
Thanks.

Okay it looks like what's happening here is something crashes in select_task_rq_fair (kernel data access fault). It's then able to print out those first two lines but then it calls die(), which ends up calling oops_enter() which calls tracing_off(), which calls tracer_tracing_off and crashes there, which goes around the same cycle only printing out the first two lines.

Nothing obvious as to why those accesses in particular are crashing. The first data address is 0xc000000002bfd038, the second is 0xc0000007f9070c08. Not vmalloc space, not above the 1TB segment.

Do you have tracing / ftrace enabled in the host kernel for any reason? Turning that off might let the oops message get printed.
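The remark about the two data addresses can be illustrated with a little arithmetic. This is a sketch under stated assumptions: it uses the ppc64 kernel linear-map base 0xc000000000000000 and reads "not above the 1TB segment" as "offset from that base is below 1 TiB", which is an interpretation of the comment rather than something spelled out in it. The vmalloc check is not modelled, since the vmalloc base depends on kernel version and MMU mode. The function names are hypothetical.

```python
PAGE_OFFSET = 0xC000000000000000  # base of the ppc64 kernel linear mapping
ONE_TIB = 1 << 40                 # assumed meaning of "the 1TB segment"

FAULT_ADDRESSES = (0xC000000002BFD038, 0xC0000007F9070C08)


def linear_offset(address):
    """Offset of a kernel address from the start of the linear mapping."""
    return address - PAGE_OFFSET


def below_one_tib(address):
    """True if the address lies within the first 1 TiB of the linear map."""
    return 0 <= linear_offset(address) < ONE_TIB


if __name__ == "__main__":
    for addr in FAULT_ADDRESSES:
        offset = linear_offset(addr)
        print(f"{addr:#x}: offset {offset:#x} "
              f"({offset / 2**30:.2f} GiB), below 1 TiB: {below_one_tib(addr)}")
```

Both addresses come out as ordinary linear-map addresses well inside the first 1 TiB, which is consistent with there being nothing obviously unusual about them.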
> FWIW, the kernel image comes from this Debian package:
Okay. Any chance you could test an upstream kernel?
>> When reproducing, do you ever get a clean trace of the first bug?
>
> I have logged everything that showed in the console during and after the crash. After that, the machine no longer responds and has to be hard-reset.
>
>> Could you try setting /proc/sys/kernel/panic_on_oops and reproducing?
>
> I will try that.
Don't bother testing that after the above -- panic_on_oops happens after oops_begin(), so it won't help unfortunately. Attempting to get into xmon might though, if you boot with xmon=on. Try that if tracing wasn't enabled, or disabling it doesn't help.
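To see where the host currently stands on the knobs mentioned above (tracing, panic_on_oops, xmon on the command line), something like the following could be run as root before reproducing. It is a sketch only: it assumes tracefs is mounted at /sys/kernel/tracing (on some systems it is under /sys/kernel/debug/tracing instead), and the `root` parameter and function names are invented here so the paths can be redirected for testing. Files that are missing or unreadable are reported as None.

```python
from pathlib import Path


def read_text(path):
    """Return the stripped contents of a file, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def debug_state(root="/"):
    """Collect the settings discussed above from procfs, sysfs and tracefs."""
    root = Path(root)
    cmdline = read_text(root / "proc/cmdline") or ""
    # xmon can appear as a bare "xmon" or as "xmon=<mode>" on the command line.
    xmon_args = [arg for arg in cmdline.split()
                 if arg == "xmon" or arg.startswith("xmon=")]
    return {
        "tracing_on": read_text(root / "sys/kernel/tracing/tracing_on"),
        "current_tracer": read_text(root / "sys/kernel/tracing/current_tracer"),
        "panic_on_oops": read_text(root / "proc/sys/kernel/panic_on_oops"),
        "xmon_args": xmon_args,
    }


if __name__ == "__main__":
    for key, value in debug_state().items():
        print(f"{key}: {value}")
```

A `current_tracer` of `nop` together with `tracing_on` reading `0` would suggest tracing is effectively off; anything else is worth mentioning in the bug.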
> Anything to be considered for the kernel running inside the big-endian VM?
Not that I'm aware of really. Certainly it shouldn't be able to crash the host even if the guest was doing something stupid.

Thanks,
Nick

--
You are receiving this mail because:
You are watching the assignee of the bug.