[Bug 206669] Little-endian kernel crashing on POWER8 on heavy big-endian... | linuxppc-dev

[Bug 206669] Little-endian kernel crashing on POWER8 on heavy big-endian PowerKVM load

From: <hidden>
Date: 2020-02-26 09:31:21

https://bugzilla.kernel.org/show_bug.cgi?id=206669

--- Comment #3 from npiggin@gmail.com ---

bugzilla-daemon@bugzilla.kernel.org wrote on February 26, 2020 5:26 pm:

quoted

https://bugzilla.kernel.org/show_bug.cgi?id=206669

--- Comment #2 from John Paul Adrian Glaubitz (glaubitz@physik.fu-berlin.de) ---
(In reply to npiggin from comment #1)

quoted

Thanks for the report, we need to get more data about the first BUG if 
we can. What function in your vmlinux contains address 
0xc00000000017a778? (use nm or objdump etc)

Seems to be select_task_rq_fair (symbol type t):

root@watson:/boot# nm vmlinux-5.4.0-0.bpo.3-powerpc64le |grep -C5 c00000000017a
c000000000448550 T select_estimate_accuracy
c000000000170d20 t select_fallback_rq
c000000000e4c940 D select_idle_mask
c000000000179f10 t select_idle_sibling
c00000000018fd80 t select_task_rq_dl
c00000000017a640 t select_task_rq_fair
c000000000177f50 t select_task_rq_idle
c00000000018c9e0 t select_task_rq_rt
c00000000019c800 t select_task_rq_stop
c000000000927710 t selem_alloc.isra.6
c000000000926e50 t selem_link_map
root@watson:/boot#

quoted

Is that the first message you get? No warnings or anything else earlier
in the dmesg?

Correct. You can see the login prompt of the host VM watson directly after
booting up.

quoted

Also 0xc0000000002659a0 would be interesting.

Looks like that's ring_buffer_record_off:

root@watson:/boot# nm vmlinux-5.4.0-0.bpo.3-powerpc64le |grep -C5 c0000000002659
c0000000002667e0 T ring_buffer_read_finish
c00000000026b4b0 T ring_buffer_read_page
c000000000265e10 T ring_buffer_read_prepare
c000000000265ef0 T ring_buffer_read_prepare_sync
c000000000269ae0 T ring_buffer_read_start
c000000000265950 T ring_buffer_record_disable
c000000000266070 T ring_buffer_record_disable_cpu
c000000000265970 T ring_buffer_record_enable
c0000000002660c0 T ring_buffer_record_enable_cpu
c00000000026d470 T ring_buffer_record_is_on
c00000000026d480 T ring_buffer_record_is_set_on
c000000000265990 T ring_buffer_record_off
c000000000265a10 T ring_buffer_record_on
c000000000266da0 T ring_buffer_reset
c000000000266a90 T ring_buffer_reset_cpu
c000000000267cd0 T ring_buffer_resize
c00000000026d400 T ring_buffer_set_clock
root@watson:/boot#

Thanks.
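
The two lookups above amount to finding the nearest symbol at or below a given address. A minimal sketch of that step in Python, assuming `nm` lines of the form `<hex address> <type> <name>`; the helper names `parse_nm` and `containing_symbol` are hypothetical, and the excerpt holds only a few of the symbols listed above, so the answer is only as good as the listing fed in:

```python
from bisect import bisect_right


def parse_nm(text):
    """Parse nm output lines of the form '<hex addr> <type> <name>'."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip shell prompts, blank lines, undefined symbols
        addr, kind, name = parts
        try:
            symbols.append((int(addr, 16), kind, name))
        except ValueError:
            continue  # first field was not a hex address
    symbols.sort()
    return symbols


def containing_symbol(symbols, address):
    """Return (name, offset) for the nearest symbol at or below address."""
    addrs = [a for a, _, _ in symbols]
    i = bisect_right(addrs, address) - 1
    if i < 0:
        return None
    base, _, name = symbols[i]
    return name, address - base


# A subset of the nm lines quoted in this thread.
NM_EXCERPT = """\
c000000000179f10 t select_idle_sibling
c00000000017a640 t select_task_rq_fair
c000000000177f50 t select_task_rq_idle
c000000000265950 T ring_buffer_record_disable
c000000000265970 T ring_buffer_record_enable
c000000000265990 T ring_buffer_record_off
c000000000265a10 T ring_buffer_record_on
"""

symbols = parse_nm(NM_EXCERPT)
for addr in (0xC00000000017A778, 0xC0000000002659A0):
    name, off = containing_symbol(symbols, addr)
    print(f"{addr:#x} -> {name}+{off:#x}")
```

With a complete `nm` listing as input this avoids guessing a grep prefix, because the sort and bisect find the enclosing symbol directly.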

Okay, it looks like what's happening here is that something crashes in
select_task_rq_fair (kernel data access fault). It's then able to
print out those first two lines, but then it calls die(), which
ends up calling oops_enter(), which calls tracing_off(), which calls
tracer_tracing_off() and crashes there. That goes around the same
cycle, only ever printing out the first two lines.
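
To make the suspected cycle concrete, here is a toy model in Python. It is not kernel code: the function names mirror the ones mentioned above, but the bodies, the log strings, and the `max_faults` cutoff are all hypothetical and exist only to show why the full report would never appear if the tracing-off path itself faults.

```python
class DataAccessFault(Exception):
    """Stands in for a kernel data access fault in this toy model."""


def run_oops_cycle(max_faults=3):
    log = []

    def tracing_off():
        # Assumption taken from the thread: this path faults as well.
        raise DataAccessFault("tracer_tracing_off")

    def die():
        # Models die() -> oops_enter() -> tracing_off() running first.
        tracing_off()
        log.append("full oops report")  # never reached in this model

    def handle_fault(where, depth):
        # The two header lines are printed before die() is called.
        log.append(f"header line 1 (fault in {where})")
        log.append(f"header line 2 (fault in {where})")
        if depth >= max_faults:
            log.append("model stops here; the real cycle would repeat")
            return
        try:
            die()
        except DataAccessFault as fault:
            handle_fault(str(fault), depth + 1)

    handle_fault("select_task_rq_fair", 1)
    return log


for line in run_oops_cycle():
    print(line)
```

Every pass adds only the two header lines, which matches the console output described in the report.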

Nothing obvious as to why those accesses in particular are crashing.
The first data address is 0xc000000002bfd038, the second is
0xc0000007f9070c08. Not vmalloc space, not above the 1TB segment.
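
The 1TB part of that check is plain arithmetic on the two addresses. A small sketch, assuming the ppc64 kernel linear mapping starts at 0xc000000000000000 and reading "above the 1TB segment" as an offset of 1 TB or more into that mapping (this reading is an interpretation of the remark above). The vmalloc comparison is left out because that boundary depends on the kernel version and MMU mode:

```python
PAGE_OFFSET = 0xC000000000000000  # assumed start of the kernel linear mapping
ONE_TB = 1 << 40


def linear_offset(addr):
    """Offset of addr from the start of the linear mapping."""
    return addr - PAGE_OFFSET


def below_first_tb(addr):
    """True if addr lies within the first 1 TB of the linear mapping."""
    off = linear_offset(addr)
    return 0 <= off < ONE_TB


for addr in (0xC000000002BFD038, 0xC0000007F9070C08):
    off = linear_offset(addr)
    print(f"{addr:#018x}: offset {off:#x}, below 1 TB: {below_first_tb(addr)}")
```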

Do you have tracing / ftrace enabled in the host kernel for any
reason? Turning that off might let the oops message get printed.
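
One way to check the tracing state on the host is to read the tracefs control files. A minimal sketch, assuming tracefs is mounted at /sys/kernel/tracing (on some systems it is under /sys/kernel/debug/tracing instead); the helper names are hypothetical, and the "active" heuristic is only a rough guess, since trace events can be enabled even with the nop tracer:

```python
from pathlib import Path


def tracing_state(tracefs="/sys/kernel/tracing"):
    """Read tracing_on and current_tracer; None where a file is unreadable."""
    root = Path(tracefs)
    state = {}
    for name in ("tracing_on", "current_tracer"):
        try:
            state[name] = (root / name).read_text().strip()
        except OSError:
            state[name] = None  # not mounted, missing, or no permission
    return state


def tracing_looks_active(state):
    """Rough heuristic: ring buffer on and a tracer other than nop selected."""
    return (
        state.get("tracing_on") == "1"
        and state.get("current_tracer") not in (None, "nop")
    )


state = tracing_state()
print(state, "active:", tracing_looks_active(state))
```

Reading these files usually needs root. Writing 0 to tracing_on stops recording into the ring buffer.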

FWIW, the kernel image comes from this Debian package:

quoted


http://snapshot.debian.org/archive/debian/20200211T210433Z/pool/main/l/linux/linux-image-5.4.0-0.bpo.3-powerpc64le_5.4.13-1%7Ebpo10%2B1_ppc64el.deb

Okay. Any chance you could test an upstream kernel?

quoted

When reproducing, do you ever get a clean trace of the first bug?

I have logged everything that showed up on the console during and after
the crash. After that, the machine no longer responds and has to be
hard reset.

quoted

Could you try setting /proc/sys/kernel/panic_on_oops and reproducing?

I will try that.

Don't bother testing that after the above -- panic_on_oops happens
after oops_begin(), so it won't help unfortunately.

Attempting to get into xmon might though, if you boot with xmon=on.
Try that if tracing wasn't enabled, or disabling it doesn't help.
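
Before reproducing, it may be worth confirming that the parameter actually reached the running kernel. A small sketch that looks up a key on a kernel command line string; `boot_param` is a hypothetical helper, the sample command line is made up, and quoted values containing spaces are not handled:

```python
def boot_param(cmdline, key):
    """Return the value for key, '' for a bare key, or None if absent.

    If the key appears more than once, the last occurrence is returned.
    """
    value = None
    for token in cmdline.split():
        name, sep, val = token.partition("=")
        if name == key:
            value = val if sep else ""
    return value


# Hypothetical command line, for illustration only.
sample = "root=/dev/sda2 ro xmon=on"
print("xmon =", boot_param(sample, "xmon"))
```

On a running system the input would come from /proc/cmdline, for example `boot_param(open("/proc/cmdline").read(), "xmon")`.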

Anything to be considered for the kernel running inside the big-endian VM?

Not that I'm aware of really. Certainly it shouldn't be able to crash
the host even if the guest was doing something stupid.

Thanks,
Nick

