Re: [Regression] kdump fails to get DHCP address unless booting with pci=nomsi or without nr_cpus=1
From: Coiby Xu <hidden>
Date: 2025-08-27 08:36:10
Also in:
kexec, linux-pci
On Sat, Aug 23, 2025 at 11:00:11AM +0800, Coiby Xu wrote:
Hi Marc, If I understand correctly, you want to reproduce the issue yourself. I finally managed to reproduce this issue by experimenting with the setup shared by my colleague. Here are the five prerequisites to reproduce the bug:
Hi Marc, It turns out the host kernel and host machine are not absolute prerequisites to reproduce the problem. They still matter, because they can make the problem much more difficult to reproduce. I also did a bisection against QEMU to find out which commit makes the issue go away. For details, please see the inline comments below.
1. Guest kernel
Newer than commit b5712bf89b4b
("irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]")
2. Host kernel
Relatively older ones like v6.10.0. Newer ones like v6.12.0 and
v6.17.0 don't have this issue.

It turns out that, with the other conditions met, the latest host kernel (6.17.0-0.rc3) can still reproduce the issue, but it is much more difficult to do so. For example, with RHEL8 kernel 4.18.0-372.9.1.el8.aarch64, I need to trigger a kernel crash at most 3 times to reproduce it. With Fedora rawhide kernel 6.17.0-0.rc3.31.fc43.aarch64, in 3 of 10 trials I couldn't reproduce the issue even after triggering 60 consecutive kernel crashes. For comparison, here is the number of kernel crashes needed to reproduce the issue in each of 10 trials:

RHEL8:          2 1 1 1 1 1 2 1 3 2
Fedora rawhide: 43 60 47 60 12 56 60 45 49 18
3. QEMU <= v6.2
I did a bisection and it shows the issue is gone with QEMU commit
f39b7d2b96e3e73c01bb678cd096f7baf0b9ab39 ("kvm: Atomic memslot updates")
which is the last (3rd) patch of the patch set "KVM: allow listener to stop all
vcpus before"
https://lists.nongnu.org/archive/html/qemu-devel/2022-11/msg02172.html
Note that this commit only appears in QEMU > 7.2, so QEMU <= v7.2.0 can also
reproduce this issue.
4. Specific host machines
I'm not familiar with the hardware, so I haven't yet figured out which hardware factor makes the issue reproducible. I've attached dmidecode outputs of four machines (files inside the dmidecode_host folder). Two systems (dmidecode_not_work*) can reproduce this issue and the other two (dmidecode_work*) can't, even though all four have the same product name R152-P31-00, CPU model ARMv8 (M128-30) and SKU 01234567890123456789AB. One difference that doesn't seem to be found in the dmidecode output: according to our internal web pages that show the hardware info, the two machines that can't reproduce the issue have the model name "PnP device PNP0c02", whereas the problematic machines have "R152-P31-00 (01234567890123456789AB)".
It turns out all four machines can reproduce the issue. I ran 10 trials on each kind of machine and counted the number of kernel crashes needed to reproduce the issue in each trial. Here's a comparison:

R152-P31-00:        2 1 1 1 1 1 2 1 3 2
PnP device PNP0c02: 8 3 5 15 11 18 2 5 12 4
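As a rough way to compare the trial counts listed in this message, here is a minimal Python sketch. It is not part of the original test setup. The `summarize` helper and the `GIVE_UP` cutoff are hypothetical names, and treating a count of 60 in the Fedora rawhide list as "not reproduced" is an assumption based on the 3 of 10 trials reported above as failing after 60 crashes.

```python
from statistics import median

# Crashes needed per trial, copied from the lists in this message.
TRIALS = {
    "RHEL8 host kernel, R152-P31-00": [2, 1, 1, 1, 1, 1, 2, 1, 3, 2],
    "Fedora rawhide host kernel": [43, 60, 47, 60, 12, 56, 60, 45, 49, 18],
    "PnP device PNP0c02": [8, 3, 5, 15, 11, 18, 2, 5, 12, 4],
}

# Assumption: a count equal to the give-up limit means "not reproduced".
GIVE_UP = {"Fedora rawhide host kernel": 60}


def summarize(counts, give_up=None):
    """Summarize one set of trials; give_up is the hypothetical cutoff."""
    hits = [c for c in counts if give_up is None or c < give_up]
    return {
        "reproduced": len(hits),
        "trials": len(counts),
        "median": median(hits) if hits else None,
        "max": max(hits) if hits else None,
    }


if __name__ == "__main__":
    for name, counts in TRIALS.items():
        s = summarize(counts, GIVE_UP.get(name))
        print(f"{name}: {s['reproduced']}/{s['trials']} reproduced, "
              f"median {s['median']}, max {s['max']} crashes")
```

The median is used rather than the mean because the abandoned trials would otherwise skew the Fedora rawhide figure.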
5. The guest needs to be bridged to a physical host interface. Bridging the guest to a tun interface can't reproduce the issue (for example, the default bridge (virbr0) created by libvirtd uses a tun interface).
I tried triggering 100 consecutive kernel crashes with virbr0 in one trial but couldn't reproduce the issue. So I think bridging the guest to a physical network interface is still a must.

[...]

--
Best regards,
Coiby