Re: system hang on start-up (mlx5?)
From: Chuck Lever III <chuck.lever@oracle.com>
Date: 2023-05-30 21:49:19
Also in:
linux-rdma
On May 30, 2023, at 3:46 PM, Thomas Gleixner [off-list ref] wrote: Chuck! On Tue, May 30 2023 at 13:09, Chuck Lever III wrote:quoted
quoted
On May 29, 2023, at 5:20 PM, Thomas Gleixner [off-list ref] wrote: But if you look at the address: 0xffffffffb9ef3f80 That one is bogus: managed_map=ffff9a36efcf0f80 managed_map=ffff9a36efd30f80 managed_map=ffff9a3aefc30f80 managed_map=ffff9a3aefc70f80 managed_map=ffff9a3aefd30f80 managed_map=ffff9a3aefd70f80 managed_map=ffffffffb9ef3f80 Can you spot the fail? The first six are in the direct map and the last one is in module map, which makes no sense at all.Indeed. The reason for that is that the affinity mask has bits set for CPU IDs that are not present on my system.Which I don't buy, but even if so then this should not make for_each_cpu() go south. See below.quoted
After bbac70c74183 ("net/mlx5: Use newer affinity descriptor") that mask is set up like this: struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev) { struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev); - cpumask_var_t req_mask; + struct irq_affinity_desc af_desc;That's daft. With NR_CPUS=8192 this is a whopping 1KB on stack...quoted
struct mlx5_irq *irq; - if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL)) - return ERR_PTR(-ENOMEM); - cpumask_copy(req_mask, cpu_online_mask); + cpumask_copy(&af_desc.mask, cpu_online_mask); + af_desc.is_managed = false; Which normally works as you would expect. But for some historical reason, I have CONFIG_NR_CPUS=32 on my system, and the cpumask_copy() misbehaves. If I correct mlx5_ctrl_irq_request() to clear @af_desc before the copy, this crash goes away. But mlx5_core crashes during a later part of its init, in cpu_rmap_update(). cpu_rmap_update() does exactly the same thing (for_each_cpu() on an affinity mask created by copying), and crashes in a very similar fashion. If I set CONFIG_NR_CPUS to a larger value, like 512, the problem vanishes entirely, and "modprobe mlx5_core" works as expected. Thus I think the problem is with cpumask_copy() or for_each_cpu() when NR_CPUS is a small value (the default is 8192).I don't buy any of this. Here is why: cpumask_copy(d, s) bitmap_copy(d, s, nbits = 32) len = BITS_TO_LONGS(nbits) * sizeof(unsigned long); So it copies as many longs as required to cover nbits, i.e. it copies any clobbered bits beyond nbits too. While that looks odd at the first glance, that's just an optimization which is harmless. for_each_cpu() finds the next set bit in a mask and breaks the loop once bitnr >= small_cpumask_bits, which is nr_cpu_ids and should be 32 too. I just booted a kernel with NR_CPUS=32:
My system has only 12 CPUs. So every bit in your mask represents a present CPU, but on my system, only 0x00000fff are ever present. Therefore, on my system, any bit higher than bit 11 in a CPU mask will reference a CPU that is not present.
[ 0.152988] smpboot: 56 Processors exceeds NR_CPUS limit of 32
[ 0.153606] smpboot: Allowing 32 CPUs, 0 hotplug CPUs
...
[ 0.173854] setup_percpu: NR_CPUS:32 nr_cpumask_bits:32 nr_cpu_ids:32 nr_node_ids:1
and added a function which does:
struct cpumask ma, mb;
int cpu;
memset(&ma, 0xaa, sizeof(ma);
cpumask_copy(&mb, &ma);
pr_info("MASKBITS: %016lx\n", mb.bits[0]);
pr_info("CPUs:");
for_each_cpu(cpu, &mb)
pr_cont(" %d", cpu);
pr_cont("\n");
[ 2.165606] smp: MASKBITS: 0xaaaaaaaaaaaaaaaa
[ 2.166574] smp: CPUs: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
and the same exercise with a 0x55 filled source bitmap.
[ 2.167595] smp: MASKBITS: 0x5555555555555555
[ 2.168568] smp: CPUs: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
So while cpumask_copy copied beyond NR_CPUs bits, for_each_cpu() does
the right thing simple because of this:
nr_cpu_ids is 32, right?
for_each_cpu(cpu, mask)
for_each_set_bit(bit = cpu, addr = &mask, size = nr_cpu_ids)
for ((bit) = 0; (bit) = find_next_bit((addr), (size), (bit)), (bit) < (size); (bit)++)
So if find_next_bit() returns a bit after bit 31 the condition (bit) <
(size) will terminate the loop, right?Again, you are assuming more CPUs than there are bits in the mask.
Also in the case of that driver the copy is _NOT_ copying set bits beyond bit 31 simply because the source mask is cpu_online_mask which definitely does not have a bit set which is greater than 31. As the copy copies longs the resulting mask in af_desc.mask cannot have any bit set past bit 31 either. If the above is not correct, then there is a bigger problem than that MLX5 driver crashing. So this:quoted
If I correct mlx5_ctrl_irq_request() to clear @af_desc before the copy, this crash goes away.does not make any sense to me. Can you please add after the cpumask_copy() in that mlx5 code: pr_info("ONLINEBITS: %016lx\n", cpu_online_mask->bits[0]); pr_info("MASKBITS: %016lx\n", af_desc.mask.bits[0]);
Both are 0000 0000 0000 0fff, as expected on a system where 12 CPUs are present.
Please print also in irq_matrix_reserve_managed(): - @mask->bits[0] - nr_cpu_ids - the CPU numbers inside the for_each_cpu() loop
Here's where it gets interesting:
+ pr_info("%s: MASKBITS: %016lx\n", __func__, msk->bits[0]);
+ pr_info("%s: nr_cpu_ids=%u\n", __func__, nr_cpu_ids);
[ 70.957400][ T1185] mlx5_core 0000:81:00.0: firmware version: 16.35.2000
[ 70.964146][ T1185] mlx5_core 0000:81:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[ 71.260555][ C9] port_module: 1 callbacks suppressed
[ 71.260561][ C9] mlx5_core 0000:81:00.0: Port module event: module 0, Cable plugged
[ 71.273798][ T1185] irq_matrix_reserve_managed: MASKBITS: ffffb1a74686bcd8
[ 71.274096][ T10] mlx5_core 0000:81:00.0: mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
[ 71.280844][ T1185] irq_matrix_reserve_managed: nr_cpu_ids=12
[ 71.280846][ T1185] irq_matrix_reserve_managed: cm=ffff909aefcf0f48 cm->managed_map=ffff909aefcf0f80 cpu=3
[ 71.280849][ T1185] irq_matrix_reserve_managed: cm=ffff909aefd30f48 cm->managed_map=ffff909aefd30f80 cpu=4
[ 71.280851][ T1185] irq_matrix_reserve_managed: cm=ffff909eefc30f48 cm->managed_map=ffff909eefc30f80 cpu=6
[ 71.280852][ T1185] irq_matrix_reserve_managed: cm=ffff909eefc70f48 cm->managed_map=ffff909eefc70f80 cpu=7
[ 71.280854][ T1185] irq_matrix_reserve_managed: cm=ffff909eefd30f48 cm->managed_map=ffff909eefd30f80 cpu=10
[ 71.280856][ T1185] irq_matrix_reserve_managed: cm=ffff909eefd70f48 cm->managed_map=ffff909eefd70f80 cpu=11
[ 71.280858][ T1185] irq_matrix_reserve_managed: cm=ffffffff98ef3f48 cm->managed_map=ffffffff98ef3f80 cpu=12
Notice that there are in fact higher bits set than bit 11.
The lowest 16 bits of MASKBITS are bcd8, or in binary:
... 1011 1100 1101 1000
Starting from the low-order side: bits 3, 4, 6, 7, 10, 11, and
12, matching the CPU IDs from the loop. At bit 12, we fault,
since there is no CPU 12 on the system.
--
Chuck Lever