Re: system hang on start-up (mlx5?)

From: Chuck Lever III <chuck.lever@oracle.com>
Date: 2023-05-30 21:49:19
Also in: linux-rdma

On May 30, 2023, at 3:46 PM, Thomas Gleixner [off-list ref] wrote:

Chuck!

On Tue, May 30 2023 at 13:09, Chuck Lever III wrote:

quoted

On May 29, 2023, at 5:20 PM, Thomas Gleixner [off-list ref] wrote:
But if you look at the address: 0xffffffffb9ef3f80

That one is bogus:

   managed_map=ffff9a36efcf0f80
   managed_map=ffff9a36efd30f80
   managed_map=ffff9a3aefc30f80
   managed_map=ffff9a3aefc70f80
   managed_map=ffff9a3aefd30f80
   managed_map=ffff9a3aefd70f80
   managed_map=ffffffffb9ef3f80

Can you spot the fail?

The first six are in the direct map and the last one is in module map,
which makes no sense at all.

Indeed. The reason for that is that the affinity mask has bits
set for CPU IDs that are not present on my system.

Which I don't buy, but even if so then this should not make
for_each_cpu() go south. See below.

quoted

After bbac70c74183 ("net/mlx5: Use newer affinity descriptor")
that mask is set up like this:

struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev)
{
       struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev);
-       cpumask_var_t req_mask;
+       struct irq_affinity_desc af_desc;

That's daft. With NR_CPUS=8192 this is a whopping 1KB on stack...

quoted

       struct mlx5_irq *irq;
-       if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL))
-               return ERR_PTR(-ENOMEM);
-       cpumask_copy(req_mask, cpu_online_mask);
+       cpumask_copy(&af_desc.mask, cpu_online_mask);
+       af_desc.is_managed = false;

Which normally works as you would expect. But for some historical
reason, I have CONFIG_NR_CPUS=32 on my system, and the
cpumask_copy() misbehaves.

If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
copy, this crash goes away. But mlx5_core crashes during a later
part of its init, in cpu_rmap_update(). cpu_rmap_update() does
exactly the same thing (for_each_cpu() on an affinity mask created
by copying), and crashes in a very similar fashion.

If I set CONFIG_NR_CPUS to a larger value, like 512, the problem
vanishes entirely, and "modprobe mlx5_core" works as expected.

Thus I think the problem is with cpumask_copy() or for_each_cpu()
when NR_CPUS is a small value (the default is 8192).

I don't buy any of this. Here is why:

cpumask_copy(d, s)
  bitmap_copy(d, s, nbits = 32)
    len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);

So it copies as many longs as required to cover nbits, i.e. it copies
any clobbered bits beyond nbits too. While that looks odd at the first
glance, that's just an optimization which is harmless.

for_each_cpu() finds the next set bit in a mask and breaks the loop once
bitnr >= small_cpumask_bits, which is nr_cpu_ids and should be 32 too.

I just booted a kernel with NR_CPUS=32:

My system has only 12 CPUs. So every bit in your mask represents
a present CPU, but on my system, only 0x00000fff are ever present.

Therefore, on my system, any bit higher than bit 11 in a CPU mask
will reference a CPU that is not present.

[    0.152988] smpboot: 56 Processors exceeds NR_CPUS limit of 32
[    0.153606] smpboot: Allowing 32 CPUs, 0 hotplug CPUs
...
[    0.173854] setup_percpu: NR_CPUS:32 nr_cpumask_bits:32 nr_cpu_ids:32 nr_node_ids:1

and added a function which does:

   struct cpumask ma, mb;
   int cpu;

   memset(&ma, 0xaa, sizeof(ma);
   cpumask_copy(&mb, &ma);
   pr_info("MASKBITS: %016lx\n", mb.bits[0]);
   pr_info("CPUs:");
   for_each_cpu(cpu, &mb)
        pr_cont(" %d", cpu);
   pr_cont("\n");

[    2.165606] smp: MASKBITS: 0xaaaaaaaaaaaaaaaa
[    2.166574] smp: CPUs: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31

and the same exercise with a 0x55 filled source bitmap.

[    2.167595] smp: MASKBITS: 0x5555555555555555
[    2.168568] smp: CPUs: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

So while cpumask_copy copied beyond NR_CPUs bits, for_each_cpu() does
the right thing simple because of this:

nr_cpu_ids is 32, right?

for_each_cpu(cpu, mask)
  for_each_set_bit(bit = cpu, addr = &mask, size = nr_cpu_ids)
for ((bit) = 0; (bit) = find_next_bit((addr), (size), (bit)), (bit) < (size); (bit)++)

So if find_next_bit() returns a bit after bit 31 the condition (bit) <
(size) will terminate the loop, right?

Again, you are assuming more CPUs than there are bits in the mask.

Also in the case of that driver the copy is _NOT_ copying set bits
beyond bit 31 simply because the source mask is cpu_online_mask which
definitely does not have a bit set which is greater than 31. As the copy
copies longs the resulting mask in af_desc.mask cannot have any bit set
past bit 31 either.

If the above is not correct, then there is a bigger problem than that
MLX5 driver crashing.

So this:

quoted

If I correct mlx5_ctrl_irq_request() to clear @af_desc before the
copy, this crash goes away.

does not make any sense to me.

Can you please add after the cpumask_copy() in that mlx5 code:

   pr_info("ONLINEBITS: %016lx\n", cpu_online_mask->bits[0]);
   pr_info("MASKBITS:   %016lx\n", af_desc.mask.bits[0]);

Both are 0000 0000 0000 0fff, as expected on a system
where 12 CPUs are present.

Please print also in irq_matrix_reserve_managed():

 - @mask->bits[0]
 - nr_cpu_ids
 - the CPU numbers inside the for_each_cpu() loop

Here's where it gets interesting:

+       pr_info("%s: MASKBITS:   %016lx\n", __func__, msk->bits[0]);
+       pr_info("%s: nr_cpu_ids=%u\n", __func__, nr_cpu_ids);

[   70.957400][ T1185] mlx5_core 0000:81:00.0: firmware version: 16.35.2000
[   70.964146][ T1185] mlx5_core 0000:81:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
[   71.260555][    C9] port_module: 1 callbacks suppressed
[   71.260561][    C9] mlx5_core 0000:81:00.0: Port module event: module 0, Cable plugged
[   71.273798][ T1185] irq_matrix_reserve_managed: MASKBITS:   ffffb1a74686bcd8
[   71.274096][   T10] mlx5_core 0000:81:00.0: mlx5_pcie_event:301:(pid 10): PCIe slot advertised sufficient power (27W).
[   71.280844][ T1185] irq_matrix_reserve_managed: nr_cpu_ids=12
[   71.280846][ T1185] irq_matrix_reserve_managed: cm=ffff909aefcf0f48 cm->managed_map=ffff909aefcf0f80 cpu=3
[   71.280849][ T1185] irq_matrix_reserve_managed: cm=ffff909aefd30f48 cm->managed_map=ffff909aefd30f80 cpu=4
[   71.280851][ T1185] irq_matrix_reserve_managed: cm=ffff909eefc30f48 cm->managed_map=ffff909eefc30f80 cpu=6
[   71.280852][ T1185] irq_matrix_reserve_managed: cm=ffff909eefc70f48 cm->managed_map=ffff909eefc70f80 cpu=7
[   71.280854][ T1185] irq_matrix_reserve_managed: cm=ffff909eefd30f48 cm->managed_map=ffff909eefd30f80 cpu=10
[   71.280856][ T1185] irq_matrix_reserve_managed: cm=ffff909eefd70f48 cm->managed_map=ffff909eefd70f80 cpu=11
[   71.280858][ T1185] irq_matrix_reserve_managed: cm=ffffffff98ef3f48 cm->managed_map=ffffffff98ef3f80 cpu=12

Notice that there are in fact higher bits set than bit 11.

The lowest 16 bits of MASKBITS are bcd8, or in binary:

... 1011 1100 1101 1000

Starting from the low-order side: bits 3, 4, 6, 7, 10, 11, and
12, matching the CPU IDs from the loop. At bit 12, we fault,
since there is no CPU 12 on the system.


--
Chuck Lever

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help