Thread (33 messages, 8 authors, 2023-06-02)

Re: system hang on start-up (mlx5?)

From: Thomas Gleixner <hidden>
Date: 2023-05-30 22:18:05
Also in: linux-rdma

On Tue, May 30 2023 at 21:48, Chuck Lever III wrote:
> On May 30, 2023, at 3:46 PM, Thomas Gleixner wrote:
>> cpumask_copy(d, s)
>>   bitmap_copy(d, s, nbits = 32)
>>     len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);
>>
>> So it copies as many longs as required to cover nbits, i.e. it copies
>> any clobbered bits beyond nbits too. While that looks odd at first
>> glance, it's just a harmless optimization.
>>
>> for_each_cpu() finds the next set bit in a mask and breaks the loop once
>> bitnr >= small_cpumask_bits, which is nr_cpu_ids and should be 32 too.
>>
>> I just booted a kernel with NR_CPUS=32:
>
> My system has only 12 CPUs. So every bit in your mask represents
> a present CPU, but on my system, only the CPUs in 0x00000fff are ever
> present.
>
> Therefore, on my system, any bit higher than bit 11 in a CPU mask
> will reference a CPU that is not present.
Correct....

Sorry, I missed the part that your machine has only 12 CPUs....

Now I can reproduce the wreckage even with that trivial test I did:

[    0.210089] setup_percpu: NR_CPUS:32 nr_cpumask_bits:12 nr_cpu_ids:12 nr_node_ids:1
...
[    0.606591] smp: MASKBITS: 5555555555555555
[    0.607026] smp: CPUs: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

I'm way too tired to make sense of that right now. Will have a look at
it tomorrow with brain awake unless you beat me to it.

That's one mystery, but the other one is this:

[   71.273798][ T1185] irq_matrix_reserve_managed: MASKBITS:   ffffb1a74686bcd8

That's clearly a kernel address within the direct map. How does that end
up as the content of a cpumask?

Thanks,

        tglx