From: David Moses <redacted> Sent: Friday, August 6, 2021 2:20 AM
Hi Michael,
We are running kernel 4.19.195 (the fix Wei Liu suggested of moving the
cpumask_empty check after disabling interrupts is included in this version)
with the default Hyper-V version.
I'm getting the 4-byte garbage read (trace included) almost every night.
We are running on an Azure VM Standard D64s_v4 with 64 cores (our system includes
three such VMs). The application has very high I/O traffic involving iSCSI.
We believe this issue causes stack corruption in the RT scheduler, as I forwarded
in the previous mail.
Let us know what else is needed to clarify the problem.
Is it just Hyper-V related, or could it be a general kernel issue?
Thx David
Moreover, when I add the patch/fix below:
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 5b58a6c..165727a 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -298,6 +298,9 @@ static inline struct hv_vp_assist_page *hv_get_vp_assist_page(unsigned int cpu)
  */
 static inline int hv_cpu_number_to_vp_number(int cpu_number)
 {
+	if (WARN_ON_ONCE(cpu_number < 0 || cpu_number >= num_possible_cpus()))
+		return VP_INVAL;
+
 	return hv_vp_index[cpu_number];
 }
we have evidence that we hit this warning; see below:
Aug 5 21:03:01 c-node11 kernel: [17147.089261] WARNING: CPU: 15 PID: 8973 at arch/x86/include/asm/mshyperv.h:301 hyperv_flush_tlb_others+0x1f7/0x760
Aug 5 21:03:01 c-node11 kernel: [17147.089275] RIP: 0010:hyperv_flush_tlb_others+0x1f7/0x760
Aug 5 21:03:01 c-node11 kernel: [17147.089275] Code: ff ff be 40 00 00 00 48 89 df e8 c4 ff 3a 00 85 c0 48 89 c2 78 14 48 8b 3d be 52 32 01 f3 48 0f b8 c7 39 c2 0f 82 7e 01 00 00 <0f> 0b ba ff ff ff ff 89 d7 48 89 de e8 68 87 7d 00 3b 05 66 54 32
Aug 5 21:03:01 c-node11 kernel: [17147.089275] RSP: 0018:ffff8c536bcafa38 EFLAGS: 00010046
Aug 5 21:03:01 c-node11 kernel: [17147.089275] RAX: 0000000000000040 RBX: ffff8c339542ea00 RCX: ffffffffffffffff
Aug 5 21:03:01 c-node11 kernel: [17147.089275] RDX: 0000000000000040 RSI: ffffffffffffffff RDI: ffffffffffffffff
Aug 5 21:03:01 c-node11 kernel: [17147.089275] RBP: ffff8c339878b000 R08: ffffffffffffffff R09: ffffe93ecbcaa0e8
Aug 5 21:03:01 c-node11 kernel: [17147.089275] R10: 00000000020e0000 R11: 0000000000000000 R12: ffff8c536bcafa88
Aug 5 21:03:01 c-node11 kernel: [17147.089275] R13: ffffe93efe1ef980 R14: ffff8c339542e600 R15: 00007ffcbc390000
Aug 5 21:03:01 c-node11 kernel: [17147.089275] FS: 00007fcb8eae37a0(0000) GS:ffff8c339f7c0000(0000) knlGS:0000000000000000
Aug 5 21:03:01 c-node11 kernel: [17147.089275] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 5 21:03:01 c-node11 kernel: [17147.089275] CR2: 000000000135d1d8 CR3: 0000004037137005 CR4: 00000000003606e0
Aug 5 21:03:01 c-node11 kernel: [17147.089275] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 5 21:03:01 c-node11 kernel: [17147.089275] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 5 21:03:01 c-node11 kernel: [17147.089275] Call Trace:
Aug 5 21:03:01 c-node11 kernel: [17147.089275] flush_tlb_mm_range+0xc3/0x120
Aug 5 21:03:01 c-node11 kernel: [17147.089275] ptep_clear_flush+0x3a/0x40
Aug 5 21:03:01 c-node11 kernel: [17147.089275] wp_page_copy+0x2e6/0x8f0
Aug 5 21:03:01 c-node11 kernel: [17147.089275] ? reuse_swap_page+0x13d/0x390
Aug 5 21:03:01 c-node11 kernel: [17147.089275] do_wp_page+0x99/0x4c0
Aug 5 21:03:01 c-node11 kernel: [17147.089275] __handle_mm_fault+0xb4e/0x12c0
Aug 5 21:03:01 c-node11 kernel: [17147.089275] ? memcg_kmem_get_cache+0x76/0x1a0
Aug 5 21:03:01 c-node11 kernel: [17147.089275] handle_mm_fault+0xd6/0x200
Aug 5 21:03:01 c-node11 kernel: [17147.089275] __get_user_pages+0x29e/0x780
Aug 5 21:03:01 c-node11 kernel: [17147.089275] get_user_pages_remote+0x12c/0x1b0
(FYI -- email to the Linux kernel mailing lists should be in plaintext format, and
not use HTML or other formatting.)
This is an excellent experiment. It certainly suggests that the cpumask that is
passed to hyperv_flush_tlb_others() has bits set for CPUs numbered 64 and above, which don't exist.
If that's the case, it would seem to be a general kernel issue rather than something
specific to Hyper-V.
Since it looks like you are able to add debugging code to the kernel, here are a couple
of thoughts:
1) In hyperv_flush_tlb_others() after the call to disable interrupts, check the value
of cpumask_last(cpus), and if it is greater than or equal to num_possible_cpus(), execute a printk()
statement that outputs the entire contents of the cpumask that is passed in. There's
a special printk format string for printing out bitmaps like cpumasks. Let me know
if you would like some help on this code -- I can provide a diff later today. Seeing
what the "bad" cpumask looks like might give some clues as to the problem.
2) As a different experiment, you can disable the Hyper-V specific flush routines
entirely. At the end of the mmu.c source file, have hyperv_setup_mmu_ops()
always return immediately. In this case, the generic Linux kernel flush routines
will be used instead of the Hyper-V ones. The code may be marginally slower,
but it will then be interesting to see if a problem shows up elsewhere.
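To make the check in suggestion 1 concrete, here is a minimal userspace sketch of
the logic. It is a model only, not kernel code and not the diff offered above: the
model_* names and MODEL_NR_CPUS are hypothetical stand-ins. In the kernel, the
corresponding pieces would be cpumask_last(), num_possible_cpus(), and the "%*pbl"
printk format used together with cpumask_pr_args(). The sketch assumes that an empty
mask makes the "last" lookup return the mask size, so an empty mask is reported too.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical compiled-in maximum, playing the role of the mask size. */
#define MODEL_NR_CPUS 128
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MASK_WORDS ((MODEL_NR_CPUS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical stand-in for struct cpumask: one bit per CPU number. */
struct model_cpumask {
	unsigned long bits[MASK_WORDS];
};

static void model_set_cpu(struct model_cpumask *m, unsigned int cpu)
{
	assert(cpu < MODEL_NR_CPUS);
	m->bits[cpu / BITS_PER_WORD] |= 1UL << (cpu % BITS_PER_WORD);
}

/* Highest set bit in the mask, or MODEL_NR_CPUS if no bit is set. */
static unsigned int model_cpumask_last(const struct model_cpumask *m)
{
	for (unsigned int cpu = MODEL_NR_CPUS; cpu-- > 0; )
		if (m->bits[cpu / BITS_PER_WORD] & (1UL << (cpu % BITS_PER_WORD)))
			return cpu;
	return MODEL_NR_CPUS;
}

/*
 * Returns 1 and dumps the whole mask (most significant word first) when the
 * last set bit is not a valid CPU number; returns 0 otherwise. An empty mask
 * also returns 1, because model_cpumask_last() then yields MODEL_NR_CPUS.
 */
static int model_check_mask(const struct model_cpumask *m,
			    unsigned int nr_possible)
{
	unsigned int last = model_cpumask_last(m);

	if (last < nr_possible)
		return 0;

	fprintf(stderr, "bad cpumask: last=%u possible=%u mask:", last,
		nr_possible);
	for (unsigned int i = MASK_WORDS; i-- > 0; )
		fprintf(stderr, " %016lx", m->bits[i]);
	fprintf(stderr, "\n");
	return 1;
}
```

Calling model_check_mask(&mask, 64) models the 64-CPU VM described above: a mask
whose highest bit is 63 passes, while a mask with bit 64 set, or with no bits set
at all, is dumped in full.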
But based on your experiment, I'm guessing that there's a general kernel issue
rather than something specific to Hyper-V.
Have you run 4.19 kernels previous to 4.19.195 that didn't have this problem? If
you have a kernel version that is good, the ultimate step would be to do
a bisect and find out where the problem was introduced in the 4.19-series. That
could take a while, but it would almost certainly identify the problematic
code change and would be beneficial to the Linux kernel community in
general.
Michael