RE: hv_balloon: kmsg about unhandled message is killing the system
From: Michael Kelley (LINUX) <hidden>
Date: 2021-12-13 05:54:23
From: Thomas Deutschmann <redacted> Sent: Wednesday, October 20, 2021 8:10 AM
To: linux-hyperv@vger.kernel.org
Subject: hv_balloon: kmsg about unhandled message is killing the system

Hi,

I am running a Hyper-V Gen2 VM with Gentoo Linux where I make use of the memory ballooning feature (8192MB RAM Minimum; 61440MB RAM Maximum; 20% memory buffer) for almost 2 years. Since kernel 5.14, the virtual machine will sometimes log _a lot_ of
kernel: [ 1022.277623] hv_balloon: Unhandled message: type: 0
kernel: [ 1022.277624] hv_balloon: Unhandled message: type: 32768
kernel: [ 1022.277625] hv_balloon: Unhandled message: type: 51200
kernel: [ 1022.277625] hv_balloon: Unhandled message: type: 59392
kernel: [ 1022.277689] hv_balloon: Ballooned pages: 1519104

messages, causing the log mountpoint (in my case the root mountpoint) to run out of disk space, which will kill the system in the end. I have never seen this before with any <5.14 kernel.

Of course, I tried to bisect the kernel multiple times, but I never was successful because it is not easy to trigger the problem. What seems to work best:

1) After start, wait ~60 seconds for
hv_balloon: Max. dynamic memory size: 61440 MB

message.

2) Now allocate some memory causing the VM to request more memory from the host system:

$ </dev/zero head -c 22G | pv -L 256M | tail

(Note: You have to do that slowly because the host will only grant more memory when memory pressure is constantly high, but when you are requesting memory too fast you will run out of memory)

3) Now end the process (CTRL+C) and wait until the VM has returned memory back to the host system.

4) Now I start to compile chromium and firefox with 20 threads each in parallel.

If the kernel is faulty, in most cases I'll see the kmsgs about unhandled message types within 10 minutes. If I get the message
hv_balloon: Balloon request will be partially fulfilled. Balloon floor reached

it's usually a sign of a working kernel.

But as said at the beginning, this is not 100% reliable. I already ended up with a kernel where I thought "This revision is fine" and suddenly the system died because millions of those messages were output. Or sometimes I am unable to trigger the failure again for a bad revision. See my last bisect attempt:

My apologies that someone did not get back to you sooner on this issue. Someone has recently found a bug that is the likely cause. See

https://lore.kernel.org/linux-hyperv/20211213014709.GA2316@anparri/T/#t

if you haven't already. I think the proposed fix works, but there may be some additional discussion about whether it is the best fix.

Michael Kelley
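
For anyone trying to reproduce this while bisecting, it may help to count the "Unhandled message" lines in a saved kernel log rather than watching for them by eye. The following is only a minimal sketch and is not part of the original report or the proposed fix; the function name `count_unhandled` is hypothetical, and the pattern simply assumes the log line format quoted above.

```python
import re
from collections import Counter

# Assumed line format, taken from the log excerpt quoted in the report:
#   "kernel: [ 1022.277624] hv_balloon: Unhandled message: type: 32768"
UNHANDLED_RE = re.compile(r"hv_balloon: Unhandled message: type: (\d+)")


def count_unhandled(lines):
    """Return a Counter mapping each unhandled message type to its count.

    `lines` is any iterable of log lines (a list, or an open file object).
    Lines that do not match the assumed format are ignored.
    """
    counts = Counter()
    for line in lines:
        match = UNHANDLED_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

An empty result would suggest the kernel under test did not hit the problem during that run, although, as the report notes, a clean run is not proof of a good revision.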