RE: hv_balloon: kmsg about unhandled message is killing the system

From: Michael Kelley (LINUX) <hidden>
Date: 2021-12-13 05:54:23

From: Thomas Deutschmann <redacted> Sent: Wednesday, October 20, 2021 8:10 AM

To: linux-hyperv@vger.kernel.org
Subject: hv_balloon: kmsg about unhandled message is killing the system

Hi,

I am running a Hyper-V Gen2 VM with Gentoo Linux where I have been using
the memory ballooning feature (8192MB RAM Minimum; 61440MB RAM Maximum;
20% memory buffer) for almost 2 years. Since kernel 5.14, the virtual
machine will sometimes log _a lot_ of

kernel: [ 1022.277623] hv_balloon: Unhandled message: type: 0
kernel: [ 1022.277624] hv_balloon: Unhandled message: type: 32768
kernel: [ 1022.277625] hv_balloon: Unhandled message: type: 51200
kernel: [ 1022.277625] hv_balloon: Unhandled message: type: 59392
kernel: [ 1022.277689] hv_balloon: Ballooned pages: 1519104

messages, causing the log mountpoint (in my case the root mountpoint) to
run out of disk space, which eventually kills the system.

I have never seen this before with any <5.14 kernel.
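
To gauge how fast these messages accumulate, the kernel log can be tallied
by message type. The following is a minimal sketch, not part of the original
report: `count_unhandled` is a hypothetical helper name, and it assumes the
log lines have the format shown above, with the type as the last field.

```shell
# Hypothetical helper: tally hv_balloon "Unhandled message" lines by type.
# Reads kernel log text on stdin and prints "<type> <count>" per line,
# sorted numerically by type. Other hv_balloon lines are ignored.
count_unhandled() {
    awk '/hv_balloon: Unhandled message: type:/ { n[$NF]++ }
         END { for (t in n) print t, n[t] }' | sort -n
}
```

It could be fed from the kernel ring buffer, for example with
`dmesg | count_unhandled`.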

Of course, I tried to bisect the kernel multiple times, but I never was
successful because it is not easy to trigger the problem. What seems to
work best:

1) After start, wait ~60 seconds for

hv_balloon: Max. dynamic memory size: 61440 MB

message.

2) Now allocate some memory causing the VM to request more memory from
the host system:

   $ </dev/zero head -c 22G | pv -L 256M | tail

   (Note: You have to do that slowly because the host will only grant
          more memory when memory pressure is constantly high,
          but if you request memory too fast you will
          run out of memory)

3) Now end the process (CTRL+C) and wait until the VM has returned
the memory to the host system.

4) Now I start to compile chromium and firefox with 20 threads each in
parallel.
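
Steps 1 and 2 above can be wrapped in two small shell functions. This is
only a sketch under assumptions: the function names `balloon_ready` and
`apply_pressure` are hypothetical, the defaults are the values from the
report, and GNU `head` and `pv` are assumed to be installed.

```shell
# Step 1 check: succeed once the balloon driver has reported its maximum
# dynamic memory size. Reads kernel log text on stdin.
balloon_ready() {
    grep -qF 'hv_balloon: Max. dynamic memory size:'
}

# Step 2: the allocation command from the report, wrapped so that the
# total size and the rate limit can be varied (defaults: 22G at 256M/s).
# tail buffers the zero stream in memory, which creates the pressure.
apply_pressure() {
    size=${1:-22G}
    rate=${2:-256M}
    </dev/zero head -c "$size" | pv -L "$rate" | tail
}
```

One possible way to use them is
`until dmesg | balloon_ready; do sleep 5; done; apply_pressure`, followed
by CTRL+C as in step 3.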

If the kernel is faulty, in most cases I'll see the kmsgs about
unhandled message types within 10 minutes. If I get the message

hv_balloon: Balloon request will be partially fulfilled. Balloon floor reached

it's usually a sign of a working kernel.

But as said at the beginning, this is not 100% reliable. I have already
ended up with a kernel where I thought "This revision is fine" and then
the system suddenly died because millions of those messages were logged.
Or sometimes I am unable to trigger the failure again for a bad revision.
See my last bisect attempt:

My apologies that someone did not get back to you sooner on this issue.
Someone has recently found a bug that is the likely cause.  See
https://lore.kernel.org/linux-hyperv/20211213014709.GA2316@anparri/T/#t
if you haven't already.  I think the proposed fix works, but there may
be some additional discussion about whether it is the best fix.

Michael Kelley
