Thread (9 messages), 3 authors, 2021-11-05

Re: [PATCH v4] xen/balloon: add late_initcall_sync() for initial ballooning done

From: Juergen Gross <jgross@suse.com>
Date: 2021-11-04 16:45:01
Also in: lkml, stable, xen-devel

On 04.11.21 17:34, Boris Ostrovsky wrote:
On 11/4/21 12:21 PM, Juergen Gross wrote:
On 04.11.21 16:55, Boris Ostrovsky wrote:
On 11/3/21 9:55 PM, Boris Ostrovsky wrote:
On 11/2/21 5:19 AM, Juergen Gross wrote:
When running as PVH or HVM guest with actual memory < max memory the
hypervisor is using "populate on demand" in order to allow the guest
to balloon down from its maximum memory size. For this to work
correctly the guest must not touch more memory pages than its target
memory size as otherwise the PoD cache will be exhausted and the guest
is crashed as a result of that.

In extreme cases ballooning down might not be finished today before
the init process is started, which can consume lots of memory.

In order to avoid random boot crashes in such cases, add a late init
call to wait for ballooning down having finished for PVH/HVM guests.

Warn on console if initial ballooning fails, panic() after stalling
for more than 3 minutes per default. Add a module parameter for
changing this timeout.
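
The waiting behaviour described above can be sketched in user-space C. This is only an illustration of the "wait until done, give up after stalling too long" logic, not the code from the patch: the function name, the enum, the `stall_timeout_s` parameter and the sampled `remaining[]` array are all hypothetical, and where this sketch returns WAIT_TIMED_OUT the patch description calls for panic().

```c
#include <stddef.h>

/* Hypothetical result type; the patch description calls for panic() on timeout. */
enum wait_result { WAIT_DONE, WAIT_TIMED_OUT };

/*
 * Simulate waiting for initial ballooning down to finish.
 *
 * remaining[t] is the number of pages still to be given back at second t
 * (assumed sample data, n >= 1). After the last sample the value is
 * assumed to stay constant. The stall counter is reset whenever progress
 * is seen, so only a period without any progress counts against the timeout.
 */
static enum wait_result wait_for_balloon(const long *remaining, size_t n,
                                         int stall_timeout_s)
{
    long last = remaining[0];
    int stalled = 0;

    for (size_t t = 0; ; t++) {
        long cur = remaining[t < n ? t : n - 1];

        if (cur <= 0)
            return WAIT_DONE;          /* ballooning down finished */

        if (cur < last) {
            last = cur;                /* progress: restart stall timer */
            stalled = 0;
        } else if (t > 0) {
            stalled++;                 /* one more second without progress */
        }

        if (stalled > stall_timeout_s)
            return WAIT_TIMED_OUT;     /* stalled longer than allowed */
    }
}
```

With the default mentioned above, `stall_timeout_s` would correspond to 180 seconds. A slow but steadily progressing balloon never hits the timeout in this sketch, which matches the wording "stalling for more than 3 minutes".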

Cc: <redacted>
Reported-by: Marek Marczykowski-Górecki 
[off-list ref]
Signed-off-by: Juergen Gross <jgross@suse.com>


Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

This appears to have noticeable effect on boot time (and boot 
experience in general).


I have


   memory=1024
   maxmem=8192


And my boot time (on an admittedly slow box) went from 33 to 45 
seconds. And boot pauses in the middle while it is waiting for 
ballooning to complete.


[    5.062714] xen:balloon: Waiting for initial ballooning down having finished.
[    5.449696] random: crng init done
[   34.613050] xen:balloon: Initial ballooning down finished.

This shows that before it was just by chance that the PoD cache wasn't
exhausted.

True.

So at least I think we should consider bumping log level down from info.

Which level would you prefer? warn?

Notice? Although that won't make much difference as WARN is the default
level.

Right. That was my thinking.

I suppose we can't turn scrubbing off at this point?

I don't think we can be sure a ballooned page wasn't in use before. And
it could contain some data e.g. from the loaded initrd, maybe even put
there by the boot loader. So no, I wouldn't want to do that by default.

We could add another value to the xen_scrub_pages boot parameter, like
xen_scrub_pages=not-at-boot or some such. But this should be another
patch. And it should be documented that initrd or kernel data might
leak.
And if so, would you mind doing this while committing (I have one day
off tomorrow)?

Yes, of course.

Thanks.


Juergen
