Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO | virtualization

(off-list ancestor, not in this archive)

On 05.01.21 03:14, Liang Li wrote:
In our production environment, three main applications have this
requirement. One is QEMU [creating a VM with an SR-IOV passthrough device];
the other two are DPDK-related applications, DPDK OVS and SPDK vhost. For
best performance, they populate memory when starting up. For SPDK vhost,
we make use of the VHOST_USER_GET/SET_INFLIGHT_FD feature for
vhost 'live' upgrade, which is done by killing the old process and
starting a new one with the new binary. In this case, we want the new
process to start as quickly as possible to shorten the service downtime.
We really do enable this feature to speed up startup time for them :)
Am I wrong or does using hugetlbfs/tmpfs ... i.e., a file not deleted
between shutting down the old instance and firing up the new instance
just solve this issue?
You are right, it works for the SPDK vhost upgrade case.
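
A minimal sketch of the "file not deleted between instances" idea, for
illustration only. The helper name open_backing and the file name are
hypothetical, and a temporary directory stands in for a tmpfs/hugetlbfs
mount so the snippet runs anywhere; this is not how SPDK vhost actually
manages its memory.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def open_backing(path, size):
    """Open (or create) a backing file and map it shared.

    Returns (mapping, reused), where reused is True when a file of the
    right size already existed, i.e. a previous instance left it behind.
    """
    reused = os.path.exists(path) and os.path.getsize(path) == size
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        if not reused:
            os.ftruncate(fd, size)  # fresh file: set the size, reads as zeros
        mem = mmap.mmap(fd, size, mmap.MAP_SHARED,
                        mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        os.close(fd)  # the mapping keeps its own duplicate of the fd
    return mem, reused


def demo_restart():
    """Simulate old instance -> shutdown -> new instance on the same file."""
    with tempfile.TemporaryDirectory() as d:  # stand-in for a tmpfs mount
        path = os.path.join(d, "vhost-mem")   # hypothetical file name
        mem, reused_first = open_backing(path, 4 * PAGE)
        mem[0:4] = b"live"                    # old instance touches memory
        mem.close()                           # old instance goes away
        mem, reused_second = open_backing(path, 4 * PAGE)
        data = bytes(mem[0:4])                # new instance sees the same pages
        mem.close()
        return reused_first, reused_second, data
```

On a real tmpfs or hugetlbfs mount the pages stay allocated as long as the
file exists, so the second open does not have to populate them again.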

Thanks for info on the use case!

All of these use cases either already use, or could use, huge pages
IMHO. It's not your ordinary proprietary gaming app :) This is where
pre-zeroing of huge pages could already help.
You are welcome.  For some historical reason, some of our services are
not using hugetlbfs, that is why I didn't start with hugetlbfs.

Just wondering, wouldn't it be possible to use tmpfs/hugetlbfs ...
creating a file and pre-zeroing it from another process, or am I missing
something important? At least for QEMU this should work AFAIK, where you
can just pass the file to be used using memory-backend-file.
If we use another process to create the file, we can offload the overhead
to that process, and there is no need to pre-zero its content; just
populating the memory is enough.
Right, if non-zero memory can be tolerated (e.g., for VMs it usually has to be).
I mean, obviously there is no need to pre-zero the file content in user
space; the kernel will do it when populating the memory.
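
As a rough sketch of what such a helper process could do: it only asks the
kernel to allocate the file's pages and never writes zeros itself. The
function name populate and the file name are hypothetical, and a temporary
directory stands in for the tmpfs/hugetlbfs mount. Whether the allocation
is actually backed by memory depends on the filesystem; on a filesystem
without native fallocate support, glibc may emulate the call by writing.

```python
import os
import tempfile


def populate(path, size):
    """Create the backing file and ask the kernel to allocate its pages.

    No zeroing happens in user space here; any zeroing is left to the
    kernel when it allocates the pages. Returns the resulting file size.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)  # allocate [0, size), extends the file
    finally:
        os.close(fd)
    return os.path.getsize(path)


def demo_populate(size=8 * 4096):
    """Populate a small file and read it back."""
    with tempfile.TemporaryDirectory() as d:  # stand-in for a tmpfs mount
        path = os.path.join(d, "guest-ram")   # hypothetical file name
        final_size = populate(path, size)
        with open(path, "rb") as f:
            content = f.read()
        return final_size, content
```

The resulting path is the kind of file that could then be handed to QEMU
via memory-backend-file, as mentioned above.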

If we do it that way, how do we determine the size of the file? It depends
on the RAM size of the VM the customer buys. Maybe we can create a file
large enough in advance and truncate it to the right size just before the
VM is created. Then, how many large files should be created on a host?
That's mostly already-existing scheduling logic, no? (How many VMs can I
put onto a specific machine eventually?)
It depends on how the scheduling component is designed. Yes, you can put
10 VMs with 4C8G (4 CPUs, 8G RAM) on one host and 20 VMs with 2C4G on
another one. But if one type, e.g. 4C8G, is sold out, customers can't
buy more 4C8G VMs even while some 2C4G VMs are still free, although the
resources reserved for those could be provided as 4C8G VMs.
1. You can, just the startup time will be a little slower? E.g., grow
pre-allocated 4G file to 8G.

2. Or let's be creative: teach QEMU to construct a single
RAMBlock/MemoryRegion out of multiple tmpfs files. Works as long as you
don't go crazy on different VM sizes / size differences.

3. In your example above, you can dynamically rebalance as VMs are
getting sold, to make sure you always have "big ones" lying around you
can shrink on demand.
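
Options 1 and 3 both come down to resizing an already populated file. A
minimal sketch, assuming a hypothetical helper resize_backing and using
4 KiB units in place of gigabytes so it runs quickly; a temporary
directory again stands in for tmpfs/hugetlbfs:

```python
import os
import tempfile

UNIT = 4096  # stands in for 1G in this scaled-down example


def resize_backing(path, new_size):
    """Grow or shrink a pre-allocated backing file.

    Growing allocates only the new tail, so the existing part is left
    alone; shrinking just truncates. Returns (old_size, new_size).
    """
    fd = os.open(path, os.O_RDWR)
    try:
        old_size = os.fstat(fd).st_size
        if new_size < old_size:
            os.ftruncate(fd, new_size)           # option 3: shrink a "big one"
        elif new_size > old_size:
            os.posix_fallocate(fd, old_size,     # option 1: grow, allocating
                               new_size - old_size)  # only the added range
        return old_size, os.fstat(fd).st_size
    finally:
        os.close(fd)


def demo_resize():
    """Grow a '4G' file to '8G', then shrink it to '2G'."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "guest-ram")  # hypothetical file name
        with open(path, "wb") as f:
            f.truncate(4 * UNIT)
        grown = resize_backing(path, 8 * UNIT)
        shrunk = resize_backing(path, 2 * UNIT)
        return grown, shrunk
```

With the grow path, only the difference has to be populated at VM start,
which is the "a little slower" cost mentioned in option 1.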

You must know there are a lot of functions in the kernel which could
be done in userspace, e.g. some of the device emulation like the APIC,
or the vhost-net backend, which has a userspace implementation.   :)
Bad or not depends on the benefits the solution brings.
From the viewpoint of a user space application, the kernel should
provide a high-performance memory management service. That's why
I think it should be done in the kernel.
As I expressed a couple of times already, I don't see why using
hugetlbfs and implementing some sort of pre-zeroing there isn't sufficient.

We really don't *want* complicated things deep down in the mm core if
there are reasonable alternatives.

-- 
Thanks,

David / dhildenb

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
