Re: [PATCH RFC] mm/madvise: introduce MADV_POPULATE to prefault/prealloc memory
From: David Hildenbrand <hidden>
Date: 2021-02-19 19:16:57
Also in:
linux-arch, linux-mips, linux-mm, lkml
> It's interesting to know about commit 1e356fc14be ("mem-prealloc: reduce large guest start-up and migration time.", 2017-03-14). It seems for speeding up VM boot, but what I can't understand is why it would cause the delay of hugetlb accounting - I thought we'd fail even earlier at either fallocate() on the hugetlb file (when we use /dev/hugepages) or on mmap() of the memfd which contains the huge pages. See hugetlb_reserve_pages() and its callers. Or did I miss something?

We should fail on mmap() when the reservation happens (unless MAP_NORESERVE is passed) I think.
> I think there's a special case if QEMU fork() with a MAP_PRIVATE hugetlbfs mapping, that could cause the memory accounting to be delayed until COW happens.

That would be kind of weird. I'd assume the reservation gets properly done during fork() - just like for VM_ACCOUNT.
> However that's definitely not the case for QEMU since QEMU won't work at all as late as that point. IOW, for hugetlbfs I don't know why we need to populate the pages at all if we simply want to know "whether we do still have enough space".. And IIUC 2) above is the major issue you'd like to solve too.

To avoid page faults at runtime on access I think. Reservation <= Preallocation.
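
As a rough illustration of what "preallocation" means here, as opposed to a mere reservation, the following Python sketch touches one byte in every page of an anonymous mapping so that each page is faulted in up front instead of on first access at runtime. The helper name prefault() is made up for illustration. This is only a user-space approximation under those assumptions; it is neither QEMU's code nor the madvise() interface proposed in this patch.

```python
import mmap


def prefault(mapping: mmap.mmap, page_size: int = mmap.PAGESIZE) -> int:
    """Touch one byte in every page of `mapping`; return the pages touched.

    prefault() is a hypothetical helper name, used only for this sketch.
    """
    touched = 0
    for offset in range(0, len(mapping), page_size):
        # Write back the value just read: this forces a write fault, so a
        # page gets allocated now, without changing the contents.
        mapping[offset] = mapping[offset]
        touched += 1
    return touched


if __name__ == "__main__":
    # Anonymous mapping: creating it does not populate any pages yet.
    region = mmap.mmap(-1, 16 * mmap.PAGESIZE)
    print(prefault(region), "pages prefaulted")
    region.close()
```

If the memory cannot be provided, a loop like this hits the failure at setup time rather than somewhere in the middle of the workload, which is the point of preallocating.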
I just learned that there is more to it: (test done on v5.9)

# echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
# cat /sys/devices/system/node/node*/meminfo | grep HugePages_
Node 0 HugePages_Total: 512
Node 0 HugePages_Free: 512
Node 0 HugePages_Surp: 0
Node 1 HugePages_Total: 0
Node 1 HugePages_Free: 0
Node 1 HugePages_Surp: 0
# cat /proc/meminfo | grep HugePages_
HugePages_Total: 512
HugePages_Free: 512
HugePages_Rsvd: 0
HugePages_Surp: 0

# /usr/libexec/qemu-kvm -m 1G -smp 1 -object memory-backend-memfd,id=mem0,size=1G,hugetlb=on,hugetlbsize=2M,policy=bind,host-nodes=0 -numa node,nodeid=0,memdev=mem0 -hda Fedora-Cloud-Base-Rawhide-20201004.n.1.x86_64.qcow2 -nographic
-> works just fine

# /usr/libexec/qemu-kvm -m 1G -smp 1 -object memory-backend-memfd,id=mem0,size=1G,hugetlb=on,hugetlbsize=2M,policy=bind,host-nodes=1 -numa node,nodeid=0,memdev=mem0 -hda Fedora-Cloud-Base-Rawhide-20201004.n.1.x86_64.qcow2 -nographic
-> Does not fail nicely but crashes!

See https://bugzilla.redhat.com/show_bug.cgi?id=1686261 for something similar, however, it no longer applies like that on more recent kernels.

Hugetlbfs reservations don't always protect you (especially with NUMA) - that's why e.g., libvirt always tells QEMU to prealloc.

I think the "issue" is that the reservation happens on mmap(). mbind() runs afterwards. Preallocation saves you from that.

I suspect something similar will happen with anonymous memory with mbind() even if we reserved swap space. Did not test yet, though.

--
Thanks,

David / dhildenb