Re: [PATCH v2 2/2] mm/hugetlb: refactor subpage recording

From: Mike Kravetz <hidden>
Date: 2021-02-13 21:05:50
Also in: lkml

On 2/13/21 7:44 AM, Zi Yan wrote:

quoted hunk ↗ jump to hunk

On 11 Feb 2021, at 18:44, Mike Kravetz wrote:

quoted

On 2/11/21 12:47 PM, Zi Yan wrote:

quoted

On 28 Jan 2021, at 16:53, Mike Kravetz wrote:

quoted

On 1/28/21 10:26 AM, Joao Martins wrote:

quoted

For a given hugepage backing a VA, there's a rather ineficient
loop which is solely responsible for storing subpages in GUP
@pages/@vmas array. For each subpage we check whether it's within
range or size of @pages and keep increment @pfn_offset and a couple
other variables per subpage iteration.

Simplify this logic and minimize the cost of each iteration to just
store the output page/vma. Instead of incrementing number of @refs
iteratively, we do it through pre-calculation of @refs and only
with a tight loop for storing pinned subpages/vmas.

Additionally, retain existing behaviour with using mem_map_offset()
when recording the subpages for configurations that don't have a
contiguous mem_map.

pinning consequently improves bringing us close to
{pin,get}_user_pages_fast:

  - 16G with 1G huge page size
  gup_test -f /mnt/huge/file -m 16384 -r 30 -L -S -n 512 -w

PIN_LONGTERM_BENCHMARK: ~12.8k us -> ~5.8k us
PIN_FAST_BENCHMARK: ~3.7k us

Signed-off-by: Joao Martins <redacted>
---
 mm/hugetlb.c | 49 ++++++++++++++++++++++++++++---------------------
 1 file changed, 28 insertions(+), 21 deletions(-)

Thanks for updating this.

Reviewed-by: Mike Kravetz <redacted>

I think there still is an open general question about whether we can always
assume page structs are contiguous for really big pages.  That is outside

I do not think page structs need to be contiguous, but PFNs within a big page
need to be contiguous, at least based on existing code like mem_map_offset() we have.

Thanks for looking Zi,
Yes, PFNs need to be contiguous.  Also, as you say page structs do not need
to be contiguous.  The issue is that there is code that assumes page structs
are contiguous for gigantic pages.  hugetlb code does not make this assumption
and does a pfn_to_page() when looping through page structs for gigantic pages.

I do not believe this to be a huge issue.  In most cases CONFIG_VIRTUAL_MEM_MAP
is defined and struct pages can be accessed contiguously.  I 'think' we could
run into problems with CONFIG_SPARSEMEM and without CONFIG_VIRTUAL_MEM_MAP
and doing hotplug operations.  However, I still need to look into more.

Yeah, you are right about this. The combination of CONFIG_SPARSEMEM,
!CONFIG_SPARSEMEM_VMEMMAP and doing hotplug does cause errors, as simple as
dynamically reserving gigantic hugetlb pages then freeing them in a system
with CONFIG_SPARSEMEM_VMEMMAP not set and some hotplug memory.

Here are the steps to reproduce:
0. Configure a kernel with CONFIG_SPARSEMEM_VMEMMAP not set.
1. Create a VM using qemu with “-m size=8g,slots=16,maxmem=16g” to enable hotplug.
2. After boot the machine, add large enough memory using
   “object_add memory-backend-ram,id=mem1,size=7g” and
   “device_add pc-dimm,id=dimm1,memdev=mem1”.
3. In the guest OS, online all hot-plugged memory. My VM has 128MB memory block size.
If you have larger memory block size, I think you will need to plug in more memory.
4. Reserve gigantic hugetlb pages so that hot-plugged memory will be used. I reserved
12GB, like “echo 12 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages”.
5. Free all hugetlb gigantic pages,
“echo 0 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages”.
6. You will get “BUG: Bad page state in process …” errors.

The patch below can fix the error, but I suspect there might be other places missing
the necessary mem_map_next() too.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4bdb58ab14cb..aae99c6984f3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c

@@ -1319,7 +1319,8 @@ static void update_and_free_page(struct hstate *h, struct page *page)
        h->nr_huge_pages--;
        h->nr_huge_pages_node[page_to_nid(page)]--;
        for (i = 0; i < pages_per_huge_page(h); i++) {
-               page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
+               struct page *subpage = mem_map_next(subpage, page, i);
+               subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
                                1 << PG_referenced | 1 << PG_dirty |
                                1 << PG_active | 1 << PG_private |
                                1 << PG_writeback);

—
Best Regards,
Yan Zi

Thanks for confirming my suspicions Zi!

I thought hugetlb code always handled this situation, but was obviously
incorrect.

Perhaps the bigger issue is the GUP code which has the same problem as
suspected by Joao when we were discussing the first version of this patch.
It also is going traverse the list of page structs with page++.  I'll
point the people on that original thread to your findings here.
-- 
Mike Kravetz

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help