Re: [External] Re: [PATCH v16 4/9] mm: hugetlb: alloc the vmemmap pages... | linux-doc

Re: [External] Re: [PATCH v16 4/9] mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page

From: Muchun Song <hidden>
Date: 2021-02-23 10:28:32
Also in: linux-fsdevel, linux-mm, lkml

On Tue, Feb 23, 2021 at 5:28 PM Oscar Salvador [off-list ref] wrote:

On Mon, Feb 22, 2021 at 04:00:27PM -0800, Mike Kravetz wrote:

quoted

-static void update_and_free_page(struct hstate *h, struct page *page)
+static int update_and_free_page(struct hstate *h, struct page *page)
+   __releases(&hugetlb_lock) __acquires(&hugetlb_lock)
 {
    int i;
+   int nid = page_to_nid(page);

    if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
-           return;
+           return 0;

    h->nr_huge_pages--;
-   h->nr_huge_pages_node[page_to_nid(page)]--;
+   h->nr_huge_pages_node[nid]--;
+   VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
+   VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
+   set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
+   set_page_refcounted(page);

I think you added the set_page_refcounted() because the huge page will
appear as just a compound page without a reference after dropping the
hugetlb lock?  It might be better to set the reference before modifying
the destructor.  Otherwise, page scanning code could find the non-hugetlb
compound page with no reference.  I could not find any code where this
would be a problem, but I think it would be safer to set the reference
first.

But we already had set_page_refcounted() before this patchset there.
Are the worries only because we drop the lock? AFAICS, the "page-scanning"
problem could have happened before as well?
Although, what does page scanning mean in this context?

I am not opposed to move it above, but I would like to understand the concern
here.

quoted

+   spin_unlock(&hugetlb_lock);

I really like the way this code is structured.  It is much simpler than
previous versions with retries or workqueue.  There is nothing wrong with
always dropping the lock here.  However, I wonder if we should think about
optimizing for the case where this feature is not enabled and we are not
freeing a 1G huge page.  I suspect this will be the most common case for
some time, and there is no need to drop the lock in this case.

Please do not change the code based on my comment.  I just wanted to bring
this up for thought.

Is it as simple as checking?
        if (free_vmemmap_pages_per_hpage(h) || hstate_is_gigantic(h))
                spin_unlock(&hugetlb_lock);

        /* before return */
        if (free_vmemmap_pages_per_hpage(h) || hstate_is_gigantic(h))
                spin_lock(&hugetlb_lock);

AFAIK, we at least need the hstate_is_gigantic? Comment below says that
free_gigantic_page might block, so we need to drop the lock.
And I am fine with the change overall.

Unless I am missing something, we should not need to drop the lock unless
we need to allocate vmemmap pages (apart from gigantic pages).

quoted

+
+   if (alloc_huge_page_vmemmap(h, page)) {
+           int zeroed;
+
+           spin_lock(&hugetlb_lock);
+           INIT_LIST_HEAD(&page->lru);
+           set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
+           h->nr_huge_pages++;
+           h->nr_huge_pages_node[nid]++;

I think prep_new_huge_page() does this for us?

Actually, there are some differences. e.g. prep_new_huge_page()
will reset hugetlb cgroup and ClearHPageFreed, but we do not need
them here. And prep_new_huge_page will acquire and release
the hugetlb_lock. But here we also need hold the lock to update
the surplus counter and enqueue the page to the free list.
So I do not think reuse prep_new_huge_page is a good idea.

quoted

+
+           /*
+            * If we cannot allocate vmemmap pages, just refuse to free the
+            * page and put the page back on the hugetlb free list and treat
+            * as a surplus page.
+            */
+           h->surplus_huge_pages++;
+           h->surplus_huge_pages_node[nid]++;
+
+           /*
+            * This page is now managed by the hugetlb allocator and has
+            * no users -- drop the last reference.
+            */
+           zeroed = put_page_testzero(page);
+           VM_BUG_ON_PAGE(!zeroed, page);

Can this actually happen? AFAIK, page landed in update_and_free_page should be
zero refcounted, then we increase the reference, and I cannot see how the
reference might have changed in the meantime.

I am not sure whether other modules get the page and then put the
page. I see gather_surplus_pages does the same thing. So I copied
from there. I try to look at the memory_failure routine.


CPU0:                           CPU1:
                                set_compound_page_dtor(HUGETLB_PAGE_DTOR);
memory_failure_hugetlb
  get_hwpoison_page
    __get_hwpoison_page
      get_page_unless_zero
                                put_page_testzero()

Maybe this can happen. But it is a very corner case. If we want to
deal with this. We can put_page_testzero() first and then
set_compound_page_dtor(HUGETLB_PAGE_DTOR).

I am all for catching corner cases, but not sure how realistic this is.
Moreover, if we __ever__ get there, things can get nasty.

We basically will have an in-use page in the free hugetlb pool, so corruption
will happen. At that point, a plain BUG_ON might be better.

But as I said, I do not think we need that.

I yet need to look further, but what I have seen so far looks good.

--
Oscar Salvador
SUSE L3

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help