Re: [RFC PATCH 0/3] hugetlb: add demote/split page functionality

From: "Zi Yan" <ziy@nvidia.com>
Date: 2021-03-10 17:37:49
Also in: lkml

On 10 Mar 2021, at 12:05, Michal Hocko wrote:

On Wed 10-03-21 11:46:57, Zi Yan wrote:

quoted

On 10 Mar 2021, at 11:23, Michal Hocko wrote:

quoted

On Mon 08-03-21 16:18:52, Mike Kravetz wrote:
[...]

quoted

Converting larger to smaller hugetlb pages can be accomplished today by
first freeing the larger page to the buddy allocator and then allocating
the smaller pages.  However, there are two issues with this approach:
1) This process can take quite some time, especially if allocation of
   the smaller pages is not immediate and requires migration/compaction.
2) There is no guarantee that the total size of smaller pages allocated
   will match the size of the larger page which was freed.  This is
   because the area freed by the larger page could quickly be
   fragmented.

It will likely not be a surprise that I have some reservations. While your
concerns about reconfiguration through the existing interfaces are quite real,
is this really a problem in practice? How often do you need such a
reconfiguration?

Is this all really worth the additional code in something as tricky as
the hugetlb code base?

quoted

 include/linux/hugetlb.h |   8 ++
 mm/hugetlb.c            | 199 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 204 insertions(+), 3 deletions(-)

-- 
2.29.2

The high-level goal of this patchset seems to be to enable flexible huge page
allocation from a single pool when multiple huge page sizes are available.
The limitation of the existing mechanism is that the user has to specify
how many huge pages and how many gigantic pages they want before actual
use.

I believe I have understood this part. And I am not questioning that.
This seems useful. I am mostly asking whether we need such
flexibility, mostly because of the additional code and future
maintenance complexity, which has turned out to be a problem for a long time.
Each new feature tends to just add on top of the existing complexity.

I totally agree. This patchset looks to me like a partial reimplementation
of how the buddy allocator splits high-order free pages into lower-order
ones. That is why I had the crazy idea below.

quoted

I just want to throw out an idea here; please ignore it if it is too crazy.
Could we have a variant buddy allocator for huge page allocations
which only has the available huge page orders in its free lists? For example,
if the user wants 2MB and 1GB pages, the allocator will only have order-9 and
order-18 pages (with 4KB base pages); when order-9 pages run out, we can split
order-18 pages; if possible, adjacent order-9 pages can be merged back into
order-18 pages.
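
To make the idea above concrete, here is a toy user-space model of such a
two-order pool. It is only a sketch under stated assumptions, not kernel code
and not what the patchset does: 4KB base pages are assumed (so 2MB is order 9
and 1GB is order 18), and struct hpool, struct region, and the hpool_*()
helpers are hypothetical names made up for illustration. It tracks only counts
per reserved 1GB region rather than real page frames.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4KB base pages, so a 2MB page is order 9, a 1GB page order 18. */
#define SMALL_ORDER 9
#define LARGE_ORDER 18
#define SMALL_PER_LARGE (1UL << (LARGE_ORDER - SMALL_ORDER)) /* 512 */
#define MAX_LARGE 64

/* Hypothetical bookkeeping for one reserved 1GB region. */
struct region {
	int split;                /* nonzero once carved into order-9 pages */
	int large_in_use;         /* meaningful only while !split */
	unsigned long small_used; /* order-9 pages handed out while split */
};

/* Hypothetical pool holding only the two huge page orders. */
struct hpool {
	struct region r[MAX_LARGE];
	size_t nr;
};

void hpool_init(struct hpool *p, size_t nr_large)
{
	assert(nr_large <= MAX_LARGE);
	p->nr = nr_large;
	for (size_t i = 0; i < nr_large; i++)
		p->r[i] = (struct region){0};
}

/* Hand out a whole order-18 page; returns the region index or -1. */
int hpool_alloc_large(struct hpool *p)
{
	for (size_t i = 0; i < p->nr; i++) {
		if (!p->r[i].split && !p->r[i].large_in_use) {
			p->r[i].large_in_use = 1;
			return (int)i;
		}
	}
	return -1;
}

/* Hand out one order-9 page; returns the region it came from or -1. */
int hpool_alloc_small(struct hpool *p)
{
	/* Prefer regions that are already split to limit fragmentation. */
	for (size_t i = 0; i < p->nr; i++) {
		if (p->r[i].split && p->r[i].small_used < SMALL_PER_LARGE) {
			p->r[i].small_used++;
			return (int)i;
		}
	}
	/* Order-9 pages ran out: split a free order-18 page. */
	for (size_t i = 0; i < p->nr; i++) {
		if (!p->r[i].split && !p->r[i].large_in_use) {
			p->r[i].split = 1;
			p->r[i].small_used = 1;
			return (int)i;
		}
	}
	return -1;
}

/* Return one order-9 page; merge back once the whole region is free. */
void hpool_free_small(struct hpool *p, int idx)
{
	struct region *r = &p->r[idx];

	assert(r->split && r->small_used > 0);
	if (--r->small_used == 0)
		r->split = 0; /* all 512 buddies free: order-18 again */
}

/* Count order-18 pages that are still whole and free. */
size_t hpool_free_large_count(const struct hpool *p)
{
	size_t n = 0;

	for (size_t i = 0; i < p->nr; i++)
		if (!p->r[i].split && !p->r[i].large_in_use)
			n++;
	return n;
}
```

In this model a 2MB allocation only splits a 1GB page when no already-split
region has room, and a region becomes a 1GB page again as soon as its last
2MB page is freed. A real allocator would also need per-page state, locking,
and NUMA awareness, none of which is modeled here.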

I assume you mean to remove those pages from the allocator when they
are reserved rather than really used, right? I am not really sure how

No. The allocator maintains all the reserved pages for huge page allocations,
replacing the existing cma_alloc or alloc_contig_pages. The kernel builds
the free list when pages are reserved, either at boot time or at runtime.

you want to deal with lower orders consuming/splitting too much from
higher orders, which then makes those unusable for their intended use even
though they were preallocated for a specific workload. Another worry is that
the gap between 2MB and 1GB pages is just too big, so a single 2MB request
from the 1GB pool will make the whole 1GB page unusable even when the smaller
pool needs only a few pages.
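
The size of that gap can be put in numbers with a small sketch. It assumes
that every split 1GB page yields exactly 512 2MB pages and that a split page
is never merged back; the function names are hypothetical and only illustrate
the arithmetic.

```c
#include <assert.h>

#define SMALL_PER_LARGE 512UL /* 1GB / 2MB */

/* 1GB pages that must be split to serve nr_small 2MB requests. */
unsigned long large_pages_split(unsigned long nr_small)
{
	return (nr_small + SMALL_PER_LARGE - 1) / SMALL_PER_LARGE;
}

/* 2MB pages carved out of 1GB pages but not actually requested. */
unsigned long small_pages_stranded(unsigned long nr_small)
{
	return large_pages_split(nr_small) * SMALL_PER_LARGE - nr_small;
}

/* Whole 1GB pages left in a pool of nr_large after the requests. */
unsigned long large_pages_left(unsigned long nr_large, unsigned long nr_small)
{
	unsigned long split = large_pages_split(nr_small);

	assert(split <= nr_large); /* allocation failure is not modeled */
	return nr_large - split;
}
```

With a pool of four 1GB pages, a single 2MB request leaves three whole 1GB
pages and strands 511 2MB pages (1022MB) that no 1GB user can consume.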

Yeah, the gap between 2MB and 1GB is large. Fragmentation will be
a problem. Maybe we do not need to solve it right now, since this patchset
does not propose promoting/merging pages. Or we can reuse the existing
anti-fragmentation mechanisms, but with the pageblock size set to the
gigantic page size in this pool.

I admit my idea is a much more intrusive change, but I feel that if more and
more core mm functionality is being replicated in the hugetlb code, then why
not reuse the core mm code?


--
Best Regards,
Yan Zi
