Thread (63 messages) 63 messages, 3 authors, 2021-03-26

Re: [PATCH v5 1/5] mm,memory_hotplug: Allocate memmap from the added memory range

From: Oscar Salvador <osalvador@suse.de>
Date: 2021-03-24 10:13:50
Also in: lkml

On Tue, Mar 23, 2021 at 11:11:58AM +0100, Michal Hocko wrote:
[Sorry for a long overdue review. I didn't have time to follow previous
versions so I am sorry if some of my concerns have been discussed
already]
No worries, let's go ;-)
I was playing with movable_node and performance implications back in
2017 (unfortunately I do not have specific numbers anymore) and the
setup was a bit extreme - a single node (0) with normal zones and all
other nodes with movable memory only. So not only struct pages but any
other kernel metadata were on a remote node. I remember I could see
clear performance drop scaling with the distance from node 0 somewhere
betweem 5-10% on kbuild bound on a movable node.
I see. Yes, it is a rather extreme case, but I think it clearly shows the impact
of having metadata structures on a non-local node.

 
In fact beginning of the memory block should be sufficient as sections
cannot be hotremoved without the rest of the memory block.
Sorry, I meant memory block here.
quoted
struct pages which back the allocated space then just need to be treated
carefully.

Implementation wise we will reuse vmem_altmap infrastructure to override
the default allocator used by __populate_section_memmap.
Part of the implementation also relies on memory_block structure gaining
a new field which specifies the number of vmemmap_pages at the beginning.
Here you are talking about memory block rather than section.
Yes, see above, it should have been memory block in both cases.
quoted
Hot-remove:

 We need to be careful when removing memory, as adding and
 removing memory needs to be done with the same granularity.
 To check that this assumption is not violated, we check the
 memory range we want to remove and if a) any memory block has
 vmemmap pages and b) the range spans more than a single memory
 block, we scream out loud and refuse to proceed.
Is this a real problem? If each memory block has its own vmemmap then we
should be just fine, no?
Not entirely.

Assume this:

- memory_block_size = 128MB
- add_memory(256MB) : no uses altmap because size != memory_block_size
- add_memory(128MB) : uses altmap

Now, when trying to remove the memory, we should construct the altmap to let
remove_pmd_table->free_hugepage_table() know that it needs to call vmem_altmap_free()
instead of free_pagetable() for those sections that were populated using altmap.

But that becomes trickier to handle if user does remove_memory(384MB) at once.

The only reasonable way I can think of is something like:

/*
 * Try to diferentiate which ranges used altmap when populating vmemmap,
 * and construct the altmap for those
 */
 loop(size / section_size)
  if (range_used altmap)
   arch_remove_memory(nid, start, size, altmap);
  else
   arch_remove_memory(nid, start, size, NULL);
   
But I do not think this is any better than make this scenario completely a NO-NO,
because in the end, this is asking for trouble.
And yes, normal qemu/barematal users does not have the hability to play these
kind of tricks, as baremetal has HW limitations and qemu creates a device for
every range you hot-add (so you are tied to that device when removing memory
as well), but other users e.g: virtio-mem can do that.
I would appreciate some more description of the patch itself. The above
outlines a highlevel problems and design. The patch is quite large and
it acts on several layers - physical hotplug, {on,off}lining and sysfs
layer.
Ok, will try to come up with something more complete wrt changelog.
Let me capture my thinking:
- from the top level 
- sysfs interfaces - memory block is extended to contain the number of
  vmemmap pages reserved from the beginning of the block for all
  memory sections belonging to the block.
yes
- add_memory_resource is the entry point to reserve the vmemmap space
  for the block. This is an opt-in feature (MHP_MEMMAP_ON_MEMORY) and
  there is no current user at this stage.
yes
- vmem_altmap is instructed to use the reserved vmemmap space as the
  backing storage for the vmemmap struct pages. Via arch_add_memory->
  __populate_section_memmap.
yes
- online_pages for some reason needs to know about the reserved vmemmap
  space. Why? It already knows the intial pfn to online. Why cannot
  caller simply alter both start pfn and nr_pages to online everything
  after the vmemmap space? This is somehow conflating the mem block
  concept deeper into onlining.
- the same applies to offlining.
Because some counters need not only the buddy_nr_pages, but the complete
range.

So, let us see what online_pages() do (offline_pages() as well but slightly
different in some areas)

- move_pfn_range_to_zone():
  1) Resize node and zone spanned pages
     * If we were only to pass the nr_pages without the vmemmap pages,
       node/zone's spanned pages would be wrong as vmemmap pages would not
       be accounted in there.

  2) Inits struct pages by memmap_init_range() and sets its migratetype
     * If we were only to pass the nr_pages without the vmemmap pages,
       vmemmap pages would be totally unitialized.
       We also set its migratetype to MIGRATE_UNMOVABLE.
       Previous versions initialize vmemmap pages in another place but
       there was a consensus to do it here.

 So on, this case, we have:

 if (nr_vmemmap_pages)
    move_pfn_range_to_zone(zone, pfn, nr_vmemmap_pages, NULL,
                                       MIGRATE_UNMOVABLE);
 move_pfn_range_to_zone(zone, buddy_start_pfn, buddy_nr_pages, NULL
                        MIGRATE_ISOLATE);

- Increment zone->present_pages
  * We need to account buddy_pages + vmemmap_pages here

- zone->zone_pgdat->node_present_pages
  * Same as above

- online_pages_range() (onlines the pages, __and__ the sections)
  * Here do not only need the (buddy_pages, end_pages), but (vmemmap_pages, end_pages)
    as well, because on one hand we do:

    online_pages_range()
    {
       for (pfn = start_pfn; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
                (*online_page_callback)(pfn_to_page(pfn), MAX_ORDER - 1);

       online_mem_sections(start_pfn, end_pfn);
   }

   For the call to online_mem_sections, we need to whole range (including the vmemmap
   pages), otherwise, if a whole section only contains vmemmap pages, the section
   might be left marked as offline, and that is troublesome.


As I said, the same applies to offline_pages(), but with slightly tweaks here and
there because it handles somewhat different things.

I kind of understand to be reluctant to use vmemmap_pages terminology here, but
unfortunately we need to know about it.
We could rename nr_vmemmap_pages to offset_buddy_pages or something like that.

- finally hotremove - which is the most tricky part. try_remove_memory
  learns about vmemmap reserved space and provides it to __remove_pages
  and eventually all the way down to remove_pagetable via altmap
  Now a question and something I have stumbled over few years back when
  looking into this. Let's say you have multi section memblock so the
  first section of the block backs vmemmaps for all other sections.
  What happens when you drop the first worth of section before tearing
  down all other vmemmaps?
I guess you refer to the case were:

- memory_block_size: 1GB (8 sections)
[memory_block] : first 4096 pfns are for vmemmap

Nothing happens, but I see where your comment is comming from.

Back in 2017, in your prototype, there were two different things:

- try_remove_memory (I dunno how it was called back then) still worked
  with pages, not pfns
- arch_memory_memory() either did not have the altmap stuff, or we were
  not setting it properly, but I remember that in your prototype
  you were handling vmemmap pages in free_hugepage_table()->free_pagetable()
  being carefull to not free them.

Back then, when removing the first vmemmap backing further sections, when
then dereferencing those sections in free_pagetable(), we would crash because
the mapping was not there anymore.
This cannot longer happen.
[...]
quoted
@@ -185,10 +185,11 @@ memory_block_action(unsigned long start_section_nr, unsigned long action,
 
 	switch (action) {
 	case MEM_ONLINE:
-		ret = online_pages(start_pfn, nr_pages, online_type, nid);
+		ret = online_pages(start_pfn, nr_pages, nr_vmemmap_pages,
+				   online_type, nid);
 		break;
I would just offset start_pfn and nr_pages.
As stated above, this is not possible.

 
quoted
@@ -603,7 +606,7 @@ static int add_memory_block(unsigned long base_section_nr)
 	if (section_count == 0)
 		return 0;
 	return init_memory_block(memory_block_id(base_section_nr),
-				 MEM_ONLINE);
+				 MEM_ONLINE, 0);
This would deserve a comment.
	/* Early init code to create memory blocks for all the memory.
	 * Backed by bootmem struct pages so no vmemmap reserved space.
	 */
Ok, will add it.
quoted
 {
 	const unsigned long end_pfn = start_pfn + nr_pages;
-	unsigned long pfn;
+	unsigned long pfn = buddy_start_pfn;
+
+	/*
+	 * When using memmap_on_memory, the range might be unaligned as the
+	 * first pfns are used for vmemmap pages. Align it in case we need to.
+	 */
+	VM_BUG_ON(!IS_ALIGNED(pfn, pageblock_nr_pages));
No this is not something VM_BUG_ON should be used for. This is perfectly
recoverable situation. Besides that this is a wrong layer to care. All
the fixup should happen up in the call chain.
This should not happen anymore as mhp_support_memmap_on_memory() does not let
to use MHP_MEMMAP_ON_MEMORY if range is not pageblock_nr_pages.

So this can go.
quoted
 int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
-		       int online_type, int nid)
+		       unsigned long nr_vmemmap_pages, int online_type, int nid)
 {
-	unsigned long flags;
+	unsigned long flags, buddy_start_pfn, buddy_nr_pages;
 	struct zone *zone;
 	int need_zonelists_rebuild = 0;
 	int ret;
As already mentioned I believe this would be much easier to follow if
the given pfn really denotes a first pfn to online rather than learn the
code about vmemmap space which is not really interesting from the
onlining POV. Struct pages are already create. All we need is to online
them for using.
Just have a look at pfn vs. buddy_start_pfn usage. Why should
zone_for_pfn_range, node_states_check_changes_online, memory_notify ase
the former rather than later? As mentioned above online_pages_range is
just more complex by doing that.

Sure there are some consistency checks which are more convenient with
the actual pfn start but I believe those shouldn't be a reason for
obfuscating the code and mixing layers.
I think I explained this above, but let me repeat just in case.
Take into account that boot vmemmap_pages are also accounted to:
- zone's spanned_pages/present_pages
- node's spanned_pages/present_pages

And those pages are also initialized somehow, so we need to initialize the hotplug
vmemmap pages as well, and account them.

As I said, we can use a different terminology and name it different, but we need to
- properly account them
- properly initialize them

And I __guess__ we could do it somewhere off the {online,offline_pages()) land,
but I see that trickier and not worh it.
quoted
+	       IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) &&
+	       size == memory_block_size_bytes() &&
+	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
+	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
This is likely more complex than necessary. Is it ever possible that
remaining_size won't be aligned properly when vmemmap_size is PMD_SIZE
aligned?
Yes, on arm64 with large pages depending on HUGETLB support this can lead to
one condition be true while the other not.
quoted
@@ -1563,10 +1639,11 @@ static int count_system_ram_pages_cb(unsigned long start_pfn,
 	return 0;
 }
 
-int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
+int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+			unsigned long nr_vmemmap_pages)
same concern as online pages. Nobody should really care about vmemmap
reserved space. Maybe the accounting (count_system_ram_pages_cb) will
need some compensation but I have to say I got lost in this accounting
wrt to memory hotplugged memory. Where do we account hotadded memory to
system_ram_pages?
Quick summary of account:

- online_pages()->move_pfn_range_to_zone(): Accounts for node/zone's spanned pages
- online_pages()->zone->present_pages += nr_pages;
- zone->zone_pgdat->node_present_pages += nr_pages;
- online_pages()->online_pages_range()->generic_online_page()->totalram_pages_add():
  Accounts for totalram_pages
- online_pages()->adjust_managed_page_count(): Accounts for zone->managed_pages

So, as you can see, we have a mix with of spanned_pages,present_pages and
managed_pages in both {offline,online}_pages().
Vmemmap pages need to be properly accounted to spanned_pages,present_pages,
as we account bootmem vmemmap pages, but they do not be accounted in
managed_pages.
This made me scratch my head. I do not think this works for size
spanning multiple memory blocks. Maybe we do not allow something like
that happening. The logic seems inside out to me. I believe you want to
either pull arch_remove_memory into the walk_memory_blocks callback and
handle each memory block this way.
Here, what we fence off is the scenario I mentioned at the beginning, where
someone may call try_remove_memory() with memory_blocks both containing and
not vmemmap_pages.

So, if the user opt-in the feature, he needs to work with the same granularity
in the add and remove operations.

Well, that was a good feedback indeed, and a large one, I hope to have clarified
some of the questions raised.

-- 
Oscar Salvador
SUSE L3
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help