Thread (35 messages) 35 messages, 9 authors, 2017-03-15

Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

From: Igor Mammedov <hidden>
Date: 2017-03-03 17:34:55
Also in: linux-acpi, linux-mm, linux-s390, lkml

On Fri, 3 Mar 2017 09:27:23 +0100
Michal Hocko [off-list ref] wrote:
On Thu 02-03-17 18:03:15, Igor Mammedov wrote:
quoted
On Thu, 2 Mar 2017 15:28:16 +0100
Michal Hocko [off-list ref] wrote:
  
quoted
On Thu 02-03-17 14:53:48, Igor Mammedov wrote:
[...]  
quoted
When trying to support memory unplug on guest side in RHEL7,
experience shows otherwise. Simplistic udev rule which onlines
added block doesn't work in case one wants to online it as movable.

Hotplugged blocks in current kernel should be onlined in reverse
order to online blocks as movable depending on adjacent blocks zone.    
Could you be more specific please? Setting online_movable from the udev
rule should just work regardless of the ordering or the state of other
memblocks. If that doesn't work I would call it a bug.  
It's rather an implementation constrain than a bug
for details and workaround patch see
 [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7  
"You are not authorized to access bug #1314306"
Sorry,
I've made it public, related comments and patch should be accessible now
(code snippets in BZ are based on older kernel but logic is still the same upstream)
 
could you paste the reasoning here please?
sure here is reproducer:
start VM with CLI:
  qemu-system-x86_64  -enable-kvm -m size=1G,slots=2,maxmem=4G -numa node \
  -object memory-backend-ram,id=m1,size=1G -device pc-dimm,node=0,memdev=m1 \
  /path/to/guest_image

then in guest dimm1 blocks are from 32-39

  echo online_movable > /sys/devices/system/memory/memory32/state
-bash: echo: write error: Invalid argument

in current mainline kernel it triggers following code path:

online_pages()
  ...
       if (online_type == MMOP_ONLINE_KERNEL) {                                 
                if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))    
                        return -EINVAL;

  zone_can_shift()
    ...
        if (idx < target) {                                                      
                /* pages must be at end of current zone */                       
                if (pfn + nr_pages != zone_end_pfn(zone))                        
                        return false;            

since we are trying to online as movable not the last section in
ZONE_NORMAL.

Here is what makes hotplugged memory end up in ZONE_NORMAL:
 acpi_memory_enable_device() -> add_memory -> add_memory_resource ->
   -> arch/x86/mm/init_64.c  

     /*
      * Memory is added always to NORMAL zone. This means you will never get
      * additional DMA/DMA32 memory.
      */
     int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
     {
        ...
        struct zone *zone = pgdat->node_zones +
                zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);

i.e. all hot-plugged memory modules always go to ZONE_NORMAL
and only the first/last block in zone is allowed to be moved
to another zone. Patch [1] tries to fix issue by assigning
removable memory resource to movable zone so hotplugged+removable
blocks look like:
  movable normal, movable, movable
instead of current:
  normal, normal, normal movable

but then with this fixed as suggested, auto online by default
should work just fine in kernel with normal and movable zones
without any need for user-space.
quoted
patch attached there is limited by another memory hotplug
issue, which is NORMAL/MOVABLE zone balance, if kernel runs
on configuration where the most of memory is hot-removable
kernel might experience lack of memory in zone NORMAL.  
yes and that is an inherent problem of movable memory.
quoted
quoted
quoted
Which means simple udev rule isn't usable since it gets event from
the first to the last hotplugged block order. So now we would have
to write a daemon that would
 - watch for all blocks in hotplugged memory appear (how would it know)
 - online them in right order (order might also be different depending
   on kernel version)
   -- it becomes even more complicated in NUMA case when there are
      multiple zones and kernel would have to provide user-space
      with information about zone maps

In short current experience shows that userspace approach
 - doesn't solve issues that Vitaly has been fixing (i.e. onlining
   fast and/or under memory pressure) when udev (or something else
   might be killed)    
yeah and that is why the patch does the onlining from the kernel.  
onlining in this patch is limited to hyperv and patch breaks
auto-online on x86 kvm/vmware/baremetal as they reuse the same
hotplug path.  
Those can use the udev or do you see any reason why they couldn't?
Reasons are above, under >>>> and >> quotations, patch breaks
what Vitaly's fixed (including kvm/vmware usecases) i.e. udev/some
user-space process could be killed if hotplugged memory isn't onlined
fast enough leading to service termination and/or memory not
being onlined at all (if udev is killed)

Currently udev rule is not usable and one needs a daemon
which would correctly do onlining and keep zone balance
even for simple case usecase of 1 normal and 1 movable zone.
And it gets more complicated in case of multiple numa nodes
with multiple zones.
quoted
quoted
quoted
quoted
Can you imagine any situation when somebody actually might want to have
this knob enabled? From what I understand it doesn't seem to be the
case.    
For x86:
 * this config option is enabled by default in recent Fedora,    
How do you want to support usecases which really want to online memory
as movable? Do you expect those users to disable the option because
unless I am missing something the in kernel auto onlining only supporst
regular onlining.  
current auto onlining config option does what it's been designed for,
i.e. it onlines hotplugged memory.
It's possible for non average Fedora user to override default
(commit 86dd995d6) if she/he needs non default behavior
(i.e. user knows how to online manually and/or can write
a daemon that would handle all of nuances of kernel in use).

For the rest when Fedora is used in cloud and user increases memory
via management interface of whatever cloud she/he uses, it just works.

So it's choice of distribution to pick its own default that makes
majority of user-base happy and this patch removes it without taking
that in consideration.  
You still can have a udev rule to achive the same thing for
non-ballooning based hotplug.
not in case when system is under load, udev path might be slow
and udev might be killed by OOM leading to permanent disablement
of memory onlining.
 
quoted
How to online memory is different issue not related to this patch,
current default onlining as ZONE_NORMAL works well for scaling
up VMs.

Memory unplug is rather new and it doesn't work reliably so far,
moving onlining to user-space won't really help. Further work
is need to be done so that it would work reliably.  
The main problem I have with this is that this is a limited usecase
driven configuration knob which doesn't work properly for other usecases
(namely movable online once your distribution choses to set the config
option to auto online).
it works for default usecase in Fedora and non-default
movable can be used with
 1) removable memory auto-online as movable in kernel, like
    patch [1] would make movable hotplugged memory
    (when I have time I'll try to work on it)
 2) (or in worst case due to lack of alternative) explicitly
    disabled auto-online on kernel CLI + onlining daemon 
    (since udev isn't working in current kernel due to ordering issue)
There is a userspace solution for this so this
shouldn't have been merged in the first place!
Sorry, currently user-space udev solution doesn't work nor
will it work reliably in extreme conditions.
It sneaked a proper review
process (linux-api wasn't CC to get a broader attenttion) which is
really sad.
get_maintainer.pl doesn't lists linux-api for 31bc3858ea3e,
MAINTAINERS should be fixed if linux-api were to be CCed.
So unless this causes a major regression which would be hard to fix I
will submit the patch for inclusion.
it will be a major regression due to lack of daemon that
could online fast and can't be killed on OOM. So this
clean up patch does break used feature without providing
a viable alternative.

I wouldn't object to removing config option as in this patch
if memory were onlined for x86 by default but that's not the case yet.


[1] https://bugzilla.redhat.com/attachment.cgi?id=1146332

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help