Thread (35 messages) 35 messages, 9 authors, 2017-03-15

Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

From: Igor Mammedov <hidden>
Date: 2017-03-07 12:42:07
Also in: linux-acpi, linux-mm, linux-s390, lkml

On Mon, 6 Mar 2017 15:54:17 +0100
Michal Hocko [off-list ref] wrote:
On Fri 03-03-17 18:34:22, Igor Mammedov wrote:
quoted
On Fri, 3 Mar 2017 09:27:23 +0100
Michal Hocko [off-list ref] wrote:
  
quoted
On Thu 02-03-17 18:03:15, Igor Mammedov wrote:  
quoted
On Thu, 2 Mar 2017 15:28:16 +0100
Michal Hocko [off-list ref] wrote:
    
quoted
On Thu 02-03-17 14:53:48, Igor Mammedov wrote:
[...]
quoted
quoted
quoted
quoted
memblocks. If that doesn't work I would call it a bug.    
It's rather an implementation constrain than a bug
for details and workaround patch see
 [1] https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7    
"You are not authorized to access bug #1314306"  
Sorry,
I've made it public, related comments and patch should be accessible now
(code snippets in BZ are based on older kernel but logic is still the same upstream)
   
quoted
could you paste the reasoning here please?  
sure here is reproducer:
start VM with CLI:
  qemu-system-x86_64  -enable-kvm -m size=1G,slots=2,maxmem=4G -numa node \
  -object memory-backend-ram,id=m1,size=1G -device pc-dimm,node=0,memdev=m1 \
  /path/to/guest_image

then in guest dimm1 blocks are from 32-39

  echo online_movable > /sys/devices/system/memory/memory32/state
-bash: echo: write error: Invalid argument

in current mainline kernel it triggers following code path:

online_pages()
  ...
       if (online_type == MMOP_ONLINE_KERNEL) {                                 
                if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))    
                        return -EINVAL;  
Are you sure? I would expect MMOP_ONLINE_MOVABLE here
pretty much, reproducer is above so try and see for yourself

[...]
[...]
quoted
quoted
quoted
quoted
quoted
Which means simple udev rule isn't usable since it gets event from
the first to the last hotplugged block order. So now we would have
to write a daemon that would
 - watch for all blocks in hotplugged memory appear (how would it know)
 - online them in right order (order might also be different depending
   on kernel version)
   -- it becomes even more complicated in NUMA case when there are
      multiple zones and kernel would have to provide user-space
      with information about zone maps

In short current experience shows that userspace approach
 - doesn't solve issues that Vitaly has been fixing (i.e. onlining
   fast and/or under memory pressure) when udev (or something else
   might be killed)      
yeah and that is why the patch does the onlining from the kernel.    
onlining in this patch is limited to hyperv and patch breaks
auto-online on x86 kvm/vmware/baremetal as they reuse the same
hotplug path.    
Those can use the udev or do you see any reason why they couldn't?  
Reasons are above, under >>>> and >> quotations, patch breaks
what Vitaly's fixed (including kvm/vmware usecases) i.e. udev/some
user-space process could be killed if hotplugged memory isn't onlined
fast enough leading to service termination and/or memory not
being onlined at all (if udev is killed)  
OK, so from the discussion so far I have learned that this would be
problem _only_ if we are trying to hotplug a _lot_ of memory at once
(~1.5% of the online memory is needed).  I am kind of skeptical this is
a reasonable usecase. Who is going to hotadd 8G to 256M machine (which
would eat half of the available memory which is still quite far from
OOM)? Even if the memory balloning uses hotplug then such a grow sounds
a bit excessive.
Slow and killable udev issue doesn't really depends on
amount of hotplugged memory since it's onlined in blocks
(128M for x64). Considering that it's currently onlined
as zone normal, kernel doesn't have any issues adding more
follow up blocks of memory.
quoted
Currently udev rule is not usable and one needs a daemon
which would correctly do onlining and keep zone balance
even for simple case usecase of 1 normal and 1 movable zone.
And it gets more complicated in case of multiple numa nodes
with multiple zones.  
That sounds to be more related to the current implementation than
anything else and as such is not a reason to invent specific user
visible api. Btw. you are talking about movable zones byt the auto
onlining doesn't allow to auto online movable memory. So either I miss
your point or I am utterly confused.
in current state neither does udev rule as memory is onlined
as NORMAL (x64 variant at least), which is the same as auto online
does now.

We are discussing 2 different issues here and thread got pretty
hard to follow. I'll try to sum up at results the end.
[...]
quoted
quoted
quoted
Memory unplug is rather new and it doesn't work reliably so far,
moving onlining to user-space won't really help. Further work
is need to be done so that it would work reliably.    
The main problem I have with this is that this is a limited usecase
driven configuration knob which doesn't work properly for other usecases
(namely movable online once your distribution choses to set the config
option to auto online).  
it works for default usecase in Fedora and non-default
movable can be used with
 1) removable memory auto-online as movable in kernel, like
    patch [1] would make movable hotplugged memory
    (when I have time I'll try to work on it)
 2) (or in worst case due to lack of alternative) explicitly
    disabled auto-online on kernel CLI + onlining daemon 
    (since udev isn't working in current kernel due to ordering issue)  
So I fail to see how this can work. Say the default will auto online
all the added memory. This will be all in Zone Normal because the auto
onlining doesn't do online_movable. Now I would like to online the
memory as movable but that might fail because some kernel objects might
be already sitting there so the offline would fail.
same happens with udev rule as currently it can online from
the first to last hotplugged block only as Zone Normal.
see
  arch/x86/mm/init_64.c:
      acpi_memory_enable_device() -> add_memory -> add_memory_resource

both auto online and udev could be fixed by patch
https://bugzilla.redhat.com/attachment.cgi?id=1146332
to online reomovable memory as movable.
quoted
quoted
There is a userspace solution for this so this
shouldn't have been merged in the first place!  
Sorry, currently user-space udev solution doesn't work nor
will it work reliably in extreme conditions.
  
quoted
It sneaked a proper review
process (linux-api wasn't CC to get a broader attenttion) which is
really sad.  
get_maintainer.pl doesn't lists linux-api for 31bc3858ea3e,
MAINTAINERS should be fixed if linux-api were to be CCed.  
user visible APIs _should_ be discussed at this mailing list regardless
what get_maintainer.pl says. This is not about who is the maintainer but
about getting as wide audience for things that would have to be
maintained basically for ever.
How would random contributor know which list to CC?
quoted
quoted
So unless this causes a major regression which would be hard to fix I
will submit the patch for inclusion.  
it will be a major regression due to lack of daemon that
could online fast and can't be killed on OOM. So this
clean up patch does break used feature without providing
a viable alternative.  
So let's discuss the current memory hotplug shortcomings and get rid of
the crud which developed on top. I will start by splitting up the patch
into 3 parts. Do the auto online thing from the HyperV and xen balloning
drivers and dropping the config option and finally drop the sysfs knob.
The last patch might be NAKed and I can live with that as long as the
reasoning is proper and there is a general consensus on that.
PS: CC me on that patches too

It's major regression if you remove auto online in kernels that
run on top of x86 kvm/vmware hypervisors, making API cleanups
while breaking useful functionality doesn't make sense.

I would ACK config option removal if auto online keeps working
for all x86 hypervisors (hyperv/xen isn't the only who needs it)
and keep kernel CLI option to override default.

That doesn't mean that others will agree with flipping default,
that's why config option has been added.

Now to sum up what's been discussed on this thread, there were 2
different issues discussed:
  1) memory hotplug: remove in kernel auto online for all
                     except of hyperv/xen

       - suggested RFC is not acceptable from virt point of view
         as it regresses guests on top of x86 kvm/vmware which
         both use ACPI based memory hotplug.

       - udev/userspace solution doesn't work in practice as it's
         too slow and unreliable when system is under load which
         is quite common in virt usecase. That's why auto online
         has been introduced in the first place.

  2) memory unplug: online memory as movable

       - doesn't work currently with udev rule due to kernel
         issues https://bugzilla.redhat.com/show_bug.cgi?id=1314306#c7

       - could be fixed both for in kernel auto online and udev
         with following patch:
         https://bugzilla.redhat.com/attachment.cgi?id=1146332
         but fixing it this way exposes zone disbalance issues,
         which are not present in current kernel as blocks are
         onlined in Zone Normal. So this is area to work and
         improve on.

       - currently if one wants to use online_movable,
         one has to either
           * disable auto online in kernel OR
           * remove udev rule that distro ships
         AND write custom daemon that will be able to online
         block in right zone/order. So currently whole
         online_movable thing isn't working by default
         regardless of who onlines memory.

         I'm in favor of implementing that in kernel as it keeps
         kernel internals inside kernel and doesn't need
         kernel API to be involved (memory blocks in sysfs,
         online_kernel, online_movable)
         There would be no need in userspace which would have to
         deal with kernel zoo and maintain that as well.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help