Thread: 51 messages, 13 authors, 2012-11-26

Re: [RFC v3 0/3] vmpressure_fd: Linux VM pressure notifications

From: Anton Vorontsov <hidden>
Date: 2012-11-15 07:37:33
Also in: linux-mm, lkml

Hi David,

Thanks again for your inspirational comments!

On Wed, Nov 14, 2012 at 07:59:52PM -0800, David Rientjes wrote:
> I agree that eventfd is the way to go, but I'll also add that this feature
> seems to be implemented at far too coarse a level.  Memory, and hence
> memory pressure, is constrained by several factors other than just the
> amount of physical RAM which vmpressure_fd is addressing.  What about
> memory pressure caused by cpusets or mempolicies?  (Memcg has its own
> reclaim logic
Yes, sure, and my plan for per-cgroup vmpressure was to just add the same
hooks into the cgroup reclaim logic (as far as I understand, we can use the
same scanned/reclaimed ratio + reclaimer priority to determine the
pressure).
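
To make the "scanned/reclaimed ratio + reclaimer priority" idea concrete,
here is a minimal sketch of how a pressure level might be derived from
those inputs. The function names, the level names and all three thresholds
are assumptions made up for illustration; they are not taken from the
patch set.

```c
#include <assert.h>

/* Hypothetical level names and thresholds, for illustration only. */
enum pressure_level { PRESSURE_LOW, PRESSURE_MEDIUM, PRESSURE_CRITICAL };

#define MEDIUM_PERCENT    60  /* assumed: 60% of scanned pages not reclaimed */
#define CRITICAL_PERCENT  95  /* assumed: 95% of scanned pages not reclaimed */
#define CRITICAL_PRIORITY  3  /* assumed: reclaimer priority this low is critical */

/* Share of scanned pages that could not be reclaimed, in percent (0..100). */
static unsigned int pressure_percent(unsigned long long scanned,
                                     unsigned long long reclaimed)
{
    if (scanned == 0)
        return 0;               /* nothing scanned, so no pressure observed */
    if (reclaimed > scanned)
        reclaimed = scanned;    /* clamp so the result stays within 0..100 */
    return (unsigned int)(100 - reclaimed * 100 / scanned);
}

/*
 * Combine the ratio with the reclaimer priority. This sketch assumes that
 * a smaller priority value means the reclaimer is scanning harder.
 */
static enum pressure_level pressure_classify(unsigned long long scanned,
                                             unsigned long long reclaimed,
                                             int priority)
{
    unsigned int percent;

    assert(priority >= 0);
    percent = pressure_percent(scanned, reclaimed);

    if (percent >= CRITICAL_PERCENT || priority <= CRITICAL_PRIORITY)
        return PRESSURE_CRITICAL;
    if (percent >= MEDIUM_PERCENT)
        return PRESSURE_MEDIUM;
    return PRESSURE_LOW;
}
```

The same two functions could in principle be fed from either the global or
the per-cgroup reclaim path, which is the point of reusing the hooks.
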
[Answers reordered]
> Rather, I think it's much better to be notified when an individual process
> invokes various levels of reclaim up to and including the oom killer so
> that we know the context that memory freeing needs to happen (or,
> optionally, the set of processes that could be sacrificed so that this
> higher priority process may allocate memory).
I think I understand what you're saying, and surely it makes sense, but I
don't know how you see this implemented on the API level.

Getting struct {pid, pressure} pairs that cause the pressure at the
moment? And the monitor only gets <pids> that are in the same cpuset? How
about memcg limits?..

[...]
> > But we still want the "global vmpressure" thing, so that we could use it
> > without cgroups too. How to do it -- syscall or sysfs+eventfd doesn't
> > matter much (in the sense that I can do eventfd thing if you folks like it
> > :).
>
> Most processes aren't going to care if they are running into memory
> pressure and have no implementation to free memory back to the kernel or
> start ratelimiting themselves.  They will just continue happily along
> until they get the memory they want or they get oom killed.  The ones that
> do, however, or a job scheduler or monitor that is watching over the
> memory usage of a set of tasks, will be able to do something when
> notified.
Yup, this is exactly how we want to use this. In Android we have the
"Activity Manager", which acts exactly as you describe: it's a task monitor.
> In the hopes of a single API that can do all this and not a
> reimplementation for various types of memory limitations (it seems like
> what you're suggesting is at least three different APIs: system-wide via
> vmpressure_fd, memcg via memcg thresholds, and cpusets through an eventual
> cpuset threshold), I'm hoping that we can have a single interface that can
> be polled on to determine when individual processes are encountering
> memory pressure.  And if I'm not running in your oom cpuset, I don't care
> about your memory pressure.
I'm not sure what exactly you are opposing. :) You don't want to have
three "kinds" of pressure, or you don't want to have three different
interfaces to each of them, or both?
> I don't understand, how would this work with cpusets, for example, with
> vmpressure_fd as defined?  The cpuset policy is embedded in the page
> allocator and skips over zones that are not allowed when trying to find a
> page of the specified order.  Imagine a cpuset bound to a single node that
> is under severe memory pressure.  The reclaim logic will get triggered and
> cause a notification on your fd when the rest of the system's nodes may
> have tons of memory available.
Yes, I see your point: we have many ways to limit resources, which makes
it hard to identify the cause of the "pressure" and thus how to deal with
it, since the pressure might be caused by different kinds of limits, and
freeing memory from one bucket doesn't mean that the memory will be
available to the process that is requesting the memory.
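
The single-node cpuset scenario can be illustrated with a toy model. This
is only a sketch: struct node_state, its fields and the bitmask encoding of
the allowed nodes are assumptions invented for the example and do not
mirror kernel data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Toy per-node state; the fields are illustrative only. */
struct node_state {
    unsigned long free_pages;
    unsigned long min_pages;    /* hypothetical low watermark */
};

/*
 * Return 1 if at least one node permitted by allowed_mask still has free
 * pages above its watermark. Bit i of allowed_mask stands for node i.
 */
static int can_allocate(const struct node_state *nodes, size_t nr_nodes,
                        unsigned long allowed_mask)
{
    size_t nid;

    assert(nr_nodes <= sizeof(allowed_mask) * 8);
    for (nid = 0; nid < nr_nodes; nid++) {
        if (!(allowed_mask & (1UL << nid)))
            continue;           /* node not allowed: skipped, as a cpuset would */
        if (nodes[nid].free_pages > nodes[nid].min_pages)
            return 1;
    }
    return 0;
}

/* A task is "under pressure" here when none of its allowed nodes can serve it. */
static int under_pressure(const struct node_state *nodes, size_t nr_nodes,
                          unsigned long allowed_mask)
{
    return !can_allocate(nodes, nr_nodes, allowed_mask);
}
```

With node 0 exhausted and nodes 1-3 nearly idle, a task confined to node 0
(mask 0x1) is under pressure while a task allowed on all nodes (mask 0xF)
is not, so a single global notification would mislead one of the two.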

So we do want to know whether a specific cpuset is under pressure, whether
a specific memcg is under pressure, or whether the system (and kernel
itself) lacks memory.

And we want to have a single API for this? Heh. :)

The other idea might be this (I'm describing it in detail so that you
can comment on exactly what you don't like about it):

1. Obtain the fd via eventfd();

2. The fd can be passed to these files:

   I) Say /sys/kernel/mm/memory_pressure

      If we don't use cpusets/memcg or even have CGROUPS=n, this will be
      the system-wide (global) memory pressure. Pass the fd to this file
      and start polling.

      If we do use cpusets or memcg, the API will still work, but we have
      two options for its behaviour:

      a) This will only report the pressure when we're reclaiming with
         say (global_reclaim() && node_isset(zone_to_nid(zone),
         current->mems_allowed)) == 1. (Basically, we want to see the
         pressure of kernel slab allocations or any non-soft limits).

      or

      b) If 'filtering' cpusets/memcg seems too hard, we can say that
         these notifications are the "sum" of global+memcg+cpuset. It
         doesn't make sense to actually monitor these, though, so if the
         monitor is aware of cgroups, just 'goto II) and/or III)'.

   II) /sys/fs/cgroup/cpuset/.../cpuset.memory_pressure (yeah, we have
       it already)

      Pass the fd to this file to monitor per-cpuset pressure. So, if you
      get the pressure from here, it makes sense to free resources from
      this cpuset.

   III) /sys/fs/cgroup/memory/.../memory.pressure

      Pass the fd to this file to monitor per-memcg pressure. If you get
      the pressure from here, it only makes sense to free resources from
      this memcg.

3. The pressure level values (and their meaning) and the format of the
   files are the same, and this is what defines the "API".

   So, if the "memory monitor/supervisor app" is aware of cpusets, it
   manages memory at this level. If both cpuset and memcg are used, then
   it has to monitor both files, and act accordingly. And if we don't use
   cpusets/memcg (or even have CGROUPS=n), we can just watch the global
   reclaimer's pressure.
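
From user space, steps 1 and 2 might look roughly like the sketch below.
None of the files above accept an eventfd today, so the registration format
(writing the fd number as decimal text) and the helper names
pressure_eventfd(), pressure_register() and pressure_wait() are pure
assumptions for illustration; only eventfd(), poll(), read() and write()
are existing interfaces.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Step 1: obtain the fd via eventfd(). */
static int pressure_eventfd(void)
{
    return eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
}

/*
 * Step 2: pass the fd to one of the pressure files.
 * ASSUMPTION: the file takes the eventfd number as decimal text.
 * Returns 0 on success, -1 on failure.
 */
static int pressure_register(const char *path, int efd)
{
    char buf[32];
    ssize_t written;
    int fd;
    int len = snprintf(buf, sizeof(buf), "%d", efd);

    assert(len > 0 && (size_t)len < sizeof(buf));
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    written = write(fd, buf, (size_t)len);
    close(fd);
    return written == (ssize_t)len ? 0 : -1;
}

/*
 * Poll the eventfd. Returns 1 and stores the event counter in *count when
 * a notification arrived, 0 on timeout, -1 on error.
 */
static int pressure_wait(int efd, int timeout_ms, uint64_t *count)
{
    struct pollfd pfd = { .fd = efd, .events = POLLIN };
    int rc = poll(&pfd, 1, timeout_ms);

    if (rc <= 0)
        return rc;
    return read(efd, count, sizeof(*count)) == (ssize_t)sizeof(*count) ? 1 : -1;
}
```

A cgroup-aware monitor would call pressure_register() once per file it
cares about (cpuset, memcg, or both), each with its own eventfd, so that
the fd that fires tells it which bucket to free memory from.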

Do I understand correctly that you don't like this? Just to make sure. :)

Thanks,
Anton.