Thread: 51 messages, 13 authors, 2012-11-26

Re: [RFC v3 0/3] vmpressure_fd: Linux VM pressure notifications

From: Glauber Costa <hidden>
Date: 2012-11-16 09:33:58
Also in: linux-mm, lkml

On 11/16/2012 01:25 AM, David Rientjes wrote:
On Thu, 15 Nov 2012, Anton Vorontsov wrote:
quoted
Hehe, you're saying that we have to have cgroups=y. :) But some folks were
deliberately asking us to make the cgroups optional.
Enabling just CONFIG_CGROUPS (which is enabled by default) and no other 
current cgroups increases the size of the kernel text by less than 0.3% 
with x86_64 defconfig:

   text	   data	    bss	    dec	    hex	filename
10330039	1038912	1118208	12487159	 be89f7	vmlinux.disabled
10360993	1041624	1122304	12524921	 bf1d79	vmlinux.enabled
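The percentages can be rechecked from the `size` output above. This is only a quick arithmetic sketch; the numbers are copied from the two rows, and the helper name `pct_increase` is made up for illustration.

```python
# Section sizes in bytes, copied from the `size` rows above.
disabled = {"text": 10330039, "data": 1038912, "bss": 1118208}
enabled = {"text": 10360993, "data": 1041624, "bss": 1122304}


def pct_increase(before, after):
    """Relative growth from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Growth of the text segment alone, which is what the "< 0.3%" claim refers to.
text_growth = pct_increase(disabled["text"], enabled["text"])

# Growth of text + data + bss together (the "dec" column).
total_growth = pct_increase(sum(disabled.values()), sum(enabled.values()))

print(f"text  grows by {enabled['text'] - disabled['text']} bytes ({text_growth:.3f}%)")
print(f"total grows by {sum(enabled.values()) - sum(disabled.values())} bytes ({total_growth:.3f}%)")
```

The text segment stays just under 0.3%, while the combined text + data + bss figure lands slightly above it.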

I understand that users with minimally-enabled configs for an optimized 
memory footprint will have a higher percentage because their kernel is 
already smaller (~1.8% increase for allnoconfig), but I think the cost of 
enabling the cgroups code to be able to mount a vmpressure cgroup (which 
I'd rename to be "mempressure" to be consistent with "memcg" but it's only 
an opinion) is relatively small and allows for a much more maintainable 
and extendable feature to be included: it already provides the 
cgroup.event_control interface, which supports eventfd and makes the 
implementation much easier.  It also makes writing a library on top of the 
cgroup much easier because of the standardization.

I'm more concerned about what to do with the memcg memory thresholds and 
whether they can be replaced with this new cgroup.  If so, then we'll have 
to figure out how to map those triggers to use the new cgroup's interface 
in a way that doesn't break current users that open and pass the fd of 
memory.usage_in_bytes to cgroup.event_control for memcg.
quoted
OK, here is what I can try to do:

- Implement memory pressure cgroup as you described, by doing so we'd make
  the thing play well with cpusets and memcg;

- This will be eventfd()-based;
Should be based on cgroup.event_control, see how memcg interfaces its 
memory thresholds with this in Documentation/cgroups/memory.txt.
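For readers who have not used it, the existing memcg threshold registration referred to here looks roughly like the sketch below. It follows the "<event_fd> <fd of memory.usage_in_bytes> <threshold>" format described in Documentation/cgroups/memory.txt; the function names and the cgroup directory argument are illustrative assumptions, and `os.eventfd` needs Python 3.10 or later on Linux.

```python
import os


def threshold_registration(event_fd, usage_fd, threshold_bytes):
    """Build the line written to cgroup.event_control:
    "<event_fd> <fd of memory.usage_in_bytes> <threshold>"."""
    return f"{event_fd} {usage_fd} {threshold_bytes}"


def wait_for_threshold(cgroup_dir, threshold_bytes):
    """Block until memory usage of the group crosses threshold_bytes.

    cgroup_dir is a hypothetical path to a mounted memcg group,
    e.g. a directory containing memory.usage_in_bytes.
    """
    efd = os.eventfd(0)  # Python 3.10+, Linux only
    usage_fd = os.open(os.path.join(cgroup_dir, "memory.usage_in_bytes"), os.O_RDONLY)
    try:
        with open(os.path.join(cgroup_dir, "cgroup.event_control"), "w") as ctl:
            ctl.write(threshold_registration(efd, usage_fd, threshold_bytes))
        # The kernel signals the eventfd when the threshold is crossed;
        # the read returns the eventfd counter value.
        return os.eventfd_read(efd)
    finally:
        os.close(usage_fd)
        os.close(efd)
```

A pressure cgroup built on the same file would only need a different third field (a level name instead of a byte count, for instance), which is what makes a common library on top of it plausible.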
quoted
- Once done, we will have a solution for pretty much every major use-case
  (i.e. servers, desktops and Android, they all have cgroups enabled);
Excellent!  I'd be interested in hearing anybody else's opinions, 
especially those from the memcg world, so we make sure that everybody is 
happy with the API that you've described.
Just CC'd them all.

My personal take:

Most people hate memcg due to the cost it imposes. I've already
demonstrated that with some effort, it doesn't necessarily have to be
so. (http://lwn.net/Articles/517634/)

The one thing missing from that work was precisely notifications. If you
can come up with a good notification scheme that *lives* in memcg, but
does not *depend* on the memcg infrastructure, I personally think it
could be a big win.

Doing this in memcg has the advantage that the "per-group" vs "global"
is automatically solved, since the root memcg is just another name for
"global".

I honestly like your low/high/oom scheme better than memcg's
"threshold-in-bytes". I would also point out that those thresholds are
*far* from exact, due to the stock charging mechanism, and can be wrong
by as much as O(#cpus). So far, nobody has complained. So in theory it
should be possible to convert memcg to low/high/oom while still
accepting writes in bytes, which would be thrown into the closest bucket.
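To make the two points above concrete, here is a rough sketch. Everything numeric in it is an assumption for illustration: the 4 KiB page size, the per-CPU batch of 32 pages (what I recall the memcg stock using), and above all the bucket positions, which nobody in this thread has defined.

```python
PAGE_SIZE = 4096    # assumption: 4 KiB pages
CHARGE_BATCH = 32   # assumption: pages each CPU may hold in its stock


def stock_slack_bytes(ncpus):
    """Worst-case amount by which a byte threshold can be off, O(#cpus)."""
    return ncpus * CHARGE_BATCH * PAGE_SIZE


# Hypothetical bucket positions, as a fraction of the group's limit.
BUCKETS = {"low": 0.50, "high": 0.80, "oom": 1.00}


def closest_bucket(threshold_bytes, limit_bytes):
    """Map a threshold written in bytes to the nearest low/high/oom bucket."""
    frac = threshold_bytes / limit_bytes
    return min(BUCKETS, key=lambda name: abs(BUCKETS[name] - frac))


limit = 1 << 30  # 1 GiB group limit, for the example only
print(stock_slack_bytes(64))                 # slack on a 64-CPU machine
print(closest_bucket(900 * (1 << 20), limit))
```

With inexactness of that size already tolerated, rounding a byte value to a bucket loses little that current users could have relied on.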

Another thing from one of your e-mails, that may shift you in the memcg
direction:

"2. The last time I checked, cgroups memory controller did not (and I
guess still does not) account kernel-owned slabs. I asked several
times why so, but nobody answered."

It should, now, in the latest -mm, although it won't do per-group
reclaim (yet).

I am also failing to see how cpusets would be involved here. I
understand that you may have free memory in terms of size, but still be
further restricted by cpuset. But I also think that having multiple
entry points for this buys us nothing at all. So the choices I see are:

1) If cpuset + memcg are co-mounted, take this into account when deciding
low / high / oom. This is yet another advantage over the "threshold in
bytes" interface: you can transparently take other issues into account
while keeping the interface.

2) If they are not, just ignore this effect.

The fallback in 2) sounds harsh, but I honestly think this is the price
to pay for the insanity of mounting those things in different
hierarchies, and we do have a plan to have all those things eventually
together anyway. If you have two cgroups dealing with memory, and set
them up in orthogonal ways, I really can't see how we can bring sanity
to that. So just admitting and unleashing the insanity may be better, if
it brings up our urge to fix it. It worked for Batman, why wouldn't it
work for us?
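For what it is worth, telling cases 1) and 2) apart from userspace is cheap. The sketch below parses text in the /proc/self/cgroup format ("hierarchy-ID:controller-list:path"), where co-mounted controllers share one line; the function name and the sample strings are made up for illustration.

```python
def comounted(proc_cgroup_text, a="cpuset", b="memory"):
    """Return True if controllers a and b are attached to the same hierarchy.

    proc_cgroup_text is the content of /proc/self/cgroup, one line per
    hierarchy in the form "hierarchy-ID:controller-list:path".
    """
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue  # skip malformed or empty lines
        controllers = set(parts[1].split(","))
        if a in controllers and b in controllers:
            return True
    return False


# Illustrative samples, not taken from a real machine.
together = "3:cpuset,memory:/group1\n2:cpu,cpuacct:/\n"
separate = "4:cpuset:/a\n3:memory:/b\n2:cpu,cpuacct:/\n"

print(comounted(together))
print(comounted(separate))
```

On a live system the input would come from reading /proc/self/cgroup; in case 2) the cpuset restriction would simply be ignored.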