Re: [RFC] Add mempressure cgroup

From: Anton Vorontsov <hidden>
Date: 2012-11-29 04:20:54
Also in: lkml

Hello Michal,

Thanks a lot for taking a look into this!

On Wed, Nov 28, 2012 at 05:29:24PM +0100, Michal Hocko wrote:

On Wed 28-11-12 02:29:08, Anton Vorontsov wrote:

quoted

This is an attempt to implement David Rientjes' idea of mempressure
cgroup.

The main characteristics are the same as what I've tried to add to the
vmevent API:

  Internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
  pressure index calculation. But we don't expose the index to the
  userland. Instead, there are three levels of the pressure:

  o low (just reclaiming, e.g. caches are draining);
  o medium (allocation cost becomes high, e.g. swapping);
  o oom (about to oom very soon).

  The rationale behind exposing levels and not the raw pressure index
  is described here: http://lkml.org/lkml/2012/11/16/675

The API uses standard cgroups eventfd notifications:

  $ gcc Documentation/cgroups/cgroup_event_listener.c -o \
	cgroup_event_listener
  $ cd /sys/fs/cgroup/
  $ mkdir mempressure
  $ mount -t cgroup cgroup ./mempressure -o mempressure
  $ cd mempressure
  $ cgroup_event_listener ./mempressure.level low
  ("low", "medium", "oom" are permitted values.)

  Upon hitting the threshold, you should see "/sys/fs/cgroup/mempressure
  low: crossed" messages.
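
For readers without cgroup_event_listener at hand, the registration it performs can be sketched roughly as below. This is only a minimal sketch, assuming the cgroup v1 convention of writing "<event_fd> <control_fd> <args>" to cgroup.event_control. The helper names format_event_control() and wait_for_level() are hypothetical, and mempressure.level exists only with this RFC applied.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical helper: build the "<event_fd> <control_fd> <args>" line. */
int format_event_control(char *buf, size_t len, int efd, int cfd,
			 const char *level)
{
	int n = snprintf(buf, len, "%d %d %s", efd, cfd, level);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Hypothetical helper: block until the given level is signalled once.
 * Paths are relative to the cgroup directory, assumed to be the cwd.
 * Returns 0 on success, -1 on any error.
 */
int wait_for_level(const char *level)
{
	char line[64];
	uint64_t count;
	int ret = -1;
	int efd = eventfd(0, 0);
	int cfd = open("mempressure.level", O_RDONLY);
	int ecfd = open("cgroup.event_control", O_WRONLY);
	int n = format_event_control(line, sizeof(line), efd, cfd, level);

	/* Register the eventfd, then wait for the kernel to signal it. */
	if (efd >= 0 && cfd >= 0 && ecfd >= 0 && n > 0 &&
	    write(ecfd, line, n + 1) == n + 1 &&
	    read(efd, &count, sizeof(count)) == (ssize_t)sizeof(count))
		ret = 0;

	if (ecfd >= 0)
		close(ecfd);
	if (cfd >= 0)
		close(cfd);
	if (efd >= 0)
		close(efd);
	return ret;
}
```

Calling wait_for_level("low") from inside the mempressure cgroup directory would then return once the "low" threshold is crossed, under the assumptions above.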

To test that it actually works on a per-cgroup basis, I did a small trick: I
moved all kswapd threads into a separate cgroup, and hooked the listener onto
another (non-root) cgroup. The listener no longer received global reclaim
pressure, which is expected.

Is this really expected? So you want to be notified only about the
direct reclaim?

I didn't try to put much meaning into assigning a task to non-global
reclaim watchers, I just mentioned this as the easiest way to test that we
can actually account things on a per-thread basis. :)

I am not sure how useful that is. If you co-mount with e.g. memcg then
the picture is different because even global memory pressure is spread
among groups so it would be just a matter of the proper accounting
(which can be handled similar to lruvec when your code doesn't have to
care about memcg internally).
Co-mounting with cpusets makes sense as well because then you get a
pressure notification based on the placement policy.

So does it make much sense to mount mempressure on its own without
co-mounting with other controllers?

Android does not actually need any of these (memcg or cpusets), but we
still want to get notifications (a root cgroup would be enough for us,
but I'm trying to make things generic, of course).

quoted

It is possible for a task to be in cpusets, memcg and mempressure
cgroups at the same time, so by rearranging the tasks it should be
possible to watch a specific pressure.

Could you be more specific about what you mean by rearranging? Creating
the same hierarchy? Co-mounting?

quoted

Note that while this adds the cgroups support, the code is well separated
and eventually we might add a lightweight, non-cgroups API, i.e. vmevent.
But this is another story.

I think it would be nice to follow freezer and split this into 2 files:
generic and cgroup specific.

Yeah, this is surely an option, but so far it's only a few hundred lines
of code, plus we don't have any other users for the "internals". So, for
the time being, I'd rather keep it in one file.

quoted

Signed-off-by: Anton Vorontsov <redacted>
---

[...]

quoted

+/* These are defaults. Might make them configurable one day. */
+static const uint vmpressure_win = SWAP_CLUSTER_MAX * 16;

I realize this is just an RFC but could you be more specific about what
vmpressure_win means?

Sure, let me just copy the text from the previous RFC, to which you were
not Cc'ed:

 When the system is short on idle pages, the new memory is allocated by
 reclaiming least recently used resources: kernel scans pages to be
 reclaimed (e.g. from file caches, mmap(2) volatile ranges, etc.; and
 potentially swapping some pages out). The index shows the relative time
 spent by the kernel uselessly scanning pages, or, in other words, the
 percentage of scans of pages (vmpressure_window) that were not reclaimed.
 ...
 Window size is used as a rate-limit tunable for VMPRESSURE_LOW
 notifications and for averaging for VMPRESSURE_{MEDIUM,OOM} levels. So,
 using small window sizes can cause a lot of false positives for _MEDIUM
 and _OOM levels, but too big a window size may delay notifications. By
 default the window size equals 256 pages (1MB).

You can find more about the tunables in the previous RFC:

	http://lkml.org/lkml/2012/11/7/169
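
To make the rate-limiting role of the window a bit more concrete, here is a rough sketch of the accounting described above: counters accumulate until one window's worth of pages has been scanned, and only then is a percentage produced. This is not the patch's code. struct pressure_window and window_account() are hypothetical names, the clamp is an added safety assumption, and the window size is left as a parameter.

```c
#include <stdbool.h>

/* Hypothetical per-group accumulator; not the structure used in the patch. */
struct pressure_window {
	unsigned int win;       /* window size, in scanned pages */
	unsigned int scanned;   /* pages scanned in the current window */
	unsigned int reclaimed; /* pages reclaimed in the current window */
};

/*
 * Account one reclaim pass. Returns true and stores the percentage of
 * scanned-but-not-reclaimed pages in *pressure once a full window has
 * been scanned; returns false while the window is still filling up.
 */
bool window_account(struct pressure_window *w, unsigned int scanned,
		    unsigned int reclaimed, unsigned int *pressure)
{
	w->scanned += scanned;
	w->reclaimed += reclaimed;

	/* The window acts as the rate limit: no result until it is full. */
	if (!w->scanned || w->scanned < w->win)
		return false;

	/* Assumption: clamp so the result stays within 0..100. */
	if (w->reclaimed > w->scanned)
		w->reclaimed = w->scanned;

	*pressure = 100 - (unsigned int)(100ULL * w->reclaimed / w->scanned);

	/* Start a fresh window. */
	w->scanned = 0;
	w->reclaimed = 0;
	return true;
}
```

With win set to 256, a pass of 100 scanned / 10 reclaimed produces nothing yet, and a following pass of 200 scanned / 20 reclaimed completes the window and yields 90.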

quoted

+static const uint vmpressure_level_med = 60;
+static const uint vmpressure_level_oom = 99;
+static const uint vmpressure_level_oom_prio = 4;
+
+enum vmpressure_levels {
+	VMPRESSURE_LOW = 0,
+	VMPRESSURE_MEDIUM,
+	VMPRESSURE_OOM,
+	VMPRESSURE_NUM_LEVELS,
+};
+
+static const char const *vmpressure_str_levels[] = {
+	[VMPRESSURE_LOW] = "low",
+	[VMPRESSURE_MEDIUM] = "medium",
+	[VMPRESSURE_OOM] = "oom",
+};
+
+static enum vmpressure_levels vmpressure_level(uint pressure)
+{
+	if (pressure >= vmpressure_level_oom)
+		return VMPRESSURE_OOM;
+	else if (pressure >= vmpressure_level_med)
+		return VMPRESSURE_MEDIUM;
+	return VMPRESSURE_LOW;
+}
+
+static ulong vmpressure_calc_level(uint win, uint s, uint r)
+{
+	ulong p;
+
+	if (!s)
+		return 0;
+
+	/*
+	 * We calculate the ratio (in percents) of how many pages were
+	 * scanned vs. reclaimed in a given time frame (window). Note that
+	 * time is in VM reclaimer's "ticks", i.e. number of pages
+	 * scanned. This makes it possible to set desired reaction time
+	 * and serves as a ratelimit.
+	 */
+	p = win - (r * win / s);
+	p = p * 100 / win;

Do we need the win at all?
	p = 100 - (100 * r / s);

Other than for me being pedantic, pretty much no. :) It's just less
"precise" (try s=1000, r=9). (But in return, my version is prone to
misbehave when the window is too large.)
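
The two variants can be compared side by side with a small userspace sketch. The function names pressure_windowed() and pressure_direct() are made up for illustration, and uint32_t stands in for the kernel's uint so that the wrap-around is explicit (assuming a platform where unsigned int is 32 bits).

```c
#include <stdint.h>

/*
 * Windowed form from the patch. win must be non-zero.
 * r * win is evaluated in 32-bit unsigned arithmetic and may wrap.
 */
unsigned long pressure_windowed(uint32_t win, uint32_t s, uint32_t r)
{
	unsigned long p;

	if (!s)
		return 0;

	p = win - (r * win / s);
	return p * 100 / win;
}

/* Direct form suggested in the review: no window involved. */
unsigned long pressure_direct(uint32_t s, uint32_t r)
{
	if (!s)
		return 0;

	return 100 - (100 * r / s);
}
```

With win=512 (SWAP_CLUSTER_MAX * 16, assuming SWAP_CLUSTER_MAX is 32), s=1000 and r=9, the windowed form gives 99 while the direct form rounds up to 100; the exact value is 99.1. On the other hand, with win=1<<24, s=1000 and r=500 the product r * win wraps at 32 bits, and the windowed form gives 75 where the direct form gives the expected 50.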

quoted

+
+	pr_debug("%s: %3lu  (s: %6u  r: %6u)\n", __func__, p, s, r);
+
+	return vmpressure_level(p);
+}
+

[...]

quoted

+static int mpc_pre_destroy(struct cgroup *cg)
+{
+	struct mpc_state *mpc = cg2mpc(cg);
+	int ret = 0;
+
+	mutex_lock(&mpc->lock);
+
+	if (mpc->eventfd)
+		ret = -EBUSY;

The current cgroup core doesn't allow pre_destroy to fail anymore. The
code is marked for 3.8.

Sure, I can rebase. (Currently, the code is based on v3.7-rc6, which
isn't even released but seems way too old already, heh. :)

[...]

quoted

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 48550c6..430d8a5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c

@@ -1877,6 +1877,8 @@ restart:
 		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
 				   sc, LRU_ACTIVE_ANON);
 
+	vmpressure(sc->nr_scanned - nr_scanned, nr_reclaimed);
+

I think this should already report to a proper group, otherwise all the
global reclaim would go to the group where kswapd sits rather than to the
target group as I mentioned above (so it at least wouldn't work with
co-mounted cases).

Um. Yeah, I guess I was too optimistic here, relying on things to
"just work". I guess I still need to pass the memcg pointer to
vmpressure() and check if the process is also part of the
sc->target_mem_cgroup.

quoted

 	/* reclaim/compaction might need reclaim to continue */
 	if (should_continue_reclaim(lruvec, nr_reclaimed,
 				    sc->nr_scanned - nr_scanned, sc))

@@ -2099,6 +2101,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		count_vm_event(ALLOCSTALL);
 
 	do {
+		vmpressure_prio(sc->priority);

Shouldn't this go into shrink_lruvec or somewhere at that level to catch
also kswapd low priorities? If you insist on the direct reclaim then you
should hook into __zone_reclaim as well.

Probably... Thanks for pointing that out, I'll take a closer look once we
resolve the global/design issues.

Thanks!
Anton.

