Re: memcg writeback (was Re: [Lsf-pc] [LSF/MM TOPIC] memcg topics.)

From: Ying Han <hidden>
Date: 2012-02-13 18:40:41

On Thu, Feb 9, 2012 at 5:50 AM, Wu Fengguang [off-list ref] wrote:

On Wed, Feb 08, 2012 at 12:54:33PM -0800, Ying Han wrote:

On Wed, Feb 8, 2012 at 1:31 AM, Wu Fengguang [off-list ref] wrote:

On Tue, Feb 07, 2012 at 11:55:05PM -0800, Greg Thelen wrote:

On Fri, Feb 3, 2012 at 1:40 AM, Wu Fengguang [off-list ref] wrote:

If dirty pages are moved out of the memcg to the 20% global dirty page
pool on page reclaim, the above OOM can be avoided. This does change the
meaning of memory.limit_in_bytes in that the memcg tasks can now
actually consume more pages (up to the shared global 20% dirty limit).

This seems like an easy change, but unfortunately the global 20% pool
has some shortcomings for my needs:

1. the global 20% pool is not moderated.  One cgroup can dominate it
    and deny service to other cgroups.

It is moderated by balance_dirty_pages() -- in terms of dirty ratelimit.
And you have the freedom to control the bandwidth allocation with some
async write I/O controller.

Even though there is no direct control of dirty pages, we can roughly
get it as the side effect of rate control. Given

       ratelimit_cgroup_A = 2 * ratelimit_cgroup_B

There will naturally be more dirty pages for cgroup A to be worked by
the flusher. And the dirty pages will be roughly balanced around

       nr_dirty_cgroup_A = 2 * nr_dirty_cgroup_B

when writeout bandwidths for their dirty pages are equal.
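
The proportionality above can be illustrated with a small calculation.
This is only a sketch under a stated assumption, not a description of how
balance_dirty_pages() works internally: if each cgroup dirties pages at
its ratelimit and every dirty page waits roughly the same time before the
flusher writes it out, Little's law gives nr_dirty = ratelimit * wait time.
The function name and the numbers below are hypothetical.

```python
def steady_state_dirty_pages(ratelimits, wait_seconds):
    """Estimate steady-state dirty pages per cgroup via Little's law.

    ratelimits:   dict of cgroup name -> dirtying rate in pages/second.
    wait_seconds: assumed common time a dirty page waits before writeout
                  (an assumption of this sketch, not a kernel parameter).
    """
    return {name: rate * wait_seconds for name, rate in ratelimits.items()}


# Hypothetical numbers: cgroup A may dirty pages twice as fast as cgroup B.
dirty = steady_state_dirty_pages({"A": 2000, "B": 1000}, wait_seconds=5)
print(dirty)
```

With equal wait times the 2:1 ratio of the ratelimits carries over to the
dirty page counts, whatever the actual wait time is.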

2. the global 20% pool is free, unaccounted memory.  Ideally cgroups only
    use the amount of memory specified in their memory.limit_in_bytes.  The
    goal is to sell portions of a system.  Global resources like the 20%
    pool are an undesirable system-wide tax that's shared by jobs that may
    not even perform buffered writes.

Right, that is a shortcoming.

3. Setting aside 20% extra memory for system wide dirty buffers is a lot of
    memory.  This becomes a larger issue when the global dirty_ratio is
    higher than 20%.

Yeah, the global pool scheme does mean that you'd better allocate at
most 80% of memory to individual memory cgroups; otherwise it's possible
for a tiny memcg doing dd writes to push dirty pages to the global LRU
and *squeeze* the size of other memcgs.

However I guess it should be mitigated by the fact that

- we typically already reserve some space for the root memcg

Can you give more details on that? AFAIK, we don't treat the root cgroup
differently from other sub-cgroups, except that the root cgroup doesn't
have a limit.

OK. I'd imagine this to be the typical usage for desktops and quite a
few servers: a few cgroups are employed to limit the resource usage
for selected tasks (such as backups, background GUI tasks, cron tasks,
etc.). These systems are still running mainly in the global context.

The use case makes sense, but I'm still not sure about the "reservation
for root" part.

Other tasks not running under cgroups run in the global context, as
you said. However, there is no memory limit for the root cgroup, and
it will only trigger global reclaim when running short of memory.
That doesn't sound like a straightforward configuration for
environments that badly need memory isolation. The worst part is the
unpredictability: we have no control over how many
dirty-and-later-clean pages are leaked to root and stay there.

--Ying

In general, I don't like the idea of a shared pool in root for all the
dirty pages.

Imagine a system that has nothing running under root, where every
application runs within a sub-cgroup. It is easy to track and limit each
cgroup's memory usage, but not the pages being moved to root. We have
been experiencing difficulties tracking pages being re-parented to
root, and this will make it even harder.

So you want to push memcg allocations to the hardware limits. This is
a worthwhile target for cloud servers that run a number of well
contained jobs.

I guess it can be achieved reasonably well with the global shared
dirty pool.  Let's discuss the two major cases.

1) no change of behavior

For example, the system memory is divided equally among 10 cgroups,
each running 1 dd. In this case, the dirty pages will be contained
within the memcg LRUs. Page reclaim rarely encounters any dirty pages.
There is no moving to the global LRU, so no side effect at all.

2) small memcg squeezing other memcg(s)

Suppose system memory is divided into 1 small memcg A and 1 large memcg B,
each running a dd task. In this case the dirty pages from A will be
moved to the global LRU, and global page reclaim will be triggered.

In the end it will be balanced around

- global LRU: 10% memory (which is A's dirty pages)
- memcg B: 90% memory
- memcg A: a tiny ignorable fraction of memory

Now job B uses 10% less memory than it would without the global dirty
pool scheme. I guess this is bad for some types of jobs.

However, my question is: will the typical demand be more flexible?
Something like a "minimal" and "recommended" setup: "this job
requires at least XXX memory and runs better with YYY memory", rather
than some fixed size memory allocation.

The minimal requirement should be trivially satisfied by adding a
memcg watermark that protects the memcg LRU from being reclaimed
once its size drops below the watermark.

Then the cloud server could be configured to

       sum(memcg.limit_in_bytes) / memtotal = 100%
       sum(memcg.minimal_size)   / memtotal < 100% - dirty_ratio

Which makes a simple and flexibly partitioned system.
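
The two conditions above can be checked mechanically. The following is a
minimal sketch, assuming each memcg is described by its
memory.limit_in_bytes and by the "minimal size" watermark proposed here;
that watermark is hypothetical and not an existing memcg knob, and the
function name and the sizes are made up for illustration.

```python
def check_partitioning(memtotal, dirty_ratio_percent, cgroups):
    """Check the two proposed partitioning constraints.

    memtotal:            total system memory in bytes.
    dirty_ratio_percent: global dirty ratio as an integer percentage.
    cgroups:             list of (limit_in_bytes, minimal_size) tuples;
                         minimal_size is the hypothetical watermark.
    Returns (limits_ok, minimal_ok).
    """
    total_limit = sum(limit for limit, _ in cgroups)
    total_minimal = sum(minimal for _, minimal in cgroups)
    # sum(memcg.limit_in_bytes) / memtotal = 100%
    limits_ok = total_limit == memtotal
    # sum(memcg.minimal_size) / memtotal < 100% - dirty_ratio
    # (integer arithmetic avoids floating-point rounding)
    minimal_ok = total_minimal * 100 < memtotal * (100 - dirty_ratio_percent)
    return limits_ok, minimal_ok


GiB = 1 << 30
# Hypothetical 64 GiB machine, 20% dirty ratio, two jobs.
result = check_partitioning(
    64 * GiB, 20, [(48 * GiB, 36 * GiB), (16 * GiB, 8 * GiB)]
)
print(result)
```

Here the limits add up to all of memory while the minimal sizes (44 GiB)
stay below the 80% left after the dirty pool, so both checks pass.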

Thanks,
Fengguang

- 20% dirty ratio is mostly overkill for large memory systems.
 It's often enough for them to hold 10-30s worth of dirty data, which
 is 1-3GB for one 100MB/s disk. This is the reason vm.dirty_bytes was
 introduced: some users want a dirty ratio below 1%.
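
The 1-3GB figure is simply disk bandwidth times the number of seconds of
dirty data to hold. A small sketch of that arithmetic, assuming decimal
megabytes; the helper name is hypothetical:

```python
MB = 1000 * 1000  # decimal megabyte, matching the "100MB/s" disk figure


def dirty_bytes_for(disk_bandwidth_bytes_per_s, seconds_of_dirty_data):
    """Bytes of dirty data that keep the disk busy for the given time."""
    return disk_bandwidth_bytes_per_s * seconds_of_dirty_data


low = dirty_bytes_for(100 * MB, 10)   # 10s worth of dirty data
high = dirty_bytes_for(100 * MB, 30)  # 30s worth of dirty data
print(low, high)
```

A value in this range is the kind of number one would put in
vm.dirty_bytes instead of relying on a percentage of total memory.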

Thanks,
Fengguang
