Re: [RFC 0/3] Implementation of cgroup isolation

From: Zhu Yanhai <hidden>
Date: 2011-03-29 13:16:21
Also in: lkml

Michal,
Maybe what we need here is some kind of trade-off?
Let's say a new configuable parameter reserve_limit, for the cgroups
which want to
have some guarantee in the memory resource, we have:

limit_in_bytes > soft_limit > reserve_limit

MEM[limit_in_bytes..soft_limit] are the bytes that I'm willing to contribute
to the others if they are short of memory.

MEM[soft_limit..reserve_limit] are the bytes that I can afford if the others
are still eager for memory after I gave them MEM[limit_in_bytes..soft_limit].

MEM[reserve_limit..0] are the bytes which is a must for me to guarantee QoS.
Nobody is allowed to steal them.

And reserve_limit is 0 by default for the cgroups who don't care about Qos.

Then the reclaim path also needs some changes, i.e, balance_pgdat():
1) call mem_cgroup_soft_limit_reclaim(), if nr_reclaimed is meet, goto finish.
2) shrink the global LRU list, and skip the pages which belong to the cgroup
who have set a reserve_limit. if nr_reclaimed is meet, goto finish.
3) shrink the cgroups who have set a reserve_limit, and leave them with only
the reserve_limit bytes they need. if nr_reclaimed is meet, goto finish.
4) OOM

Does it make sense?

Thanks,
Zhu Yanhai


2011/3/29 Michal Hocko [off-list ref]:

On Tue 29-03-11 18:41:19, KAMEZAWA Hiroyuki wrote:

quoted

On Tue, 29 Mar 2011 10:59:43 +0200
Michal Hocko [off-list ref] wrote:

quoted

On Tue 29-03-11 16:51:17, KAMEZAWA Hiroyuki wrote:

[...]

quoted

My opinions is to enhance softlimit is better.

I will look how softlimit can be enhanced to match the expectations but
I'm kind of suspicious it can handle workloads where heuristics simply
cannot guess that the resident memory is important even though it wasn't
touched for a long time.

I think we recommend mlock() or hugepagefs to pin application's work area
in usual. And mm guyes have did hardwork to work mm better even without
memory cgroup under realisitic workloads.

Agreed. Whenever this approach is possible we recomend the same thing.

quoted

If your worload is realistic but _important_ anonymous memory is swapped out,
it's problem of global VM rather than memcg.

I would disagree with you on that. The important thing is that it can be
defined from many perspectives. One is the kernel which considers long
unused memory as not _that_ important. And it makes a perfect sense for
most workloads.
An important memory for an application can be something that would
considerably increase the latency just because the memory got paged out
(be it swap or the storage) because it contains pre-computed
data that have a big initial costs.
As you can see there is no mention about the time from the application
POV because it can depend on the incoming requests which you cannot
control.

quoted

If you add 'isolate' per process, okay, I'll agree to add isolate per memcg.

What do you mean by isolate per process?

[...]

quoted

OK, I have tried to explain that in one of the (2nd) patch description.
If I move all task from the root group to other group(s) and keep the
primary application in the root group I would achieve some isolation as
well. That is very much true.

Okay, then, current works well.

quoted

But then there is only one such a group.

I can't catch what you mean. you can create limitless cgroup, anywhere.
Can't you ?

This is not about limits. This is about global vs. per-cgroup reclaim
and how much they interact together.

The everything-in-groups approach with the "primary" service in the root
group (or call it unlimited) works just because all the memory activity
(but the primary service) is caped with the limits so the rest of the
memory can be used by the service. Moreover, in order this to work the
limit for other groups would be smaller then the working set of the
primary service.

Even if you created a limitless group for other important service they
would still interact together and if one goes wild the other would
suffer from that.

.........I can't understad what is the problem when global reclaim
runs just because an application wasn't limited ...or memory are
overcomitted.

I am not sure I understand but what I see as a problem is when unrelated
memory activity triggers reclaim and it pushes out the memory of a
process group just because the heuristics done by the reclaim algorithm
do not pick up the right memory - and honestly, no heuristic will fit
all requirements. Isolation can protect from an unrelated activity
without new heuristics.

[...]

quoted

If softlimit (after some improvement) isn't enough, please add some other.

What I think of is

1. need to "guarantee" memory usages in future.
   "first come, first served" is not good for admins.

this is not in scope of these patchsets but I agree that it would be
nice to have this guarantee

quoted

2. need to handle zone memory shortage. Using memory migration
   between zones will be necessary to avoid pageout.

I am not sure I understand.

quoted

3. need a knob to say "please reclaim from my own cgroup rather than
   affecting others (if usage > some(soft)limit)."

Isn't this handled already and enhanced by the per-cgroup background
reclaim patches?

quoted

[...]

quoted

I think you should put tasks in root cgroup to somewhere. It works perfect
against OOM. And if memory are hidden by isolation, OOM will happen easier.

Why do you think that it would happen easier? Isn't it similar (from OOM
POV) as if somebody mlocked that memory?

if global lru scan cannot find victim memory, oom happens.

Yes, but this will happen with mlocked memory as well, right?

Yes, of course.

Anyway, I'll Nack to simple "first come, first served" isolation.
Please implement garantee, which is reliable and admin can use safely.

Isolation is not about future guarantee. It is rather after you have it
you can rely it will stay in unless in-group activity pushes it out.

quoted

mlock() has similar problem, So, I recommend hugetlbfs to customers,
admin can schedule it at boot time.
(the number of users of hugetlbfs is tend to be one app. (oracle))

What if we decide that hugetlbfs won't be pinned into memory in future?

quoted

I'll be absent, tomorrow.

I think you'll come LSF/MM summit and from the schedule, you'll have
a joint session with Ying as "Memcg LRU management and isolation".

I didn't have plans to do a session actively, but I can certainly join
to talk and will be happy to discuss this topic.

quoted

IIUC, "LRU management" is a google's performance improvement topic.

It's ok for me to talk only about 'isolation'  1st in earlier session.
If you want, please ask James to move session and overlay 1st memory
cgroup session. (I think you saw e-mail from James.)

Yeah, I can do that.

Thanks
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help