Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection

From: Johannes Weiner <hidden>
Date: 2020-02-27 15:06:27
Also in: linux-mm, lkml

On Thu, Feb 27, 2020 at 02:35:44PM +0100, Michal Koutný wrote:

On Wed, Feb 26, 2020 at 10:05:48AM -0500, Johannes Weiner [off-list ref] wrote:

quoted

I don't see a fundamental difference between them. And that in turn
makes it hard for me to accept that hierarchical inheritance rules
should be different.

I'll try coming up with some better examples for the difference that I
perceive.

quoted

"Wrong" isn't the right term. Is it what you wanted to express in your
configuration?

I want to express an absolute amount of memory (ideally representing
workingset size) under protection.

IIUC, you want to express general relative priorities of B vs C when
some outer metric has to be maintained given you reach both limits of
memory and IO.

It's been our experience that it's basically impossible to control for
memory without having it result in IO contention.

You acknowledge below that this effect may be noticeable in some
situations. It's been our experience, however, that this effect is so
pronounced over a wide variety of workloads and host configurations
that exclusive memory control is not a practical application for
anything but niche cases - if they exist at all.

quoted

You are talking about a mathematical truth on a per-controller
basis. What I'm saying is that I don't see how this is useful for real
workloads, their relative priorities, and the performance expectations
users have from these priorities.

quoted

With a priority inversion like this, there is no actual performance
isolation or containerization going on here - which is the whole point
of cgroups and resource control.

I acknowledge that by pressing too much along one dimension (memory) you
induce expansion in another dimension (IO) and that may become noticeable in
siblings (expansion over saturation [1]). But that's expected when only
weights are in use. If you wanted to hide the effect of workload B from C,
B would need a real limit.

[I beg to disagree that containerization is the whole point of cgroups, it's
a large part of it, hence the isolation needn't necessarily be
bi-directional.]

I said "isolation or containerization", and it really isn't a stretch
to see how the intended isolation can break down in this example.

You could set an IO limit on the scapegoat to keep it from inheriting
the higher IO priority from its parent.

But you could also just set a memory limit on the scapegoat to keep
it from inheriting the higher memory allowance from the parent.

Between all this, I really don't see an argument here to make the
memory hierarchy semantics different from the other controllers.

quoted

My objection is to opting out of protection against cousins (thus
overriding parental resource assignment), not against siblings.

Just to sync up the terminology - I'm calling this protection against
uncles (the composition/structure under them is irrelevant).
And the limitation comes from grandparent or higher (or global).

Yes, either way works.

...and the overridden parental resource assignment is the expansion on
non-memory dimension (IO/CPU).

quoted

Correct, but you can change the tree to this:

     A.low=10G
     `- A1.low=10G
        `- B.low=0G
        `- C.low=0G
     `- D.low=0G

to express

A1 > D
 B = C

That sort of works (if I give up the scapegoat). Although it bothers me
that I have to copy the value from A to A1; I could have done that with
the previous hierarchy and simply set B.low=C.low=10G.

D is still the scapegoat for B and C..?

quoted

That is, I would like to see an argument for this setup:

     A				
     `- B		io.weight=200          memory.low=10G
        `- D		io.weight=100 (e.g.)   memory.low=10G
        `- E		io.weight=100 (e.g.)   memory.low=0
     `- C		io.weight=50           memory.low=5G

Where E has no memory protection against C, but E has IO priority over
C. That's the configuration that cannot be expressed with a recursive
memory.low, but since it involves priority inversions it's not useful
to actually isolate and containerize workloads.
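
To make the priority inversion in that tree concrete, here is a rough
sketch of the IO side only. It assumes a deliberately simplified model
(every cgroup is saturated and io.weight is split proportionally among
siblings at each level); the TREE dict and io_share() helper are
hypothetical names for illustration, not kernel interfaces.

```python
from fractions import Fraction

# Hypothetical encoding of the example tree: name -> (parent, io.weight).
# A is the root of the example and carries no weight of its own.
TREE = {
    "B": ("A", 200),
    "C": ("A", 50),
    "D": ("B", 100),
    "E": ("B", 100),
}


def io_share(name, tree=TREE):
    """Fraction of A's IO that `name` receives if all groups are saturated.

    Simplified model: at each level a group gets weight / sum(sibling
    weights), and the per-level fractions multiply up the hierarchy.
    """
    share = Fraction(1)
    while name in tree:
        parent, weight = tree[name]
        sibling_total = sum(w for p, w in tree.values() if p == parent)
        share *= Fraction(weight, sibling_total)
        name = parent
    return share


if __name__ == "__main__":
    for group in ("C", "D", "E"):
        print(group, io_share(group))
```

Under these assumptions E ends up with 2/5 of A's IO and C with 1/5,
while E has memory.low=0 and C has memory.low=5G: E outranks C on IO
but has no memory protection against it.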

But there can be no cousin (uncle) or more precisely it's the global
rest that we don't mind to affect.

Okay, hold on.

You wouldn't care about starving the rest of the system of IO and
CPU. But the objection to my patch is that you want to give memory
back to avoid undue burden on the rest of the system?

Can we please stop talking about such contrived hypotheticals and
discuss real computer systems that real people actually care about?

quoted

I'd say that protected memory is a disposable resource in contrast with
CPU/IO. If you don't have latter, you don't progress; if you lack the
former, you are refaulting but can make progress. Even more, you should
be able to give up memory.min.

Eh, I'm not buying that. You cannot run without memory either. If
somebody reclaims a page between you faulting it in and you resuming
to userspace, there is no forward progress.

I made a hasty argument (misinterpreting the constant outer reclaim
pressure). So that wasn't the fundamental difference.

The second part -- memory.min is subject to the same calculation as
memory.low. Do you find the scapegoat preventing OOM in a grand-parent
(or higher) subtree also a misfeature/artifact?

What about CPU and IO?

If you knew exactly that the scapegoat doesn't need the memory, you
could set a memory limit on it - just like you could set a limit on
CPU and IO cycles to "give back" resources from inside a tree.

If you don't know exactly how much of the scapegoat's memory is and
isn't needed, the additional paging risk from getting it wrong would
be to the detriment of both your workload and the rest of the system -
your attempt to be good to the rest of the system suddenly turns into
a negative-sum game.

I fundamentally do not understand the practical application of the
configuration you are arguing tooth and nail needs to be supported.

If this is a dealbreaker, surely in a month of discussion and 30+
emails, it should have been possible to come up with *one* example of
a real workload and host configuration for which the ability to
dissent from the hierarchical memory allocation (but oddly, not other
resources) is the *only* way to express working resource isolation.

As it stands, I have provided examples of real workloads and host
configs that can't be expressed with the current semantics. As such, I
would like to move ahead with my changes. They are gated behind a
mount option, so pose no risk to the elusive setups you envision. We
can always implement the inheritance scheme you propose once we have
concrete examples of real life scenarios that aren't otherwise doable,
but there is certainly not enough evidence to make me implement it now
as a condition for merging my patches.
