Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
From: Michal Koutný <hidden>
Date: 2020-02-27 13:35:53
Also in:
linux-mm, lkml
TL;DR I see merit in the recursive propagation if it's requested explicitly (i.e. retaining the meaning of 0). The protection/weight semantics should be refined.

On Wed, Feb 26, 2020 at 10:05:48AM -0500, Johannes Weiner [off-list ref] wrote:
They still ultimately translate to real resources. The concrete value depends on what the parent's weight translates to, and it depends on sibling configurations and their current consumption. (All of this is already true for memory protection as well, btw). But eventually, a weight specification translates to actual time on a CPU, bandwidth on an IO device etc.
- sum of sibling weights is meaningless (and independent from parent weight)
Technically true for overcommitted memory.low values as well.
Yes, but for overcommitted only. For pure weights it doesn't matter if you set 1:10, 10:100 or 100:1000. For the protection, however, it has this behavior only when approaching infinity, and as the sum compares to the parent's value, the protection behaves differently. [If there had to be some pure memory weights, those would for instance express relative affinity of a group's pages to physical memory.]
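To illustrate the distinction drawn above, here is a minimal sketch in Python. It is not the kernel's calculation: `weight_shares` and `effective_low` are hypothetical helpers, the parent value of 110G is made up, and the protection model is deliberately simplified (it ignores usage and only scales sibling values down proportionally once their sum exceeds the parent's value).

```python
from fractions import Fraction

def weight_shares(weights):
    """Relative share of each sibling under a pure weight model."""
    total = sum(weights)
    return [Fraction(w, total) for w in weights]

def effective_low(parent_low, lows):
    """Simplified sibling protection (assumption, ignores usage):
    values are absolute while their sum fits into the parent's value,
    and are scaled down proportionally once they overcommit it."""
    total = sum(lows)
    if total <= parent_low:
        return list(lows)
    return [Fraction(parent_low * low, total) for low in lows]

pairs = [(1, 10), (10, 100), (100, 1000)]

# Pure weights: only the ratio matters, all three settings are identical.
for pair in pairs:
    print("weights", pair, weight_shares(pair))

# Protection in GiB under a hypothetical parent with 110G:
# the first two settings stay absolute and differ, only the
# overcommitted third one degenerates into a ratio.
for pair in pairs:
    print("low    ", pair, effective_low(110, pair))
```

In this model 1:10 and 10:100 protect different absolute amounts, while 100:1000 is overcommitted against the parent and ends up equivalent to 10:100, which is the weight-like behavior only in the overcommitted case.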
I don't see a fundamental difference between them. And that in turn makes it hard for me to accept that hierarchical inheritance rules should be different.
I'll try coming up with some better examples for the difference that I perceive.
"Wrong" isn't the right term. Is it what you wanted to express in your configuration?
I want to express an absolute amount of memory (ideally representing workingset size) under protection. IIUC, you want to express general relative priorities of B vs C when some outer metric has to be maintained given you reach both limits of memory and IO.
You are talking about a mathematical truth on a per-controller basis. What I'm saying is that I don't see how this is useful for real workloads, their relative priorities, and the performance expectations users have from these priorities.
With a priority inversion like this, there is no actual performance isolation or containerization going on here - which is the whole point of cgroups and resource control.
I acknowledge that by pressing too much along one dimension (memory) you induce expansion in another dimension (IO) and that may become noticeable in siblings (expansion over saturation [1]). But that's expected when only weights are in use. If you wanted to hide the effect of workload B from C, B would need a real limit. [I beg to disagree that containerization is the whole point of cgroups; it's a large part of it, hence the isolation needn't necessarily be bi-directional.]
My objection is to opting out of protection against cousins (thus overriding parental resource assignment), not against siblings.
Just to sync up the terminology - I'm calling this protection against uncles (the composition/structure under them is irrelevant). And the limitation comes from the grandparent or higher (or global). ...and the overridden parental resource assignment is the expansion in a non-memory dimension (IO/CPU).
Correct, but you can change the tree to this:
  A.low=10G
  `- A1.low=10G
     `- B.low=0G
     `- C.low=0G
  `- D.low=0G
to express
A1 > D
B = C
That sort of works (if I give up the scapegoat). Although I have trouble with having to copy the value from A to A1; I could have done that with the previous hierarchy and simply set B.low=C.low=10G.
That is, I would like to see an argument for this setup:
  A
  `- B io.weight=200 memory.low=10G
     `- D io.weight=100 (e.g.) memory.low=10G
     `- E io.weight=100 (e.g.) memory.low=0
  `- C io.weight=50 memory.low=5G
Where E has no memory protection against C, but E has IO priority over
C. That's the configuration that cannot be expressed with a recursive
memory.low, but since it involves priority inversions it's not useful
to actually isolate and containerize workloads.
But there can be no cousin (uncle) or more precisely it's the global rest that we don't mind to affect.
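The IO side of the quoted setup can be worked out with a short sketch. It assumes that all groups are active and saturating the device, and that io.weight is split proportionally among siblings at each level. `PARENT`, `IO_WEIGHT` and `io_share` are hypothetical names used only for this illustration.

```python
from fractions import Fraction

# Hypothetical encoding of the quoted tree (A is the root of the example).
PARENT = {"B": "A", "C": "A", "D": "B", "E": "B"}
IO_WEIGHT = {"B": 200, "C": 50, "D": 100, "E": 100}

def io_share(name):
    """Fraction of A's IO that `name` gets when every group is contending."""
    if name not in PARENT:  # reached the root of the example
        return Fraction(1)
    parent = PARENT[name]
    siblings = [n for n, p in PARENT.items() if p == parent]
    total = sum(IO_WEIGHT[s] for s in siblings)
    return io_share(parent) * Fraction(IO_WEIGHT[name], total)

print(io_share("E"), io_share("C"))  # prints "2/5 1/5"
```

Under these assumptions E ends up with twice C's IO share while having memory.low=0 against C's 5G, which is the combination the quoted text calls a priority inversion.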
I'd say that protected memory is a disposable resource in contrast with CPU/IO. If you don't have latter, you don't progress; if you lack the former, you are refaulting but can make progress. Even more, you should be able to give up memory.min.
Eh, I'm not buying that. You cannot run without memory either. If somebody reclaims a page between you faulting it in and you resuming to userspace, there is no forward progress.
I made a hasty argument (misinterpreting the constant outer reclaim pressure). So that wasn't the fundamental difference. The second part -- memory.min is subject to the same calculation as memory.low. Do you find the scapegoat preventing OOM in the grand-parent (or higher) subtree also a misfeature/artifact?

Thanks,
Michal

[1] Here I'm taking your/Tejun's assumption that in stressful situations it always boils down to IO, although I don't have any quantitative arguments for that.