Re: [v6 2/4] mm, oom: cgroup-aware OOM killer
From: Michal Hocko <mhocko@kernel.org>
Date: 2017-08-24 12:58:17
Also in:
linux-mm, lkml
On Thu 24-08-17 13:28:46, Roman Gushchin wrote:
> Hi Michal!
>
> On Thu, Aug 24, 2017 at 01:47:06PM +0200, Michal Hocko wrote:
> > This doesn't apply on top of mmotm cleanly. You are missing
> > http://lkml.kernel.org/r/20170807113839.16695-3-mhocko@kernel.org
>
> I'll rebase. Thanks!
>
> > On Wed 23-08-17 17:51:59, Roman Gushchin wrote:
> > > Traditionally, the OOM killer is operating on a process level.
> > > Under oom conditions, it finds a process with the highest oom score
> > > and kills it.
> > >
> > > This behavior doesn't suit well the system with many running
> > > containers:
> > >
> > > 1) There is no fairness between containers. A small container with
> > > few large processes will be chosen over a large one with huge
> > > number of small processes.
> > >
> > > 2) Containers often do not expect that some random process inside
> > > will be killed. In many cases much safer behavior is to kill
> > > all tasks in the container. Traditionally, this was implemented
> > > in userspace, but doing it in the kernel has some advantages,
> > > especially in a case of a system-wide OOM.
> > >
> > > 3) Per-process oom_score_adj affects global OOM, so it's a breach
> > > in the isolation.
> >
> > Please explain more. I guess you mean that an untrusted memcg could
> > hide itself from the global OOM killer by reducing the oom scores?
> > Well you need CAP_SYS_RESOURCE to reduce the current
> > oom_score{_adj} as David has already pointed out. I also agree that
> > we absolutely must not kill an oom disabled task. I am pretty sure
> > somebody is using OOM_SCORE_ADJ_MIN as a protection from an
> > untrusted SIGKILL and inconsistent state as a result. Those
> > applications simply shouldn't behave differently in the global and
> > container contexts.
>
> The main point of the kill_all option is to clean up the victim cgroup
> _completely_. If some tasks can survive, that means userspace should
> take care of them, look at the cgroup after oom, and kill the survivors
> manually.
>
> If you want to rely on OOM_SCORE_ADJ_MIN, don't set kill_all.
> I really don't get the usecase for this "kill all, except this and that".

OOM_SCORE_ADJ_MIN has become a de facto contract. You cannot simply expect that somebody would alter a specific workload for a container just to be safe against an unexpected SIGKILL. kill-all might be set higher up in the memcg hierarchy, which is out of your control.

> Also, it's really confusing to respect -1000 value, and completely
> ignore -999.
>
> I believe that any complex userspace OOM handling should use
> memory.high and handle memory shortage before an actual OOM.
>
> > If nothing else we have to skip OOM_SCORE_ADJ_MIN tasks during the
> > kill.

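To make the "skip OOM_SCORE_ADJ_MIN tasks during the kill" point concrete, here is a minimal userspace sketch, not kernel code. The struct toy_task type and toy_kill_all() are hypothetical names invented for illustration; only the OOM_SCORE_ADJ_MIN value (-1000) matches the kernel's definition. It only shows that a kill-all pass could leave OOM-disabled tasks alone while still treating -999 as killable.

```c
#include <assert.h>
#include <stddef.h>

/* Same value as the kernel's OOM_SCORE_ADJ_MIN. */
#define OOM_SCORE_ADJ_MIN (-1000)

/* Hypothetical userspace stand-in for a task; not the kernel's task_struct. */
struct toy_task {
	int pid;
	int oom_score_adj;	/* valid range is -1000..1000 */
	int killed;		/* set to 1 when the toy kill-all picks the task */
};

/*
 * Toy "kill all" pass over one cgroup's tasks: every task is marked as
 * killed except those that opted out with OOM_SCORE_ADJ_MIN.
 * Returns the number of tasks marked.
 */
static size_t toy_kill_all(struct toy_task *tasks, size_t nr)
{
	size_t i, killed = 0;

	assert(tasks != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		if (tasks[i].oom_score_adj == OOM_SCORE_ADJ_MIN)
			continue;	/* OOM-disabled: never a victim */
		tasks[i].killed = 1;
		killed++;
	}
	return killed;
}
```

In this model a task with oom_score_adj of -999 is still killed; only the exact -1000 value acts as the opt-out.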
> > > To address these issues, cgroup-aware OOM killer is introduced.
> > >
> > > Under OOM conditions, it tries to find the biggest memory
> > > consumer, and free memory by killing corresponding task(s). The
> > > difference from the "traditional" OOM killer is that it can treat
> > > memory cgroups as memory consumers as well as single processes.
> > >
> > > By default, it will look for the biggest leaf cgroup, and kill the
> > > largest task inside.
> >
> > Why? I believe that the semantic should be as simple as kill the
> > largest oom killable entity. And the entity is either a process or
> > a memcg which is marked that way.
>
> So, you still need to compare memcgroups and processes.
>
> In my case, it's more like an exception (only processes from root
> memcg, and only if there are no eligible cgroups with lower
> oom_priority). You suggest to rely on this comparison.
>
> > Why should we mix things and select a memcg to kill a process
> > inside it? More on that below.
>
> To have some sort of "fairness" in a containerized environment.
> Say, 1 cgroup with 1 big task, another cgroup with many smaller tasks.
> It's not necessarily true that the first one is a better victim.

There is nothing like a "better victim". We are pretty much in a catastrophic situation when we try to survive by killing userspace. We try to kill the largest because that assumes that we return the most memory from it. Now I do understand that you want to treat the memcg as a single killable entity but I find it really questionable to do a per-memcg metric and then not treat it like that and kill only a single task. Just imagine a single memcg with zillions of tasks, each very small, and you select it as the largest while killing a single small task doesn't help to get us out of the OOM.
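
The mismatch described above can be shown with a toy userspace model. This is only a sketch: struct toy_memcg and both helpers are hypothetical names, and a "memcg" here is nothing more than an array of per-task sizes in pages. It compares the per-memcg metric used for selection with what killing one task in the selected group actually returns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a memcg is just an array of per-task sizes in pages. */
struct toy_memcg {
	const long *task_pages;
	size_t nr_tasks;
};

/* Per-memcg metric: the sum of all task sizes in the group. */
static long toy_memcg_total(const struct toy_memcg *cg)
{
	long sum = 0;
	size_t i;

	assert(cg != NULL);
	for (i = 0; i < cg->nr_tasks; i++)
		sum += cg->task_pages[i];
	return sum;
}

/* Pages returned when only the largest task of the group is killed. */
static long toy_freed_by_single_kill(const struct toy_memcg *cg)
{
	long max = 0;
	size_t i;

	assert(cg != NULL);
	for (i = 0; i < cg->nr_tasks; i++)
		if (cg->task_pages[i] > max)
			max = cg->task_pages[i];
	return max;
}
```

With one memcg holding a single 1000-page task and another holding 200 tasks of 10 pages each, the second group has the larger total (2000 pages), yet killing its largest task returns only 10 pages.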

> > > But a user can change this behavior by enabling the per-cgroup
> > > oom_kill_all_tasks option. If set, it causes the OOM killer to
> > > treat the whole cgroup as an indivisible memory consumer. In case
> > > if it's selected as an OOM victim, all belonging tasks will be
> > > killed.
> > >
> > > Tasks in the root cgroup are treated as independent memory
> > > consumers, and are compared with other memory consumers (e.g.
> > > leaf cgroups). The root cgroup doesn't support the
> > > oom_kill_all_tasks feature.
> >
> > If anything you wouldn't have to treat the root memcg any special.
> > It will be like any other memcg which doesn't have
> > oom_kill_all_tasks...
> >
> > [...]
> >
> > > +static long memcg_oom_badness(struct mem_cgroup *memcg,
> > > +			      const nodemask_t *nodemask)
> > > +{
> > > +	long points = 0;
> > > +	int nid;
> > > +	pg_data_t *pgdat;
> > > +
> > > +	for_each_node_state(nid, N_MEMORY) {
> > > +		if (nodemask && !node_isset(nid, *nodemask))
> > > +			continue;
> > > +
> > > +		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> > > +				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> > > +
> > > +		pgdat = NODE_DATA(nid);
> > > +		points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> > > +					    NR_SLAB_UNRECLAIMABLE);
> > > +	}
> > > +
> > > +	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> > > +		(PAGE_SIZE / 1024);
> > > +	points += memcg_page_state(memcg, MEMCG_SOCK);
> > > +	points += memcg_page_state(memcg, MEMCG_SWAP);
> > > +
> > > +	return points;
> >
> > I guess I have asked already and we haven't reached any consensus.
> > I do not like how you treat memcgs and tasks differently. Why
> > cannot we have a memcg score as a sum of all its tasks?
>
> It sounds like a more expensive way to get almost the same with less
> accuracy. Why is it better?

Because then you are comparing apples to apples? Besides that, you have to check each task for over-killing anyway. So I do not see any performance merits here.
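
A rough userspace sketch of the "apples to apples" idea follows. It is not the patch's implementation and not kernel code: struct toy_task, struct toy_entity, and both functions are hypothetical, and the per-task badness values are assumed to be computed elsewhere. A killable entity is either a single task or a whole memcg, and in both cases its score is the sum of the same per-task numbers, so the two kinds can be compared directly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-task record; badness is assumed to be precomputed. */
struct toy_task {
	long badness;
};

/*
 * Hypothetical killable entity: a single task (nr_tasks == 1) or a memcg
 * marked for kill-all (nr_tasks > 1). Both are scored the same way.
 */
struct toy_entity {
	const struct toy_task *tasks;
	size_t nr_tasks;
};

/* Entity score: the plain sum of the per-task scores. */
static long toy_entity_score(const struct toy_entity *e)
{
	long sum = 0;
	size_t i;

	assert(e != NULL);
	for (i = 0; i < e->nr_tasks; i++)
		sum += e->tasks[i].badness;
	return sum;
}

/* Pick the entity with the highest score, or NULL if none scores above 0. */
static const struct toy_entity *
toy_select_victim(const struct toy_entity *ents, size_t nr)
{
	const struct toy_entity *victim = NULL;
	long best = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		long score = toy_entity_score(&ents[i]);

		if (score > best) {
			best = score;
			victim = &ents[i];
		}
	}
	return victim;
}
```

Because every entity is walked task by task, the same loop is also the natural place for any per-task eligibility check.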

> > How do you want to compare memcg score with tasks score?
>
> I have to do it for tasks in root cgroups, but it shouldn't be a
> common case.

How come? I can easily imagine a setup where only some memcgs really do need a kill-all semantic while all others can live with a single task killed perfectly fine.

--
Michal Hocko
SUSE Labs