Re: [v6 2/4] mm, oom: cgroup-aware OOM killer
From: Michal Hocko <mhocko@kernel.org>
Date: 2017-08-24 11:47:13
Also in:
linux-mm, lkml
This doesn't apply on top of mmotm cleanly. You are missing http://lkml.kernel.org/r/20170807113839.16695-3-mhocko@kernel.org On Wed 23-08-17 17:51:59, Roman Gushchin wrote:
Traditionally, the OOM killer is operating on a process level. Under oom conditions, it finds a process with the highest oom score and kills it. This behavior doesn't suit well the system with many running containers: 1) There is no fairness between containers. A small container with few large processes will be chosen over a large one with huge number of small processes. 2) Containers often do not expect that some random process inside will be killed. In many cases much safer behavior is to kill all tasks in the container. Traditionally, this was implemented in userspace, but doing it in the kernel has some advantages, especially in a case of a system-wide OOM. 3) Per-process oom_score_adj affects global OOM, so it's a breache in the isolation.
Please explain more. I guess you mean that an untrusted memcg could hide
itself from the global OOM killer by reducing the oom scores? Well you
need CAP_SYS_RESOURCE do reduce the current oom_score{_adj} as David has
already pointed out. I also agree that we absolutely must not kill an
oom disabled task. I am pretty sure somebody is using OOM_SCORE_ADJ_MIN
as a protection from an untrusted SIGKILL and inconsistent state as a
result. Those applications simply shouldn't behave differently in the
global and container contexts.
If nothing else we have to skip OOM_SCORE_ADJ_MIN tasks during the kill.
To address these issues, cgroup-aware OOM killer is introduced. Under OOM conditions, it tries to find the biggest memory consumer, and free memory by killing corresponding task(s). The difference the "traditional" OOM killer is that it can treat memory cgroups as memory consumers as well as single processes. By default, it will look for the biggest leaf cgroup, and kill the largest task inside.
Why? I believe that the semantic should be as simple as kill the largest oom killable entity. And the entity is either a process or a memcg which is marked that way. Why should we mix things and select a memcg to kill a process inside it? More on that below.
But a user can change this behavior by enabling the per-cgroup oom_kill_all_tasks option. If set, it causes the OOM killer treat the whole cgroup as an indivisible memory consumer. In case if it's selected as on OOM victim, all belonging tasks will be killed. Tasks in the root cgroup are treated as independent memory consumers, and are compared with other memory consumers (e.g. leaf cgroups). The root cgroup doesn't support the oom_kill_all_tasks feature.
If anything you wouldn't have to treat the root memcg any special. It will be like any other memcg which doesn't have oom_kill_all_tasks... [...]
+static long memcg_oom_badness(struct mem_cgroup *memcg,
+ const nodemask_t *nodemask)
+{
+ long points = 0;
+ int nid;
+ pg_data_t *pgdat;
+
+ for_each_node_state(nid, N_MEMORY) {
+ if (nodemask && !node_isset(nid, *nodemask))
+ continue;
+
+ points += mem_cgroup_node_nr_lru_pages(memcg, nid,
+ LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
+
+ pgdat = NODE_DATA(nid);
+ points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
+ NR_SLAB_UNRECLAIMABLE);
+ }
+
+ points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
+ (PAGE_SIZE / 1024);
+ points += memcg_page_state(memcg, MEMCG_SOCK);
+ points += memcg_page_state(memcg, MEMCG_SWAP);
+
+ return points;I guess I have asked already and we haven't reached any consensus. I do not like how you treat memcgs and tasks differently. Why cannot we have a memcg score a sum of all its tasks? How do you want to compare memcg score with tasks score? This just smells like the outcome of a weird semantic that you try to select the largest group I have mentioned above. This is a rather fundamental concern and I believe we should find a consensus on it before going any further. I believe that users shouldn't see any difference in the OOM behavior when memcg v2 is used and there is no kill-all memcg. If there is such a memcg then we should treat only those specially. But you might have really strong usecases which haven't been presented or I've missed their importance. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>