Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
From: Michal Hocko <hidden>
Date: 2021-10-19 14:13:29
Also in:
linux-mm, lkml
On Tue 19-10-21 16:26:50, Vasily Averin wrote:
On 19.10.2021 15:04, Michal Hocko wrote:quoted
On Tue 19-10-21 13:54:42, Michal Hocko wrote:quoted
On Tue 19-10-21 13:30:06, Vasily Averin wrote:quoted
On 19.10.2021 11:49, Michal Hocko wrote:quoted
On Tue 19-10-21 09:30:18, Vasily Averin wrote: [...]quoted
With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail: a) due to fatal signal b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing) To handle a) we can follow to your suggestion and skip excution of out_of_memory() in pagefault_out_of memory() To handle b) we can go to retry: if mem_cgroup_oom() return OOM_FAILED.quoted
How is b) possible without current being killed? Do we allow remote charging?out_of_memory for memcg_oom select_bad_process mem_cgroup_scan_tasks oom_evaluate_task oom_badness /* * Do not even consider tasks which are explicitly marked oom * unkillable or have been already oom reaped or the are in * the middle of vfork */ adj = (long)p->signal->oom_score_adj; if (adj == OOM_SCORE_ADJ_MIN || test_bit(MMF_OOM_SKIP, &p->mm->flags) || in_vfork(p)) { task_unlock(p); return LONG_MIN; } This time we handle userspace page fault, so we cannot be kenrel thread, and cannot be in_vfork(). However task can be marked as oom unkillable, i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MINYou are right. I am not sure there is a way out of this though. The task can only retry for ever in this case. There is nothing actionable here. We cannot kill the task and there is no other way to release the memory.Btw. don't we force the charge in that case?We should force charge for allocation from inside page fault handler, to prevent endless cycle in retried page faults. However we should not do it for allocations from task context, to prevent memcg-limited vmalloc-eaters from to consume all host memory.
I don't see a big difference between those two. Because the #PF could
result into the very same situation depleting all the memory by
overcharging. A different behavior just leads to a confusion and
unexpected behavior. E.g. in the past we only triggered memcg OOM killer
from the #PF path and failed the charge otherwise. That is something
different but it shows problems we haven't anticipated and had user
visible problems. See 29ef680ae7c2 ("memcg, oom: move out_of_memory back
to the charge path").
quoted hunk ↗ jump to hunk
Also I would like to return to the following hunk.@@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, * A few threads which were not waiting at mutex_lock_killable() can * fail to bail out. Therefore, check again after holding oom_lock. */ - ret = should_force_charge() || out_of_memory(&oc); + ret = task_is_dying() || out_of_memory(&oc); unlock: mutex_unlock(&oom_lock);Now I think it's better to keep task_is_dying() check here. if task is dying, it is not necessary to push other task to free the memory. We broke vmalloc cycle already, so it looks like nothing should prevent us from returning to userspace, handle fatal signal, exit and free the memory.
That patch has to be discuss in its full length. There were other details I have brought up AFAIU. -- Michal Hocko SUSE Labs