Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
From: Petr Mladek <pmladek@suse.com>
Date: 2016-12-12 11:49:07
On Mon 2016-12-12 10:07:03, Michal Hocko wrote:
> On Sat 10-12-16 20:24:57, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 09-12-16 23:23:10, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > On Thu 08-12-16 00:29:26, Tetsuo Handa wrote:
> > > > > > Michal Hocko wrote:
> > > > > > > On Tue 06-12-16 19:33:59, Tetsuo Handa wrote:
> > > > > > > > If the OOM killer is invoked when many threads are
> > > > > > > > looping inside the page allocator, it is possible that
> > > > > > > > the OOM killer is preempted by other threads.
> > > > > > >
> > > > > > > Hmm, the only way I can see this would happen is when the
> > > > > > > task which actually manages to take the lock is not
> > > > > > > invoking the OOM killer for whatever reason. Is this what
> > > > > > > happens in your case? Are you able to trigger this
> > > > > > > reliably?
> > > > > >
> > > > > > Regarding
> > > > > > http://I-love.SAKURA.ne.jp/tmp/serial-20161206.txt.xz ,
> > > > > > somebody called oom_kill_process() and reached
> > > > > >
> > > > > >   pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
> > > > > >
> > > > > > line but did not reach
> > > > > >
> > > > > >   pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
> > > > > >
> > > > > > line within tolerable delay.
> > > > >
> > > > > I would be really interested in that. This can happen only if
> > > > > find_lock_task_mm fails. This would mean that either we are
> > > > > selecting a child without mm or the selected victim has no mm
> > > > > anymore. Both cases should be ephemeral because oom_badness
> > > > > will rule those tasks on the next round. So the primary
> > > > > question here is why no other task has hit out_of_memory.
> > > >
> > > > This can also happen due to AB-BA livelock (oom_lock v.s.
> > > > console_sem).
> > >
> > > Care to explain how would that livelock look like?
> >
> > Two types of threads (Thread-1 which is holding oom_lock, Thread-2
> > which is not holding oom_lock) are doing memory allocation. Since
> > oom_lock is a mutex, there can be only 1 instance for Thread-1. But
> > there can be multiple instances for Thread-2.
> >
> > (1) Thread-1 enters out_of_memory() because it is holding oom_lock.
> >
> > (2) Thread-1 enters printk() due to
> >
> >     pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", ...);
> >
> >     in oom_kill_process().
> >
> > (3) vprintk_func() is mapped to vprintk_default() because Thread-1
> >     is not inside NMI handler.
> >
> > (4) In vprintk_emit(), in_sched == false because loglevel for
> >     pr_err() is not LOGLEVEL_SCHED.
> >
> > (5) Thread-1 calls log_store() via log_output() from vprintk_emit().
> >
> > (6) Thread-1 calls console_trylock() because in_sched == false.
> >
> > (7) Thread-1 acquires console_sem via down_trylock_console_sem().
> >
> > (8) In console_trylock(), console_may_schedule is set to true
> >     because Thread-1 is in sleepable context.
> >
> > (9) Thread-1 calls console_unlock() because console_trylock()
> >     succeeded.
> >
> > (9) In console_unlock(), pending data stored by log_store() are
> >     printed to consoles. Since there may be slow consoles,
> >     cond_resched() is called if possible. And since
> >     console_may_schedule == true because Thread-1 is in sleepable
> >     context, Thread-1 may be scheduled at console_unlock().
> >
> > (10) Thread-2 tries to acquire oom_lock but it fails because
> >      Thread-1 is holding oom_lock.
> >
> > (11) Thread-2 enters warn_alloc() because it is waiting for Thread-1
> >      to return from oom_kill_process().
> >
> > (12) Thread-2 enters printk() due to
> >
> >      warn_alloc(gfp_mask, "page allocation stalls for %ums, order:%u", ...);
> >
> >      in __alloc_pages_slowpath().
> >
> > (13) vprintk_func() is mapped to vprintk_default() because Thread-2
> >      is not inside NMI handler.
> >
> > (14) In vprintk_emit(), in_sched == false because loglevel for
> >      pr_err() is not LOGLEVEL_SCHED.
> >
> > (15) Thread-2 calls log_store() via log_output() from
> >      vprintk_emit().
> >
> > (16) Thread-2 calls console_trylock() because in_sched == false.
> >
> > (17) Thread-2 fails to acquire console_sem via
> >      down_trylock_console_sem().
> >
> > (18) Thread-2 returns from vprintk_emit().
> >
> > (19) Thread-2 leaves warn_alloc().
> >
> > (20) When Thread-1 is waken up, it finds new data appended by
> >      Thread-2.
> >
> > (21) Thread-1 remains inside console_unlock() with oom_lock still
> >      held because there is data which should be printed to consoles.
> >
> > (22) Thread-2 remains failing to acquire oom_lock, periodically
> >      appending new data via warn_alloc(), and failing to acquire
> >      oom_lock.
> >
> > (23) The user visible result is that Thread-1 is unable to return
> >      from
> >
> >      pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", ...);
> >
> >      in oom_kill_process().
>
> OK, I see. This is not a new problem though and people are trying to
> solve it in the printk proper. CCed some people, I do not have links
> to those threads handy. And if this is really the problem here then we
> definitely shouldn't put hacks into the page allocator path to handle
> it because the other sources of the printk flood might be arbitrary.

Yup, this is exactly the type of problem that we want to solve with
the async printk.
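
To make the dynamics of the scenario quoted above concrete, here is a toy userspace model of its last steps, where the console_sem owner keeps finding new data. It is only a sketch and not kernel code: console_unlock_rounds(), its parameters, and the fixed per-round rates are assumptions made for illustration. One round stands for one window in which the owner is scheduled out, the waiters append `produce` records, and the owner then prints up to `drain` records.

```c
/*
 * Toy model of a console_sem owner that must flush everything in the
 * log buffer before it may leave console_unlock().
 *
 * backlog:    records already stored when the owner starts flushing
 * produce:    records appended by other threads in each round
 *             (hypothetical fixed rate)
 * drain:      records the owner manages to print in each round
 *             (hypothetical fixed rate)
 * max_rounds: give up after this many rounds
 *
 * Returns the round in which the buffer became empty, or -1 if the
 * owner was still flushing after max_rounds rounds.
 */
static long console_unlock_rounds(unsigned long backlog,
				  unsigned long produce,
				  unsigned long drain,
				  long max_rounds)
{
	long round;

	for (round = 1; round <= max_rounds; round++) {
		/* Waiters fail the trylock but still store their messages. */
		backlog += produce;

		/* The owner prints what it can before the next reschedule. */
		backlog -= (backlog < drain) ? backlog : drain;

		if (backlog == 0)
			return round;
	}

	return -1;
}
```

In this model the owner never gets out when produce >= drain (the function gives up and returns -1). When drain > produce, the backlog shrinks by drain - produce records per round and the owner eventually returns.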

> > The introduction of uncontrolled
> >
> >   warn_alloc(gfp_mask, "page allocation stalls for %ums, order:%u", ...);

I am just curious why there would be so many messages. If I get it
correctly, this warning is printed once every 10 seconds. Or am I
wrong?

Well, you might want to consider using

	stall_timeout *= 2;

instead of adding the constant 10 * HZ. Of course, a better solution
would be some global throttling of this message.

Best Regards,
Petr

PS: I am not an mm expert and did not read this thread. Just ignore
this if I missed the point. Anyway, it sounds weird to linearize all
allocation requests in an OOM situation. It is much harder to unblock
high-order requests than low-order ones.
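
For reference, a minimal userspace sketch of the stall_timeout *= 2 idea mentioned above. The struct, the helper names, and MODEL_HZ are hypothetical and exist only for this illustration; the real check lives in __alloc_pages_slowpath() and uses jiffies. The sketch assumes the first warning fires after 10 seconds, as with the constant step.

```c
#include <stdbool.h>

/* Assumed tick rate, for illustration only. */
#define MODEL_HZ 250UL

/* Hypothetical per-allocation state mirroring alloc_start/stall_timeout. */
struct stall_state {
	unsigned long alloc_start;	/* tick at which the allocation began */
	unsigned long stall_timeout;	/* stall length that triggers a warning */
};

static void stall_state_init(struct stall_state *s, unsigned long now)
{
	s->alloc_start = now;
	s->stall_timeout = 10 * MODEL_HZ;
}

/*
 * Returns true when a stall warning should be printed at tick @now.
 * The threshold is doubled after each warning, so a stalled
 * allocation warns at roughly 10s, 20s, 40s, ... instead of every 10s.
 */
static bool stall_should_warn(struct stall_state *s, unsigned long now)
{
	if (now - s->alloc_start <= s->stall_timeout)
		return false;

	s->stall_timeout *= 2;
	return true;
}
```

With doubling, each stalled thread emits only a logarithmic number of warnings over the length of the stall. It does not bound the total when many threads stall at once, which is why a global throttle would still be the better fix.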