Re: [RFC][PATCH] cgroup_threadgroup_rwsem - affects scalability and OOM
From: Balbir Singh <bsingharora@gmail.com>
Date: 2016-08-09 07:02:54
Also in:
linux-mm
On 09/08/16 16:29, Tejun Heo wrote:
Hello, Balbir. On Tue, Aug 09, 2016 at 02:19:01PM +1000, Balbir Singh wrote:quoted
cgroup_threadgroup_rwsem is acquired in read mode during process exit and fork. It is also grabbed in write mode during __cgroups_proc_write I've recently run into a scenario with lots of memory pressure and OOM and I am beginning to see systemd __switch_to+0x1f8/0x350 __schedule+0x30c/0x990 schedule+0x48/0xc0 percpu_down_write+0x114/0x170 __cgroup_procs_write.isra.12+0xb8/0x3c0 cgroup_file_write+0x74/0x1a0 kernfs_fop_write+0x188/0x200 __vfs_write+0x6c/0xe0 vfs_write+0xc0/0x230 SyS_write+0x6c/0x110 system_call+0x38/0xb4 This thread is waiting on the reader of cgroup_threadgroup_rwsem to exit. The reader itself is under memory pressure and has gone into reclaim after fork. There are times the reader also ends up waiting on oom_lock as well....quoted
copy_page_range+0x4ec/0x950 copy_process.isra.5+0x15a0/0x1870 _do_fork+0xa8/0x4b0 ppc_clone+0x8/0xcYeah, we definitely don't wanna be holding the rwsem during the actual fork. ...quoted
There are other theoretical issues with this semaphore systemd can do 1. cgroup_mutex (cgroup_kn_lock_live) 2. cgroup_threadgroup_rwsem (W) (__cgroup_procs_write) and other threads can go 1. cgroup_threadgroup_rwsem (R) (copy_process) 2. mem_cgroup_iter (as a part of reclaim) (cgroup_mutex -- rcu lock or cgroup_mutex)Hmm? Where does mem_cgroup_iter grab cgroup_mutex? cgroup_mutex nests outside cgroup_threadgroup_rwsem or most other mutexes for that matter and isn't exposed from cgroup core.
I based my theory on the code mem_cgroup_iter -> css_next_descendant_pre which asserts cgroup_assert_mutex_or_rcu_locked(), although you are right, we hold RCU lock while calling css_* routines.
quoted
However, I've not examined them in too much detail or looked at lockdep wait chains for those paths. I am sure there is a good reason for placing cgroup_threadgroup_rwsem where it is today and I might be missing something. I am also surprisedI could be missing something too but the positioning is largely historic.quoted
no-one else has run into it so far.Maybe it might matter that much on a system which is already heavily thrasing, but yeah, we definitely want to tighten down the reader sections so that it doesn't get in the way of making forward progress.
It seems to cause my system to thrash quite badly.
quoted
Comments?The change looks good to me on the first glance but I'll think more about it tomorrow. Thanks!
Thanks for the review. Balbir Singh. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>