Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
From: Zefan Li <hidden>
Date: 2020-06-19 06:40:27
On 2020/6/19 5:09, Cong Wang wrote:
On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin [off-list ref] wrote:quoted
On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:quoted
On Wed, Jun 17, 2020 at 6:44 PM Zefan Li [off-list ref] wrote:quoted
Cc: Roman Gushchin <redacted> Thanks for fixing this. On 2020/6/17 2:03, Cong Wang wrote:quoted
When we clone a socket in sk_clone_lock(), its sk_cgrp_data is copied, so the cgroup refcnt must be taken too. And, unlike the sk_alloc() path, sock_update_netprioidx() is not called here. Therefore, it is safe and necessary to grab the cgroup refcnt even when cgroup_sk_alloc is disabled. sk_clone_lock() is in BH context anyway, the in_interrupt() would terminate this function if called there. And for sk_alloc() skcd->val is always zero. So it's safe to factor out the code to make it more readable. Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")but I don't think the bug was introduced by this commit, because there are already calls to cgroup_sk_alloc_disable() in write_priomap() and write_classid(), which can be triggered by writing to ifpriomap or classid in cgroupfs. This commit just made it much easier to happen with systemd invovled. I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"), which added cgroup_bpf_get() in cgroup_sk_alloc().Good point. I take a deeper look, it looks like commit d979a39d7242e06 is the one to blame, because it is the first commit that began to hold cgroup refcnt in cgroup_sk_alloc().I agree, ut seems that the issue is not related to bpf and probably can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed seems closer to the origin.Yeah, I will update the Fixes tag and send V2.
Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
but this is fine, because when the socket is to be freed:
sk_prot_free()
cgroup_sk_free()
cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
but cgroup_bpf_put() will decrement the bpf refcnt while this refcnt were not incremented
as cgroup_sk_alloc has already been disabled. That's why I think it's 4bfc0bb2c60e2f4c
that needs to be fixed.
quoted
Btw, based on the number of reported-by tags it seems that there was a real issue which the patch is fixing. Maybe you'll a couple of words about how it reveals itself in the real life?I still have no idea how exactly this is triggered. According to the people who reported this bug, they just need to wait for some hours to trigger. So I am not sure what to add here, just the stack trace? Thanks. .