Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

From: Zhang,Qiang <hidden>
Date: 2020-06-23 09:01:55
On Mon, Jun 22, 2020 at 11:14:20AM -0700, Cong Wang wrote:
 > On Sat, Jun 20, 2020 at 8:58 AM Roman Gushchin [off-list ref] wrote:
 > >
 > > On Fri, Jun 19, 2020 at 08:00:41PM -0700, Cong Wang wrote:
 > > > On Fri, Jun 19, 2020 at 6:14 PM Roman Gushchin [off-list ref] wrote:
 > > > >
 > > > > On Sat, Jun 20, 2020 at 09:00:40AM +0800, Zefan Li wrote:
 > > > > > I think so, though I'm not familiar with the bfp cgroup code.
 > > > > >
 > > > > > > If so, we might wanna fix it in a different way,
 > > > > > > just checking if (!(css->flags & CSS_NO_REF)) in 
cgroup_bpf_put()
 > > > > > > like in cgroup_put(). It feels more reliable to me.
 > > > > > >
 > > > > >
 > > > > > Yeah I also have this idea in my mind.
 > > > >
 > > > > I wonder if the following patch will fix the issue?
 > > >
 > > > Interesting, AFAIU, this refcnt is for bpf programs attached
 > > > to the cgroup. By this suggestion, do you mean the root
 > > > cgroup does not need to refcnt the bpf programs attached
 > > > to it? This seems odd, as I don't see how root is different
 > > > from others in terms of bpf programs which can be attached
 > > > and detached in the same way.
 > > >
 > > > I certainly understand the root cgroup is never gone, but this
 > > > does not mean the bpf programs attached to it too.
 > > >
 > > > What am I missing?
 > >
 > > It's different because the root cgroup can't be deleted.
 > >
 > > All this reference counting is required to automatically detach bpf 
programs
 > > from a _deleted_ cgroup (look at cgroup_bpf_offline()). It's required
 > > because a cgroup can be in dying state for a long time being pinned 
by a
 > > pagecache page, for example. Only a user can detach a bpf program from
 > > an existing cgroup.
 >
 > Yeah, but users can still detach the bpf programs from root cgroup.
 > IIUC, after detaching, the pointer in the bpf array will be 
empty_prog_array
 > which is just an array of NULL. Then __cgroup_bpf_run_filter_skb() will
 > deref it without checking NULL (as check_non_null == false).
 >
 > This matches the 0000000000000010 pointer seen in the bug reports,
 > the 0x10, that is 16, is the offset of items[] in struct bpf_prog_array.
 > So looks like we have to add a NULL check there regardless of refcnt.
 >
 > Also, I am not sure whether your suggested patch makes a difference
 > for percpu refcnt, as percpu_ref_put() will never call ->release() until
 > percpu_ref_kill(), which is never called on root cgroup?

 > Hm, true. But it means that the problem is not with the root cgroup's 
bpf?

 >How easy is to reproduce the problem? Is it possible to bisect the 
problematic
 >commit?

 >Thanks!

The tester found the following information during the test

The dmesg information is as follows (kernelv5.4) I don't know if it 
helps for this question


root@intel-x86-64:~# cgroup: cgroup: disabling cgroup2 socket matching 
due to net_prio or net_cls activation
IPv6: ADDRCONF(NETDEV_CHANGE): veth4c31d8d2: link becomes ready
cni0: port 1(veth4c31d8d2) entered blocking state
cni0: port 1(veth4c31d8d2) entered disabled state
device veth4c31d8d2 entered promiscuous mode
cni0: port 1(veth4c31d8d2) entered blocking state
cni0: port 1(veth4c31d8d2) entered forwarding state
IPv4: martian source 10.244.1.2 from 10.244.1.2, on dev cni0
ll header: 00000000: ff ff ff ff ff ff 12 88 f0 cc 64 b8 08 06
IPv4: martian source 10.244.1.2 from 10.244.1.2, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 12 88 f0 cc 64 b8 08 06
IPv4: martian source 10.244.1.1 from 10.244.1.2, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 12 88 f0 cc 64 b8 08 06
IPv6: ADDRCONF(NETDEV_CHANGE): vethb556dc7b: link becomes ready
cni0: port 2(vethb556dc7b) entered blocking state
cni0: port 2(vethb556dc7b) entered disabled state
device vethb556dc7b entered promiscuous mode
cni0: port 2(vethb556dc7b) entered blocking state
cni0: port 2(vethb556dc7b) entered forwarding state
IPv4: martian source 10.244.1.3 from 10.244.1.3, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 1a d7 25 1c ca 18 08 06
IPv4: martian source 10.244.1.1 from 10.244.1.3, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 1a d7 25 1c ca 18 08 06
IPv4: martian source 10.244.1.2 from 10.244.1.3, on dev eth0
ll header: 00000000: ff ff ff ff ff ff 1a d7 25 1c ca 18 08 06
-----------[ cut here ]-----------
percpu ref (cgroup_bpf_release_fn) <= 0 (-12) after switching to atomic
WARNING: CPU: 1 PID: 0 at lib/percpu-refcount.c:161 
percpu_ref_switch_to_atomic_rcu+0x12a/0x140
Modules linked in: ipt_REJECT nf_reject_ipv4 vxlan ip6_udp_tunnel 
udp_tunnel xt_statistic xt_nat xt_tcpudp iptable_mangle xt_comment 
xt_mark xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user 
iptable_nat xt_addrtype iptable_filter ip_tables xt_conntrack x_tables 
br_netfilter bridge stp llc bnep iTCO_wdt iTCO_vendor_support watchdog 
intel_powerclamp gpio_ich mgag200 drm_vram_helper drm_ttm_helper ttm 
i2c_i801 coretemp lpc_ich acpi_cpufreq sch_fq_codel openvswitch nsh 
nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nfsd
CPU: 1 PID: 0 Comm: swapper/1 Tainted: G I 5.7.0-yoctodev-standard #1
Hardware name: Intel Corporation S5520HC/S5520HC, BIOS 
S5500.86B.01.10.0025.030220091519 03/02/2009
RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x12a/0x140
Code: 80 3d b1 42 3b 01 00 0f 85 56 ff ff ff 49 8b 54 24 d8 48 c7 c7 68 
57 1d a9 c6 05 98 42 3b 01 01 49 8b 74 24 e8 e8 ea 14 aa ff <0f> 0b e9 
32 ff ff ff 0f 0b eb 97 cc cc cc cc cc cc cc cc cc cc cc
RSP: 0018:ffff996183268e90 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 7ffffffffffffff3 RCX: 0000000000000000
RDX: 0000000000000102 RSI: ffffffffa9794aa7 RDI: 00000000ffffffff
RBP: ffff996183268ea8 R08: ffffffffa9794a60 R09: 0000000000000047
R10: 0000000080000001 R11: ffffffffa9794a8c R12: ffff95855904fef0
R13: 000023dc1ba33080 R14: ffff996183268ee0 R15: ffff95855904fef0
FS: 0000000000000000(0000) GS:ffff958563c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f62622f8658 CR3: 000000044fc0a000 CR4: 00000000000006e0
Call Trace:
<IRQ>
rcu_core+0x227/0x870
? timerqueue_add+0x68/0xa0
rcu_core_si+0xe/0x10
__do_softirq+0x102/0x358
? tick_program_event+0x4d/0x90
irq_exit+0xa0/0x110
smp_apic_timer_interrupt+0xa1/0x1b0
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:cpuidle_enter_state+0xc0/0x3c0
Code: 85 c0 0f 8f 28 02 00 00 31 ff e8 5b 1a 5f ff 45 84 ff 74 12 9c 58 
f6 c4 02 0f 85 d0 02 00 00 31 ff e8 34 ba 65 ff fb 45 85 e4 <0f> 88 cc 
00 00 00 49 63 cc 4c 2b 75 d0 48 6b c1 68 48 6b d1 38 48
RSP: 0018:ffff9961831c3e48 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
RAX: ffff958563c00000 RBX: ffffffffa9515e60 RCX: 000000000000001f
RDX: 0000000000000000 RSI: 000000003286e833 RDI: 0000000000000000
RBP: ffff9961831c3e88 R08: 0000000000000002 R09: 0000000000000018
R10: 0000000000000364 R11: ffff958563c2a284 R12: 0000000000000003
R13: ffffb9617fc3fd00 R14: 00000194af870de5 R15: 0000000000000000
? cpuidle_enter_state+0xa5/0x3c0
cpuidle_enter+0x2e/0x40
call_cpuidle+0x23/0x40
do_idle+0x1c6/0x240
cpu_startup_entry+0x20/0x30
start_secondary+0x15b/0x190
secondary_startup_64+0xb6/0xc0
--[ end trace 805031dac04b28f5 ]--
`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help