Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
From: Zhang,Qiang <hidden>
Date: 2020-06-23 09:01:55
On Mon, Jun 22, 2020 at 11:14:20AM -0700, Cong Wang wrote: > On Sat, Jun 20, 2020 at 8:58 AM Roman Gushchin [off-list ref] wrote: > > > > On Fri, Jun 19, 2020 at 08:00:41PM -0700, Cong Wang wrote: > > > On Fri, Jun 19, 2020 at 6:14 PM Roman Gushchin [off-list ref] wrote: > > > > > > > > On Sat, Jun 20, 2020 at 09:00:40AM +0800, Zefan Li wrote: > > > > > I think so, though I'm not familiar with the bfp cgroup code. > > > > > > > > > > > If so, we might wanna fix it in a different way, > > > > > > just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put() > > > > > > like in cgroup_put(). It feels more reliable to me. > > > > > > > > > > > > > > > > Yeah I also have this idea in my mind. > > > > > > > > I wonder if the following patch will fix the issue? > > > > > > Interesting, AFAIU, this refcnt is for bpf programs attached > > > to the cgroup. By this suggestion, do you mean the root > > > cgroup does not need to refcnt the bpf programs attached > > > to it? This seems odd, as I don't see how root is different > > > from others in terms of bpf programs which can be attached > > > and detached in the same way. > > > > > > I certainly understand the root cgroup is never gone, but this > > > does not mean the bpf programs attached to it too. > > > > > > What am I missing? > > > > It's different because the root cgroup can't be deleted. > > > > All this reference counting is required to automatically detach bpf programs > > from a _deleted_ cgroup (look at cgroup_bpf_offline()). It's required > > because a cgroup can be in dying state for a long time being pinned by a > > pagecache page, for example. Only a user can detach a bpf program from > > an existing cgroup. > > Yeah, but users can still detach the bpf programs from root cgroup. > IIUC, after detaching, the pointer in the bpf array will be empty_prog_array > which is just an array of NULL. Then __cgroup_bpf_run_filter_skb() will > deref it without checking NULL (as check_non_null == false). > > This matches the 0000000000000010 pointer seen in the bug reports, > the 0x10, that is 16, is the offset of items[] in struct bpf_prog_array. > So looks like we have to add a NULL check there regardless of refcnt. > > Also, I am not sure whether your suggested patch makes a difference > for percpu refcnt, as percpu_ref_put() will never call ->release() until > percpu_ref_kill(), which is never called on root cgroup? > Hm, true. But it means that the problem is not with the root cgroup's bpf? >How easy is to reproduce the problem? Is it possible to bisect the problematic >commit? >Thanks! The tester found the following information during the test The dmesg information is as follows (kernelv5.4) I don't know if it helps for this question root@intel-x86-64:~# cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation IPv6: ADDRCONF(NETDEV_CHANGE): veth4c31d8d2: link becomes ready cni0: port 1(veth4c31d8d2) entered blocking state cni0: port 1(veth4c31d8d2) entered disabled state device veth4c31d8d2 entered promiscuous mode cni0: port 1(veth4c31d8d2) entered blocking state cni0: port 1(veth4c31d8d2) entered forwarding state IPv4: martian source 10.244.1.2 from 10.244.1.2, on dev cni0 ll header: 00000000: ff ff ff ff ff ff 12 88 f0 cc 64 b8 08 06 IPv4: martian source 10.244.1.2 from 10.244.1.2, on dev eth0 ll header: 00000000: ff ff ff ff ff ff 12 88 f0 cc 64 b8 08 06 IPv4: martian source 10.244.1.1 from 10.244.1.2, on dev eth0 ll header: 00000000: ff ff ff ff ff ff 12 88 f0 cc 64 b8 08 06 IPv6: ADDRCONF(NETDEV_CHANGE): vethb556dc7b: link becomes ready cni0: port 2(vethb556dc7b) entered blocking state cni0: port 2(vethb556dc7b) entered disabled state device vethb556dc7b entered promiscuous mode cni0: port 2(vethb556dc7b) entered blocking state cni0: port 2(vethb556dc7b) entered forwarding state IPv4: martian source 10.244.1.3 from 10.244.1.3, on dev eth0 ll header: 00000000: ff ff ff ff ff ff 1a d7 25 1c ca 18 08 06 IPv4: martian source 10.244.1.1 from 10.244.1.3, on dev eth0 ll header: 00000000: ff ff ff ff ff ff 1a d7 25 1c ca 18 08 06 IPv4: martian source 10.244.1.2 from 10.244.1.3, on dev eth0 ll header: 00000000: ff ff ff ff ff ff 1a d7 25 1c ca 18 08 06 -----------[ cut here ]----------- percpu ref (cgroup_bpf_release_fn) <= 0 (-12) after switching to atomic WARNING: CPU: 1 PID: 0 at lib/percpu-refcount.c:161 percpu_ref_switch_to_atomic_rcu+0x12a/0x140 Modules linked in: ipt_REJECT nf_reject_ipv4 vxlan ip6_udp_tunnel udp_tunnel xt_statistic xt_nat xt_tcpudp iptable_mangle xt_comment xt_mark xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user iptable_nat xt_addrtype iptable_filter ip_tables xt_conntrack x_tables br_netfilter bridge stp llc bnep iTCO_wdt iTCO_vendor_support watchdog intel_powerclamp gpio_ich mgag200 drm_vram_helper drm_ttm_helper ttm i2c_i801 coretemp lpc_ich acpi_cpufreq sch_fq_codel openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nfsd CPU: 1 PID: 0 Comm: swapper/1 Tainted: G I 5.7.0-yoctodev-standard #1 Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009 RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x12a/0x140 Code: 80 3d b1 42 3b 01 00 0f 85 56 ff ff ff 49 8b 54 24 d8 48 c7 c7 68 57 1d a9 c6 05 98 42 3b 01 01 49 8b 74 24 e8 e8 ea 14 aa ff <0f> 0b e9 32 ff ff ff 0f 0b eb 97 cc cc cc cc cc cc cc cc cc cc cc RSP: 0018:ffff996183268e90 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 7ffffffffffffff3 RCX: 0000000000000000 RDX: 0000000000000102 RSI: ffffffffa9794aa7 RDI: 00000000ffffffff RBP: ffff996183268ea8 R08: ffffffffa9794a60 R09: 0000000000000047 R10: 0000000080000001 R11: ffffffffa9794a8c R12: ffff95855904fef0 R13: 000023dc1ba33080 R14: ffff996183268ee0 R15: ffff95855904fef0 FS: 0000000000000000(0000) GS:ffff958563c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f62622f8658 CR3: 000000044fc0a000 CR4: 00000000000006e0 Call Trace: <IRQ> rcu_core+0x227/0x870 ? timerqueue_add+0x68/0xa0 rcu_core_si+0xe/0x10 __do_softirq+0x102/0x358 ? tick_program_event+0x4d/0x90 irq_exit+0xa0/0x110 smp_apic_timer_interrupt+0xa1/0x1b0 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:cpuidle_enter_state+0xc0/0x3c0 Code: 85 c0 0f 8f 28 02 00 00 31 ff e8 5b 1a 5f ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 d0 02 00 00 31 ff e8 34 ba 65 ff fb 45 85 e4 <0f> 88 cc 00 00 00 49 63 cc 4c 2b 75 d0 48 6b c1 68 48 6b d1 38 48 RSP: 0018:ffff9961831c3e48 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13 RAX: ffff958563c00000 RBX: ffffffffa9515e60 RCX: 000000000000001f RDX: 0000000000000000 RSI: 000000003286e833 RDI: 0000000000000000 RBP: ffff9961831c3e88 R08: 0000000000000002 R09: 0000000000000018 R10: 0000000000000364 R11: ffff958563c2a284 R12: 0000000000000003 R13: ffffb9617fc3fd00 R14: 00000194af870de5 R15: 0000000000000000 ? cpuidle_enter_state+0xa5/0x3c0 cpuidle_enter+0x2e/0x40 call_cpuidle+0x23/0x40 do_idle+0x1c6/0x240 cpu_startup_entry+0x20/0x30 start_secondary+0x15b/0x190 secondary_startup_64+0xb6/0xc0 --[ end trace 805031dac04b28f5 ]--