Re: [PATCH] capabilities: add capability cgroup controller

From: Petr Mladek <pmladek@suse.com>
Date: 2016-07-08 09:13:46
Also in: lkml

On Thu 2016-07-07 20:27:13, Topi Miettinen wrote:

On 07/07/16 09:16, Petr Mladek wrote:

quoted

On Sun 2016-07-03 15:08:07, Topi Miettinen wrote:

quoted

The attached patch would make any uses of capabilities generate audit
messages. It works for simple tests as you can see from the commit
message, but unfortunately the call to audit_cgroup_list() deadlocks the
system when booting a full blown OS. There's no deadlock when the call
is removed.

I guess that in some cases, cgroup_mutex and/or css_set_lock could be
already held earlier before entering audit_cgroup_list(). Holding the
locks is however required by task_cgroup_from_root(). Is there any way
to avoid this? For example, only print some kind of cgroup ID numbers
(are there unique and stable IDs, available without locks?) for those
cgroups where the task is registered in the audit message?

I am not sure if anyone know what really happens here. I suggest to
enable lockdep. It might detect possible deadlock even before it
really happens, see Documentation/locking/lockdep-design.txt

It can be enabled by

   CONFIG_PROVE_LOCKING=y

It depends on

    CONFIG_DEBUG_KERNEL=y

and maybe some more options, see lib/Kconfig.debug

Thanks a lot! I caught this stack dump:

starting version 230
[    3.416647] ------------[ cut here ]------------
[    3.417310] WARNING: CPU: 0 PID: 95 at
/home/topi/d/linux.git/kernel/locking/lockdep.c:2871
lockdep_trace_alloc+0xb4/0xc0
[    3.417605] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
[    3.417923] Modules linked in:
[    3.418288] CPU: 0 PID: 95 Comm: systemd-udevd Not tainted 4.7.0-rc5+ #97
[    3.418444] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Debian-1.8.2-1 04/01/2014
[    3.418726]  0000000000000086 000000007970f3b0 ffff88000016fb00
ffffffff813c9c45
[    3.418993]  ffff88000016fb50 0000000000000000 ffff88000016fb40
ffffffff81091e9b
[    3.419176]  00000b3705e2c798 0000000000000046 0000000000000410
00000000ffffffff
[    3.419374] Call Trace:
[    3.419511]  [<ffffffff813c9c45>] dump_stack+0x67/0x92
[    3.419644]  [<ffffffff81091e9b>] __warn+0xcb/0xf0
[    3.419745]  [<ffffffff81091f1f>] warn_slowpath_fmt+0x5f/0x80
[    3.419868]  [<ffffffff810e9a84>] lockdep_trace_alloc+0xb4/0xc0
[    3.419988]  [<ffffffff8120dc42>] kmem_cache_alloc_node+0x42/0x600
[    3.420156]  [<ffffffff8110432d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[    3.420170]  [<ffffffff8163183b>] __alloc_skb+0x5b/0x1d0
[    3.420170]  [<ffffffff81144f6b>] audit_log_start+0x29b/0x480
[    3.420170]  [<ffffffff810a2925>] ? __lock_task_sighand+0x95/0x270
[    3.420170]  [<ffffffff81145cc9>] audit_log_cap_use+0x39/0xf0
[    3.420170]  [<ffffffff8109cd75>] ns_capable+0x45/0x70
[    3.420170]  [<ffffffff8109cdb7>] capable+0x17/0x20
[    3.420170]  [<ffffffff812a2f50>] oom_score_adj_write+0x150/0x2f0
[    3.420170]  [<ffffffff81230997>] __vfs_write+0x37/0x160
[    3.420170]  [<ffffffff810e33b7>] ? update_fast_ctr+0x17/0x30
[    3.420170]  [<ffffffff810e3449>] ? percpu_down_read+0x49/0x90
[    3.420170]  [<ffffffff81233d47>] ? __sb_start_write+0xb7/0xf0
[    3.420170]  [<ffffffff81233d47>] ? __sb_start_write+0xb7/0xf0
[    3.420170]  [<ffffffff81231048>] vfs_write+0xb8/0x1b0
[    3.420170]  [<ffffffff812533c6>] ? __fget_light+0x66/0x90
[    3.420170]  [<ffffffff81232078>] SyS_write+0x58/0xc0
[    3.420170]  [<ffffffff81001f2c>] do_syscall_64+0x5c/0x300
[    3.420170]  [<ffffffff81849c9a>] entry_SYSCALL64_slow_path+0x25/0x25
[    3.420170] ---[ end trace fb586899fb556a5e ]---
[    3.447922] random: systemd-udevd urandom read with 3 bits of entropy
available
[    4.014078] clocksource: Switched to clocksource tsc
Begin: Loading essential drivers ... done.

This is with qemu and the boot continues normally. With real computer,
there's no such output and system just seems to freeze.

Could it be possible that the deadlock happens because there's some IO
towards /sys/fs/cgroup, which causes a capability check and that in turn
causes locking problems when we try to print cgroup list?

The above warning is printed by the code from
kernel/locking/lockdep.c:2871

static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
{
[...]
	/* We're only interested __GFP_FS allocations for now */
	if (!(gfp_mask & __GFP_FS))
		return;

	/*
	 * Oi! Can't be having __GFP_FS allocations with IRQs disabled.
	 */
	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
		return;


The backtrace shows that your new audit_log_cap_use() is called
from vfs_write(). You might try to use audit_log_start() with
GFP_NOFS instead of GFP_KERNEL.

Note that this is rather intuitive advice. I still need to learn a lot
about memory management and kernel in general to be more sure about
a correct solution.

Best Regards,
Petr

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help