Re: Re: [PATCH resend 2/2] userns: control capabilities of some user namespaces

From: Mahesh Bandewar (महेश बंडेवार) <hidden>
Date: 2017-11-09 07:13:06
Also in: linux-api

On Thu, Nov 9, 2017 at 12:21 PM, Serge E. Hallyn [off-list ref] wrote:

On Thu, Nov 09, 2017 at 09:55:41AM +0900, Mahesh Bandewar (महेश बंडेवार) wrote:

On Thu, Nov 9, 2017 at 4:02 AM, Christian Brauner [off-list ref] wrote:

On Wed, Nov 08, 2017 at 03:09:59AM -0800, Mahesh Bandewar (महेश बंडेवार) wrote:

Sorry folks, I was traveling and it seems like a lot happened on this
thread. :p

I will try to respond to a few of these comments selectively -

The thing that makes me hesitate with this set is that it is a
permanent new feature to address what (I hope) is a temporary
problem.

I agree this is a permanent new feature but it's not solving a temporary
problem. It's impossible to assess what new vulnerability could show up,
or when. I think Daniel summed it up appropriately in his response.

Seems like there are two naive ways to do it, the first being to just
look at all code under ns_capable() plus code called from there.  It
seems like looking at the result of that could be fruitful.

This is really hard. The main issue is that there were features designed
and developed before the user-ns days with the assumption that
unprivileged users would never get certain capabilities which only the
root user gets. That is not true anymore, now that any process can create
a user-ns with root mapped inside it. At the same time, blocking user-ns
creation for everyone is a big hammer which is not needed either. So it's
not that easy to just perform a code walk-through and correct those
decisions now.

It seems to me that the existing control in
/proc/sys/kernel/unprivileged_userns_clone might be the better duct tape
in that case.

This solution essentially blocks unprivileged users from using
user-namespaces entirely. This is not really a solution that can
work. The solution that this patch-set adds still allows unprivileged
users to create user-namespaces. The proposed solution is actually a more
fine-grained approach than the unprivileged_userns_clone solution,
since you can selectively block capabilities rather than completely
blocking the functionality.

I've been talking to Stéphane today about this and we should also keep
in mind that we have:

chb@conventiont|~
ls -al /proc/sys/user/
total 0
dr-xr-xr-x 1 root root 0 Nov  6 23:32 .
dr-xr-xr-x 1 root root 0 Nov  2 22:13 ..
-rw-r--r-- 1 root root 0 Nov  8 19:48 max_cgroup_namespaces
-rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_instances
-rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_watches
-rw-r--r-- 1 root root 0 Nov  8 19:48 max_ipc_namespaces
-rw-r--r-- 1 root root 0 Nov  8 19:48 max_mnt_namespaces
-rw-r--r-- 1 root root 0 Nov  8 19:48 max_net_namespaces
-rw-r--r-- 1 root root 0 Nov  8 19:48 max_pid_namespaces
-rw-r--r-- 1 root root 0 Nov  8 19:48 max_user_namespaces
-rw-r--r-- 1 root root 0 Nov  8 19:48 max_uts_namespaces

These files allow you to limit the number of namespaces that can be
created *per namespace* type. So let's say your system runs a bunch of
user namespaces you can do:

chb@conventiont|~
echo 0 > /proc/sys/user/max_user_namespaces

So that the next time you try to create a user namespace you'd see:

chb@conventiont|~
unshare -U
unshare: unshare failed: No space left on device

So there's not even a need to upstream a new sysctl since we have ways
of blocking this.
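To see what these limits are currently set to, the files can simply be read. A minimal sketch follows; the helper name read_namespace_limits is made up for illustration, and the only thing assumed about the kernel is the /proc/sys/user/max_*_namespaces layout shown in the listing above.

```python
import glob
import os


def read_namespace_limits(base="/proc/sys/user"):
    """Return {file name: limit} for every max_*_namespaces file under base."""
    limits = {}
    for path in sorted(glob.glob(os.path.join(base, "max_*_namespaces"))):
        with open(path) as f:
            limits[os.path.basename(path)] = int(f.read().strip())
    return limits


if __name__ == "__main__":
    # Prints nothing on systems that do not expose these files.
    for name, value in read_namespace_limits().items():
        print(f"{name} = {value}")
```

A value of 0 for max_user_namespaces corresponds to the "No space left on device" failure from unshare -U shown above.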

I'm not sure how it's solving the problem that my patch-set is
addressing?

I agree though that the need for unprivileged_userns_clone sysctl goes
away as this is equivalent to setting that sysctl to 0 as you have
described above.

oh right that was the reasoning iirc for not needing the other sysctl.

However, as I mentioned earlier, blocking processes from creating
user-namespaces is not the solution. Processes should be able to
create namespaces as they are designed to, but at the same time we need
to have controls to 'contain' them if the need arises. Setting max_no to
0 is not the solution that I'm looking for since it doesn't solve the
problem.

well yesterday we were told that was explicitly not the goal, but that was
not by you ... i just mention it to explain why we seem to be walking in
circles a bit.

anyway the bounding set doesn't actually make sense so forget that.   the
question then is just whether it makes sense to allow things to continue
at all in this situation.  would you mind indulging me by giving one or
two concrete examples in the previous known cves of what capabilities you
would have dropped to allow the rest to continue to be safely used?

Of course. Let's take the example of the CVE that I mentioned in my
cover-letter - CVE-2017-7308
<https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-7308>. It's well
documented and even has an exploit C program
<https://github.com/xairy/kernel-exploits/tree/master/CVE-2017-7308> that
demonstrates how it can be used against a non-patched kernel. There is a
very nice blog post
<https://googleprojectzero.blogspot.kr/2017/05/exploiting-linux-kernel-via-packet.html>
about this vulnerability by Andrey Konovalov.

This is about the AF_PACKET socket interface, which is protected behind
the NET_RAW capability. This capability is not available to an
unprivileged user. However, any unprivileged user can get the NET_RAW
capability inside a user namespace (as demonstrated by the cover-letter
code that I have attached in this patch series), so it is effectively
available to any unprivileged user on the host if the kernel has
user-namespaces available.

With this patch-set applied, all that is needed is to flip a bit with the
sysctl (kernel.controlled_userns_caps_whitelist) as demonstrated below -

root@lphh6:~# uname -a
Linux lphh6 4.14.0-smp-DEV #97 SMP @1510203579 x86_64 GNU/Linux
root@lphh6:~# sysctl -q kernel.controlled_userns_caps_whitelist
kernel.controlled_userns_caps_whitelist = 1f,ffffffff

Now when I run the program (demo from the cover-letter) as a normal
unprivileged user I can't create a RAW socket in init-ns but I can in the
child-ns.

dumbo@lphh6:~$ /tmp/acquire_raw
Attempting to open RAW socket before unshare()...
socket() SOCK_RAW failed: : Operation not permitted
Attempting to open RAW socket after unshare()...
Successfully opened RAW-Sock after unshare().
dumbo@lphh6:~$

Now, as the root user, take off CAP_NET_RAW -

root@lphh6:~# sysctl -w kernel.controlled_userns_caps_whitelist=1f,ffffdfff
kernel.controlled_userns_caps_whitelist = 1f,ffffdfff
root@lphh6:~#
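For reference, the new value is the old one with bit 13 cleared; 13 is the number of CAP_NET_RAW in <linux/capability.h>. A small sketch of that arithmetic follows. The helper names are made up for illustration, and the formatting (comma-separated 32-bit hex words, most significant first, leading word unpadded) is an assumption based only on the two values shown above.

```python
CAP_NET_RAW = 13  # bit number from <linux/capability.h>


def parse_mask(text):
    """Parse a comma-separated hex bitmap such as '1f,ffffffff'."""
    value = 0
    for word in text.strip().split(","):
        value = (value << 32) | int(word, 16)
    return value


def format_mask(value):
    """Format an integer as 32-bit hex words, most significant first."""
    words = []
    while True:
        words.append(value & 0xFFFFFFFF)
        value >>= 32
        if value == 0:
            break
    words.reverse()
    # Assumption: leading word unpadded, remaining words padded to 8 digits.
    return ",".join([format(words[0], "x")] + [format(w, "08x") for w in words[1:]])


def drop_capability(mask_text, cap_bit):
    """Return mask_text with the given capability bit cleared."""
    return format_mask(parse_mask(mask_text) & ~(1 << cap_bit))


if __name__ == "__main__":
    print(drop_capability("1f,ffffffff", CAP_NET_RAW))
```

Clearing a different capability only changes the bit number passed to drop_capability.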

Now run the same program as an unprivileged user -

dumbo@lphh6:~$ /tmp/acquire_raw
Attempting to open RAW socket before unshare()...
socket() SOCK_RAW failed: : Operation not permitted
Attempting to open RAW socket after unshare()...
socket() SOCK_RAW failed: : Operation not permitted
dumbo@lphh6:~$

Notice that it has failed to create a raw socket in both the init and the
child namespace. It's not blocking creation of user-namespaces but
allowing the admin to turn individual capability bits on and off.
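The before/after check that the runs above perform can be approximated as follows. This is only a sketch and not the cover-letter program: it assumes an AF_PACKET socket is what gets opened, and it unshares a network namespace together with the user namespace, because the NET_RAW check is made against the user namespace that owns the network namespace. The function names are made up for illustration.

```python
import ctypes
import os
import socket

# Flag values from <linux/sched.h>.
CLONE_NEWUSER = 0x10000000
CLONE_NEWNET = 0x40000000
ETH_P_ALL = 0x0003  # from <linux/if_ether.h>


def can_open_packet_socket():
    """Return True if this process may open an AF_PACKET raw socket."""
    try:
        sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW,
                             socket.htons(ETH_P_ALL))
    except OSError:
        return False
    sock.close()
    return True


def can_open_after_unshare():
    """Fork a child, unshare user and net namespaces there, and retry."""
    pid = os.fork()
    if pid == 0:
        # Child: any failure, including unshare() being refused, counts as no.
        ok = False
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWNET) == 0:
                ok = can_open_packet_socket()
        finally:
            os._exit(0 if ok else 1)
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    print("before unshare():", can_open_packet_socket())
    print("after unshare(): ", can_open_after_unshare())
```

The unshare happens in a forked child so that the calling process stays in its original namespaces.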

This is a very simplistic example that just demonstrates how turning
capability bits on and off works. So let's assume a sandboxed environment
where we don't know what a binary that we are about to run will do, and
the environment has been identified as susceptible. By turning off the
NET_RAW bit, the admin gets an assurance that the system is safe. If the
binary fails because it's not getting this capability, that is a bad but
acceptable consequence (host integrity is not compromised). If it doesn't
use the NET_RAW capability but some other combination of the remaining 36
capabilities, it gets whatever it needs. This means we can safely allow
processes to create user-namespaces by taking off the capabilities in
question for a temporary or extended period, until a proper fix is
applied, without compromising system integrity. The impact will vary
based on which capability is taken off, and the admin would / should be
aware of it for the environment that he/she is dealing with.

thanks,
--mahesh..

thanks,
serge
