Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only

From: Lokesh Gidra <hidden>
Date: 2020-08-17 22:11:38
Also in: linux-fsdevel, lkml

On Wed, Aug 5, 2020 at 10:44 PM Michael S. Tsirkin [off-list ref] wrote:

On Wed, Aug 05, 2020 at 05:43:02PM -0700, Nick Kralevich wrote:

quoted

On Fri, Jul 24, 2020 at 6:40 AM Michael S. Tsirkin [off-list ref] wrote:

quoted

On Thu, Jul 23, 2020 at 05:13:28PM -0700, Nick Kralevich wrote:

quoted

On Thu, Jul 23, 2020 at 10:30 AM Lokesh Gidra [off-list ref] wrote:

quoted

From the discussion so far it seems that there is a consensus that
patch 1/2 in this series should be upstreamed in any case. Is there
anything that is pending on that patch?

That's my reading of this thread too.

quoted

Unless I'm mistaken that you can already enforce bit 1 of the second
parameter of the userfaultfd syscall to be set with seccomp-bpf, this
would be more a question to the Android userland team.

The question would be: does it ever happen that a seccomp filter isn't
already applied to unprivileged software running without
SYS_CAP_PTRACE capability?

Yes.

Android uses selinux as our primary sandboxing mechanism. We do use
seccomp on a few processes, but we have found that it has a
surprisingly high performance cost [1] on arm64 devices so turning it
on system wide is not a good option.

[1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad (local)

As Jeff mentioned, seccomp is used strategically on Android, but is
not applied to all processes. It's too expensive and impractical when
simpler implementations (such as this sysctl) can exist. It's also
significantly simpler to test a sysctl value for correctness as
opposed to a seccomp filter.

Given that selinux is already used system-wide on Android, what is wrong
with using selinux to control userfaultfd as opposed to seccomp?

Userfaultfd file descriptors will be generally controlled by SELinux.
You can see the patchset at
https://lore.kernel.org/lkml/20200401213903.182112-3-dancol@google.com/ (local)
(which is also referenced in the original commit message for this
patchset). However, the SELinux patchset doesn't include the ability
to control FAULT_FLAG_USER / UFFD_USER_MODE_ONLY directly.

SELinux already has the ability to control who gets CAP_SYS_PTRACE,
which combined with this patch, is largely equivalent to direct
UFFD_USER_MODE_ONLY checks. Additionally, with the SELinux patch
above, movement of userfaultfd file descriptors can be mediated by
SELinux, preventing one process from acquiring userfaultfd descriptors
of other processes unless allowed by security policy.

It's an interesting question whether finer-grain SELinux support for
controlling UFFD_USER_MODE_ONLY should be added. I can see some
advantages to implementing this. However, we don't need to decide that
now.

Kernel security checks generally break down into DAC (discretionary
access control) and MAC (mandatory access control) controls. Most
kernel security features check via both of these mechanisms. Security
attributes of the system should be settable without necessarily
relying on an LSM such as SELinux. This patch follows the same basic
model -- system wide control of a hardening feature is provided by the
unprivileged_userfaultfd_user_mode_only sysctl (DAC), and if needed,
SELinux support for this can also be implemented on top of the DAC
controls.

This DAC/MAC split has been successful in several other security
features. For example, the ability to map at page zero is controlled
in DAC via the mmap_min_addr sysctl [1], and via SELinux via the
mmap_zero access vector [2]. Similarly, access to the kernel ring
buffer is controlled both via DAC as the dmesg_restrict sysctl [3], as
well as the SELinux syslog_read [2] check. Indeed, the dmesg_restrict
sysctl is very similar to this patch -- it introduces a capability
(CAP_SYSLOG, CAP_SYS_PTRACE) check on access to a sensitive resource.

If we want to ensure that a security feature will be well tested and
vetted, it's important to not limit its use to LSMs only. This ensures
that kernel and application developers will always be able to test the
effects of a security feature, without relying on LSMs like SELinux.
It also ensures that all distributions can enable this security
mitigation should it be necessary for their unique environments,
without introducing an SELinux dependency. And this patch does not
preclude an SELinux implementation should it be necessary.

Even if we decide to implement fine-grain SELinux controls on
UFFD_USER_MODE_ONLY, we still need this patch. We shouldn't make this
an either/or choice between SELinux and this patch. Both are
necessary.

-- Nick

[1] https://wiki.debian.org/mmap_min_addr
[2] https://selinuxproject.org/page/NB_ObjectClassesPermissions
[3] https://www.kernel.org/doc/Documentation/sysctl/kernel.txt

I am not sure I agree this is similar to dmesg access.

The reason I say it is this: it is pretty easy for admins to know
whether they run something that needs to access the kernel ring buffer.
Or if it's a tool developer poking at dmesg, they can tell admins "we
need these permissions".  But it seems impossible for either an admin to
know that a userfaultfd page e.g. used with shared memory is accessed
from the kernel.

So I guess the question is: how does anyone not running Android
know to set this flag?

I got the feeling it's not really possible, and so for a single-user
feature like this a single API seems enough.  Given a choice between a
knob an admin is supposed to set and selinux policy written by
presumably knowledgeable OS vendors, I'd opt for a second option.

Hope this helps.

There has been an emphasis that Android is probably the only user for
the restriction of userfaults from kernel-space and that it wouldn’t
be useful anywhere else. I humbly disagree! There are various areas
where the PROT_NONE+SIGSEGV trick is (and can be) used in a purely
user-space setting. Basically, any lazy, on-demand,
initialization/decompression/loading could be a good candidate for
this trick. My project happens to be one of them. In fact, in Android
we are also thinking of using it in some other places, all in
user-space. And given that userfaultfd is an efficient replacement for
this trick [1], there are various scenarios which would benefit from
the restriction of userfaults from kernel-space, provided the admins
care about security on such devices. IIUC, a security admin would
never trust an unprivileged process with userfaults from kernel space.
Therefore, a sysctl knob restriction with CAP_SYS_PTRACE for
privileged processes seems like the right choice to me.

Coming to sysctl vs. SELinux debate, I think wherever the role of OS
vendor and admin is played by different people, I doubt a generic
SELinux policy set by the former will be blindly acceptable to the
latter. Furthermore, I’m not sure if an admin is expected to even know
which packages running on their system are using userfaultfd. So they
anyway have to rely on developers reaching out to get the required
permission. With the new sysctl knob enabled, the number of such
requests is only going to decrease.

[1] https://www.kernel.org/doc/Documentation/vm/userfaultfd.txt

quoted


If answer is "no" the behavior of the new sysctl in patch 2/2 (in
subject) should be enforceable with minor changes to the BPF
assembly. Otherwise it'd require more changes.

It would be good to understand what these changes are.

quoted

Why exactly is it preferable to enlarge the surface of attack of the
kernel and take the risk there is a real bug in userfaultfd code (not
just a facilitation of exploiting some other kernel bug) that leads to
a privilege escalation, when you still break 99% of userfaultfd users,
if you set with option "2"?

I can see your point if you think about the feature as a whole.
However, distributions (such as Android) have specialized knowledge of
their security environments, and may not want to support the typical
usages of userfaultfd. For such distributions, providing a mechanism
to prevent userfaultfd from being useful as an exploit primitive,
while still allowing the very limited use of userfaultfd for userspace
faults only, is desirable. Distributions shouldn't be forced into
supporting 100% of the use cases envisioned by userfaultfd when their
needs may be more specialized, and this sysctl knob empowers
distributions to make this choice for themselves.

quoted

Is the system owner really going to purely run on his systems CRIU
postcopy live migration (which already runs with CAP_SYS_PTRACE) and
nothing else that could break?

This is a great example of a capability which a distribution may not
want to support, due to distribution specific security policies.

quoted

Option "2" to me looks with a single possible user, and incidentally
this single user can already enforce model "2" by only tweaking its
seccomp-bpf filters without applying 2/2. It'd be a bug if android
apps runs unprotected by seccomp regardless of 2/2.

Can you elaborate on what bug is present by processes being
unprotected by seccomp?

Seccomp cannot be universally applied on Android due to previously
mentioned performance concerns. Seccomp is used in Android primarily
as a tool to enforce the list of allowed syscalls, so that such
syscalls can be audited before being included as part of the Android
API.

-- Nick

--
Nick Kralevich | nnk@google.com


--
Nick Kralevich | nnk@google.com

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help