Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only

From: Lokesh Gidra <hidden>
Date: 2020-09-05 00:36:24
Also in: linux-fsdevel, lkml

On Thu, Sep 3, 2020 at 8:34 PM Andrea Arcangeli [off-list ref] wrote:

Hello,

On Mon, Aug 17, 2020 at 03:11:16PM -0700, Lokesh Gidra wrote:

quoted

There has been an emphasis that Android is probably the only user for
the restriction of userfaults from kernel-space and that it wouldn’t
be useful anywhere else. I humbly disagree! There are various areas
where the PROT_NONE+SIGSEGV trick is (and can be) used in a purely
user-space setting. Basically, any lazy, on-demand,

For the record what I said is quoted below
https://lkml.kernel.org/r/20200520194804.GJ26186@redhat.com :

"""It all boils down of how peculiar it is to be able to leverage only
the acceleration [..] Right now there's a single user that can cope
with that limitation [..] If there will be more users [..]  it'd be
fine to add a value "2" later."""

Specifically I never said "that it wouldn’t be useful anywhere else.".

Thanks a lot for clarifying.

Also I'm only arguing about the sysctl visible kABI change in patch
2/2: the flag passed as parameter to the syscall in patch 1/2 is all
great, because seccomp needs it in the scalar parameter of the syscall
to implement a filter equivalent to your sysctl "2" policy with only
patch 1/2 applied.

I've two more questions now:

1) why don't you enforce the block of kernel initiated faults with
   seccomp-bpf instead of adding a sysctl value 2? Is the sysctl just
   an optimization to remove a few instructions per syscall in the bpf
   execution of Android unprivileged apps? You should block a lot of
   other syscalls by default to all unprivileged processes, including
   vmsplice.

   In other words if it's just for Android, why can't Android solve it
   with only patch 1/2 by tweaking the seccomp filter?

I would let Nick (nnk@) and Jeff (jeffv@) respond to this.

The previous responses from both of them on this email thread
(https://lore.kernel.org/lkml/CABXk95A-E4NYqA5qVrPgDF18YW-z4_udzLwa0cdo2OfqVsy=SQ@mail.gmail.com/ (local)
and https://lore.kernel.org/lkml/CAFJ0LnGfrzvVgtyZQ+UqRM6F3M7iXOhTkUBTc+9sV+=RrFntyQ@mail.gmail.com/ (local))
suggest that the performance overhead of seccomp-bpf is too much. Kees
also objected to it
(https://lore.kernel.org/lkml/202005200921.2BD5A0ADD@keescook/ (local))

I'm not familiar with how seccomp-bpf works. All that I can add here
is that userfaultfd syscall is usually not invoked in a performance
critical code path. So, if the performance overhead of seccomp-bpf (if
enabled) is observed on all syscalls originating from a process, then
I'd say patch 2/2 is essential. Otherwise, it should be ok to let
seccomp perform the same functionality instead.

2) given that Android is secure enough with the sysctl at value 2, why
   should we even retain the current sysctl 0 semantics? Why can't
   more secure systems just use seccomp and block userfaultfd, as it
   is already happens by default in the podman default seccomp
   whitelist (for those containers that don't define a new json
   whitelist in the OCI schema)? Shouldn't we focus our energy in
   making containers more secure by preventing the OCI schema of a
   random container to re-enable userfaultfd in the container seccomp
   filter instead of trying to solve this with a global sysctl?

   What's missing in my view is a kubernetes hard allowlist/denylist
   that cannot be overridden with the OCI schema in case people has
   the bad idea of running containers downloaded from a not fully
   trusted source, without adding virt isolation and that's an
   userland problem to be solved in the container runtime, not a
   kernel issue. Then you'd just add userfaultfd to the json of the
   k8s hard seccomp denylist instead of going around tweaking sysctl.

What's your take in changing your 2/2 patch to just replace value "0"
and avoid introducing a new value "2"?

SGTM. Disabling uffd completely for unprivileged processes can be
achieved either using seccomp-bpf, or via SELinux, once the following
patch series is upstreamed
https://lore.kernel.org/lkml/20200827063522.2563293-1-lokeshgidra@google.com/ (local)

The value "0" was motivated by the concern that uffd can enlarge the
race window for use after free by providing one more additional way to
block kernel faults, but value "2" is already enough to solve that
concern completely and it'll be the default on all Android.

In other words by adding "2" you're effectively doing a more
finegrined and more optimal implementation of "0" that remains useful
and available to unprivileged apps and it already resolves all
"robustness against side effects other kernel bugs" concerns. Clearly
"0" is even more secure statistically but that would apply to every
other syscall including vmsplice, and there's no
/proc/sys/vm/unprivileged_vmsplice sysctl out there.

The next issue we have now is with the pipe mutex (which is not a
major concern but we need to solve it somehow for correctness). So I
wonder if should make the default value to be "0" (or "2" if think we
should not replace "0") and to allow only user initiated faults by
default.

Changing the sysctl default to be 0, will make live migration fail to
switch to postcopy which will be (unnoticeable to the guest), instead
of risking the VM to be killed because of network latency
outlier. Then we wouldn't need to change the pipe code at all.

SGTM. I can change the default value to '0' (or '2') in the next
revision of patch 2/2, unless somebody objects to this.

Alternatively we could still fix the pipe code so it runs better (but
it'll be more complex) or to disable uffd faults only in the pipe
code.

One thing to keep in mind is that if we change the default, then
hypervisor hosts running QEMU would need to set:

vm.userfaultfd = 1

in /etc/sysctl.conf if postcopy live migration is required, that's not
particularly concerning constraint for qemu (there are likely other
tweaks required and it looks less risky than an arbitrary timeout
which could kill the VM: if the above is forgotten the postcopy live
migration won't even start and it'll be unnoticeable to the guest).

The main concern really are future apps that may want to use uffd for
kernel initiated faults won't be allowed to do so by default anymore,
those apps will be heavily incentivated to use bounce buffers before
passing data to syscalls, similarly to the current use case of patch 2/2.

Comments welcome,
Andrea

PS. Another usage of uffd that remains possible without privilege with
the 2/2 patch sysctl "2" behavior (besides the strict SIGSEGV
acceleration) is the UFFD_FEATURE_SIGBUS. That's good so a malloc lib
will remain possible without requiring extra privileges, by adding a
UFFDIO_POPULATE to use in combination with UFFD_FEATURE_SIGBUS
(UFFDIO_POPULATE just needs to zero out a page and map it, it'll be
indistinguishable to UFFDIO_ZEROPAGE but it will solve the last
performance bottleneck by avoiding a wrprotect fault after the
allocation and it will be THP capable too). Memory will be freed with
MADV_DONTNEED, without ever having to call mmap/mumap. It could move
memory around with UFFDIO_COPY+MADV_DONTNEED or by adding UFFDIO_REMAP
which already exists.

UFFDIO_POPULATE sounds like a really useful feature. I don't see it in
the kernel yet. Is there a patch under work on this? If so, kindly
share.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help