Re: [PATCH 2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only
From: Lokesh Gidra <hidden>
Date: 2020-09-05 00:36:24
Also in:
linux-fsdevel, lkml
On Thu, Sep 3, 2020 at 8:34 PM Andrea Arcangeli [off-list ref] wrote:
Hello, On Mon, Aug 17, 2020 at 03:11:16PM -0700, Lokesh Gidra wrote:quoted
There has been an emphasis that Android is probably the only user for the restriction of userfaults from kernel-space and that it wouldn’t be useful anywhere else. I humbly disagree! There are various areas where the PROT_NONE+SIGSEGV trick is (and can be) used in a purely user-space setting. Basically, any lazy, on-demand,For the record what I said is quoted below https://lkml.kernel.org/r/20200520194804.GJ26186@redhat.com : """It all boils down of how peculiar it is to be able to leverage only the acceleration [..] Right now there's a single user that can cope with that limitation [..] If there will be more users [..] it'd be fine to add a value "2" later.""" Specifically I never said "that it wouldn’t be useful anywhere else.".
Thanks a lot for clarifying.
Also I'm only arguing about the sysctl visible kABI change in patch 2/2: the flag passed as parameter to the syscall in patch 1/2 is all great, because seccomp needs it in the scalar parameter of the syscall to implement a filter equivalent to your sysctl "2" policy with only patch 1/2 applied. I've two more questions now: 1) why don't you enforce the block of kernel initiated faults with seccomp-bpf instead of adding a sysctl value 2? Is the sysctl just an optimization to remove a few instructions per syscall in the bpf execution of Android unprivileged apps? You should block a lot of other syscalls by default to all unprivileged processes, including vmsplice. In other words if it's just for Android, why can't Android solve it with only patch 1/2 by tweaking the seccomp filter?
I would let Nick (nnk@) and Jeff (jeffv@) respond to this. The previous responses from both of them on this email thread (https://lore.kernel.org/lkml/CABXk95A-E4NYqA5qVrPgDF18YW-z4_udzLwa0cdo2OfqVsy=SQ@mail.gmail.com/ (local) and https://lore.kernel.org/lkml/CAFJ0LnGfrzvVgtyZQ+UqRM6F3M7iXOhTkUBTc+9sV+=RrFntyQ@mail.gmail.com/ (local)) suggest that the performance overhead of seccomp-bpf is too much. Kees also objected to it (https://lore.kernel.org/lkml/202005200921.2BD5A0ADD@keescook/ (local)) I'm not familiar with how seccomp-bpf works. All that I can add here is that userfaultfd syscall is usually not invoked in a performance critical code path. So, if the performance overhead of seccomp-bpf (if enabled) is observed on all syscalls originating from a process, then I'd say patch 2/2 is essential. Otherwise, it should be ok to let seccomp perform the same functionality instead.
2) given that Android is secure enough with the sysctl at value 2, why should we even retain the current sysctl 0 semantics? Why can't more secure systems just use seccomp and block userfaultfd, as it is already happens by default in the podman default seccomp whitelist (for those containers that don't define a new json whitelist in the OCI schema)? Shouldn't we focus our energy in making containers more secure by preventing the OCI schema of a random container to re-enable userfaultfd in the container seccomp filter instead of trying to solve this with a global sysctl? What's missing in my view is a kubernetes hard allowlist/denylist that cannot be overridden with the OCI schema in case people has the bad idea of running containers downloaded from a not fully trusted source, without adding virt isolation and that's an userland problem to be solved in the container runtime, not a kernel issue. Then you'd just add userfaultfd to the json of the k8s hard seccomp denylist instead of going around tweaking sysctl. What's your take in changing your 2/2 patch to just replace value "0" and avoid introducing a new value "2"?
SGTM. Disabling uffd completely for unprivileged processes can be achieved either using seccomp-bpf, or via SELinux, once the following patch series is upstreamed https://lore.kernel.org/lkml/20200827063522.2563293-1-lokeshgidra@google.com/ (local)
The value "0" was motivated by the concern that uffd can enlarge the race window for use after free by providing one more additional way to block kernel faults, but value "2" is already enough to solve that concern completely and it'll be the default on all Android. In other words by adding "2" you're effectively doing a more finegrined and more optimal implementation of "0" that remains useful and available to unprivileged apps and it already resolves all "robustness against side effects other kernel bugs" concerns. Clearly "0" is even more secure statistically but that would apply to every other syscall including vmsplice, and there's no /proc/sys/vm/unprivileged_vmsplice sysctl out there. The next issue we have now is with the pipe mutex (which is not a major concern but we need to solve it somehow for correctness). So I wonder if should make the default value to be "0" (or "2" if think we should not replace "0") and to allow only user initiated faults by default. Changing the sysctl default to be 0, will make live migration fail to switch to postcopy which will be (unnoticeable to the guest), instead of risking the VM to be killed because of network latency outlier. Then we wouldn't need to change the pipe code at all.
SGTM. I can change the default value to '0' (or '2') in the next revision of patch 2/2, unless somebody objects to this.
Alternatively we could still fix the pipe code so it runs better (but it'll be more complex) or to disable uffd faults only in the pipe code. One thing to keep in mind is that if we change the default, then hypervisor hosts running QEMU would need to set: vm.userfaultfd = 1 in /etc/sysctl.conf if postcopy live migration is required, that's not particularly concerning constraint for qemu (there are likely other tweaks required and it looks less risky than an arbitrary timeout which could kill the VM: if the above is forgotten the postcopy live migration won't even start and it'll be unnoticeable to the guest). The main concern really are future apps that may want to use uffd for kernel initiated faults won't be allowed to do so by default anymore, those apps will be heavily incentivated to use bounce buffers before passing data to syscalls, similarly to the current use case of patch 2/2. Comments welcome, Andrea PS. Another usage of uffd that remains possible without privilege with the 2/2 patch sysctl "2" behavior (besides the strict SIGSEGV acceleration) is the UFFD_FEATURE_SIGBUS. That's good so a malloc lib will remain possible without requiring extra privileges, by adding a UFFDIO_POPULATE to use in combination with UFFD_FEATURE_SIGBUS (UFFDIO_POPULATE just needs to zero out a page and map it, it'll be indistinguishable to UFFDIO_ZEROPAGE but it will solve the last performance bottleneck by avoiding a wrprotect fault after the allocation and it will be THP capable too). Memory will be freed with MADV_DONTNEED, without ever having to call mmap/mumap. It could move memory around with UFFDIO_COPY+MADV_DONTNEED or by adding UFFDIO_REMAP which already exists.
UFFDIO_POPULATE sounds like a really useful feature. I don't see it in the kernel yet. Is there a patch under work on this? If so, kindly share.