Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures
From: Topi Miettinen <hidden>
Date: 2020-10-22 10:39:30
Also in: lkml
On 22.10.2020 10.54, Szabolcs Nagy wrote:
> The 10/21/2020 22:44, Jeremy Linton wrote:
>> There is a problem with glibc+systemd on BTI-enabled systems. Systemd has a service flag "MemoryDenyWriteExecute" which uses seccomp to deny PROT_EXEC changes. Glibc enables BTI only on segments which are marked as being BTI compatible by calling mprotect PROT_EXEC|PROT_BTI. That call is caught by the seccomp filter, resulting in service failures. So, at the moment one has to pick either denying PROT_EXEC changes, or BTI. This is obviously not desirable.
>>
>> Various changes have been suggested: replacing the mprotect with mmap calls having PROT_BTI set on the original mapping, re-mmapping the segments, implying PROT_EXEC on mprotect PROT_BTI calls when VM_EXEC is already set, and various modifications to seccomp to allow particular mprotect cases to bypass the filters. In each case there seems to be an undesirable attribute to the solution.
>>
>> So, what's the best solution?
>
> the easy fix in glibc is to ignore mprotect(PROT_BTI|PROT_EXEC) failures, so programs work with seccomp filters, but bti gets disabled (it's unreasonable to expect bti protection if mprotect is filtered). it will be a nasty silent failure though.
Some may also want seccomp filters that kill the process immediately on a violation, and with this fix they couldn't use them: the process would be killed before glibc ever saw a failure to ignore.
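As a rough illustration of the fallback discussed above, the following Python sketch models the choice glibc would make; it does not reproduce glibc's code. The function name `enable_bti_best_effort` and the two stub callables are hypothetical stand-ins for the real mprotect() call, and the EPERM error assumes a filter that returns an errno instead of killing the process.

```python
import errno

# Protection bits as defined in the Linux uapi headers.
# PROT_BTI is arm64-specific.
PROT_READ = 0x1
PROT_EXEC = 0x4
PROT_BTI = 0x10


def enable_bti_best_effort(mprotect, addr, length):
    """Try to add PROT_BTI to an already executable segment.

    `mprotect` is an injected callable standing in for the syscall
    (hypothetical; it raises OSError on failure). Returns True if BTI
    was enabled, False if the call failed and the failure was ignored.
    """
    try:
        mprotect(addr, length, PROT_READ | PROT_EXEC | PROT_BTI)
    except OSError:
        # Silent fallback: the program keeps running, but without BTI.
        return False
    return True


def filtered_mprotect(addr, length, prot):
    """Stub: a seccomp filter that answers any PROT_EXEC request with EPERM."""
    if prot & PROT_EXEC:
        raise OSError(errno.EPERM, "Operation not permitted")


def unfiltered_mprotect(addr, length, prot):
    """Stub: no filter installed, so the call succeeds."""
    return None
```

A filter whose action is to kill the process has no equivalent stub here: the call never returns, so there is no failure left to ignore.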
> and i'm also considering a fix that re-mmaps the executable segment with PROT_BTI instead of mprotect since that is not filtered. unfortunately the main exe is mmaped by the kernel without PROT_BTI and the libc does not have the fd to re-mmap. (bti can be left off for the main exe if mprotect fails and later we can teach the kernel to add bti there.) currently this is not a complete fix so i'm a bit hesitant about it.
>
> as for a kernel side fix: if there is a way to only filter PROT_EXEC mprotect on mappings that are not yet PROT_EXEC that would solve this problem (but likely needs new syscall or seccomp capability).
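The kernel-side idea in the quoted paragraph can be pictured as the difference between two predicates. This is only a sketch: both function names are hypothetical, and the second check needs the mapping's current flags, which a seccomp filter cannot see because it only inspects syscall arguments.

```python
PROT_EXEC = 0x4  # mprotect() protection bit
PROT_BTI = 0x10  # arm64-specific protection bit
VM_EXEC = 0x4    # kernel vm_flags bit set on an executable mapping


def arg_only_filter_denies(new_prot):
    """What an argument-only filter can express: deny any PROT_EXEC request."""
    return bool(new_prot & PROT_EXEC)


def state_aware_filter_denies(vm_flags, new_prot):
    """Hypothetical check: deny only when PROT_EXEC is newly added.

    `vm_flags` stands for the current flags of the target mapping.
    """
    return bool(new_prot & PROT_EXEC) and not (vm_flags & VM_EXEC)
```

Under the second predicate, mprotect(PROT_EXEC|PROT_BTI) on a segment that is already executable would pass, while turning a non-executable mapping executable would still be refused.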
The problem with seccomp MDWX (MemoryDenyWriteExecute) is that malicious programs can still circumvent the filter: they can call memfd_create(), fill the memory with the desired content, and then use mmap(,,PROT_EXEC) to make it executable without triggering seccomp. This can be mitigated by also filtering memfd_create(), but some programs want to use it. The protection can also be bypassed if the program can write to a file system which isn't mounted with "noexec". This can be mitigated with private mount namespaces and global mount options, but again some programs are written to expect W & X.

I think SELinux has a more complete solution (execmem), which can track the pages better than is possible with the seccomp solution, which has a very narrow field of view. Maybe this facility could be made available to non-SELinux systems, for example with prctl()? Then the in-kernel MDWX could allow mprotect(PROT_EXEC | PROT_BTI) if the backing file hasn't been modified, the source filesystem isn't writable for the calling process, and the file descriptor wasn't created with memfd_create().

-Topi

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel