Re: [QUESTION] Full user space process isolation?

From: Roberto Sassu <hidden>
Date: 2023-07-06 15:05:31
Also in: keyrings, linux-hardening, linux-integrity, lkml

On Thu, 2023-07-06 at 13:35 +0200, Roberto Sassu wrote:

On Thu, 2023-07-06 at 05:53 -0500, Dr. Greg wrote:

On Tue, Jul 04, 2023 at 05:18:43PM +0200, Petr Tesarik wrote:

Good morning, I hope the week is going well for everyone.

On 7/3/2023 5:28 PM, Roberto Sassu wrote:

On Mon, 2023-07-03 at 17:06 +0200, Jann Horn wrote:

On Thu, Jun 22, 2023 at 4:45 PM Roberto Sassu
[off-list ref] wrote:

I wanted to execute some kernel workloads in a fully isolated user
space process, started from a binary statically linked with klibc,
connected to the kernel only through a pipe.

FWIW, the kernel has some infrastructure for this already, see
CONFIG_USERMODE_DRIVER and kernel/usermode_driver.c, with a usage
example in net/bpfilter/.

Thanks, I actually took that code to make a generic UMD management
library, that can be used by all use cases:

https://lore.kernel.org/linux-kernel/20230317145240.363908-1-roberto.sassu@huaweicloud.com/

I also wanted that, for the root user, tampering with that process is
as hard as if the same code runs in kernel space.

I believe that actually making it that hard would probably mean that
you'd have to ensure that the process doesn't use swap (in other
words, it would have to run with all memory locked), because root can
choose where swapped pages are stored. Other than that, if you mark it
as a kthread so that no ptrace access is allowed, you can probably get
pretty close. But if you do anything like that, please leave some way
(like a kernel build config option or such) to enable debugging for
these processes.

I didn't think about the swapping part... thanks!

Ok to enable debugging with a config option.

But I'm not convinced that it makes sense to try to draw a security
boundary between fully-privileged root (with the ability to mount
things and configure swap and so on) and the kernel - my understanding
is that some kernel subsystems don't treat root-to-kernel privilege
escalation issues as security bugs that have to be fixed.

Yes, that is unfortunately true, and in that case the trustworthy UMD
would not make things worse. On the other hand, on systems where that
separation is defined, the advantage would be to run more exploitable
code in user space, leaving the kernel safe.

I'm thinking about all the cases where the code had to be included in
the kernel to run at the same privilege level, but would not use any of
the kernel facilities (e.g. parsers).

Thanks for reminding me of kexec-tools. The complete image for booting a
new kernel was originally prepared in user space. With kernel lockdown,
all this code had to move into the kernel, adding a new syscall and lots
of complexity to build purgatory code, etc. Yet, this new implementation
in the kernel does not offer all features of kexec-tools, so both code
bases continue to exist and are happily diverging...

If the boundary is extended to user space, some of these components
could be moved away from the kernel, and the functionality would be the
same without decreasing the security.

All right, AFAICS your idea is limited to relatively simple cases
for now. I mean, allowing kexec-tools to run in user space is not
easily possible when UID 0 is not trusted, because kexec needs to
open various files and make various other syscalls, which would
require a complex LSM policy. It looks technically possible to write
one, but then the big question is if it would be simpler to review
and maintain than adding more kexec-tools features to the kernel.

You either need to develop and maintain a complex system-wide LSM
policy or you need a security model that is specifically tuned and
then scoped to the needs of the workload running on behalf of the
kernel as a UID=0 userspace process.

As I noted in my e-mail to Roberto, our TSEM LSM brings forward the
ability to do both, as a useful side effect of the need to limit model
complexity when the objective is to have a single functional
description of the security state of a system.

Anyway, I can sense a general desire to run less code in the most
privileged system environment. Roberto's proposal is one of the few that
go in this direction. What are the alternatives?

As I noted above, TSEM brings the ability to provide highly specific
and narrowly scoped security policy to a process hierarchy,
i.e. a workload.

However, regardless of the technology applied, in order to pursue
Roberto's UMD model of having a uid=0 process run tasks on behalf of
the kernel, there would seem to be a need to define what the security
objectives are.

From the outside looking in, there would seem to be a need to address
two primary issues:

1: Trust/constrain what the UMD process can do.

Very simple:

read from a kernel-opened fd, write to another kernel-opened fd, close
the fds and exit.

With the seccomp strict profile, a process cannot call any other system
call, and it gets killed if it does.

I tried to write a BPF filter, to see how far I can go, and that seems
sufficient to constrain what the UMD process can do.
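
A minimal sketch of what such a filter could look like, for illustration
only. It is not the filter from the patches, and the names
install_umd_filter(), umd_worker() and run_worker() are hypothetical.
Strict mode itself permits only read, write, _exit and sigreturn, so a
filter is needed if the process should also be able to close its fds.
The sketch additionally allows exit_group because glibc's _exit() uses
it, and it skips the seccomp_data.arch check that a real filter should
perform.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Allow only read, write, close, exit and exit_group; kill on anything else.
 * Simplification (assumption): no check of seccomp_data.arch. */
static int install_umd_filter(void)
{
	struct sock_filter insns[] = {
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_read, 5, 0),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_write, 4, 0),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_close, 3, 0),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit, 2, 0),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit_group, 1, 0),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(insns) / sizeof(insns[0]),
		.filter = insns,
	};

	/* Required to install a filter without CAP_SYS_ADMIN. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Hypothetical worker: copy one message from in_fd to out_fd, then exit.
 * With misbehave set, it first issues a syscall the filter does not allow. */
static void umd_worker(int in_fd, int out_fd, int misbehave)
{
	char buf[64];
	ssize_t n;

	if (install_umd_filter() != 0)
		_exit(2);
	if (misbehave)
		syscall(SYS_getpid);	/* not on the allow list: process is killed */
	n = read(in_fd, buf, sizeof(buf));
	if (n > 0 && write(out_fd, buf, (size_t)n) != n)
		_exit(1);
	close(in_fd);
	close(out_fd);
	_exit(0);
}

/* Stand-in for the kernel side: set up both pipes, start the worker,
 * collect the reply and return the worker's wait status (-1 on error). */
static int run_worker(int misbehave, char *reply, size_t len, ssize_t *got)
{
	int in[2], out[2], status = -1;
	pid_t pid;

	if (pipe(in) != 0 || pipe(out) != 0)
		return -1;
	/* Queue the request before forking, so the parent never writes to a
	 * pipe whose reader has already been killed. */
	if (write(in[1], "ping", 4) != 4)
		return -1;
	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		close(in[1]);
		close(out[0]);
		umd_worker(in[0], out[1], misbehave);	/* never returns */
	}
	close(in[0]);
	close(in[1]);
	close(out[1]);
	*got = read(out[0], reply, len);
	close(out[0]);
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return status;
}
```

Calling run_worker(0, ...) should give back the queued message and a
normal exit status, while run_worker(1, ...) should report that the
worker was terminated by a signal before it produced any output.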

Please note that the UMD process setup is done by the kernel, before
any user space code has the chance to run. The kernel is responsible for
properly establishing the communication with the UMD process.

2: Constrain what the system at large can do to the UMD process.

If someone outside can influence the behavior of the UMD process,
meaning altering the result, that would be unacceptable.

I found that denying ptrace with the UMD process as target more or less
covers everything, even attempts to read or write /proc/<pid>/fd/<N>.

There might be something more subtle, like what Jann pointed out:
avoiding swapping of the UMD process, as there is no integrity check
when a page comes back.

Other than that, I was also restricting kill on the UMD process; maybe
we have to do something similar with io_uring (but we would know if the
UMD process uses it). With that in place, the UMD process seems pretty
much isolated.

I would definitely not complicate things more than that; this problem
already seems difficult enough to solve.

Since the goal is very specific, I think writing a very small LSM would
make sense. With SELinux or Smack, you could also do it, but you would
have to take care of loading a policy, enforcing it, etc.

The main question is if the kernel is able to enforce isolation on the
UMD process as it would do for itself.

For those who didn't receive the patch set I just sent, I worked around
the first problem of supporting PGP for verifying the authenticity of
RPM headers and using them with IMA Appraisal.

I introduced a new TLV-based key format in the kernel, and plan to let
Linux distribution vendors convert PGP keys to this new format in their
(trusted) build infrastructure. The converted keys are embedded in
the kernel image.

Signatures can be converted in user space at run-time, since altering
them would make signature verification fail.

You can find the patch set here:

https://lore.kernel.org/linux-integrity/20230706144225.1046544-1-roberto.sassu@huaweicloud.com/

Thanks

Roberto
