Re: [PATCH v10 05/11] seccomp: add system call filtering using BPF

From: Indan Zupancic <hidden>
Date: 2012-02-23 00:25:14
Also in: lkml, netdev

On Wed, February 22, 2012 20:47, Will Drewry wrote:

On Wed, Feb 22, 2012 at 2:19 AM, Indan Zupancic [off-list ref] wrote:

quoted

I highly disagree with every filter having to check the mode: Filters that
don't check the arch on e.g. x86 are buggy, so they have to check it, even
if it's a 32-bit or 64-bit only system, the filters can't know that and
needs to check the arch at every syscall entry. All other info in the data
depends on the arch, because of this there isn't much code to share between
the two archs, so you can as well have one filter for each arch.

Alternative approach: Tell the arch at filter install time and only run the
filters with the same arch as the current system call. If no filters are run,
deny the systemcall.

This was roughly how I first implemented compat and non-compat
support.  It causes some implicit behavior across inheritance that is
not nice though.

Same implicit behaviour Ben mentioned or something else?

Yeah, that's a bit of a problem. It can be solved within filters, but
it's starting to get more obscure than just checking the arch for every
syscall.

quoted

Advantages:

- Filters don't have to check the arch every syscall entry.

This I like.

quoted

- Secure by default. Filters don't have to do anything arch specific to
 be secure, no surprises possible.

This is partially true, but it is exactly why I hid compat before.

quoted

- If a new arch comes into existence, there is no chance of old filters
 becoming buggy and insecure. This is especially true for archs that
 had only one mode, but added another one later on: Old filters had no
 need to check the mode at all.

Perhaps.  A buggy filter that works on x86-64 might be exposed on a
new x32 ABI.  It's hard to predict how audit_arch and the syscall abi
will develop with new platforms.

It doesn't matter, if the filter assumes there are only two archs possible
and those two archs need different treatment, then the new arch at best
will only match one of them and it depends on which arch the filter checks
for. E.g. whether it does:

if (arch == AUDIT_ARCH_I386)
	...
else /* assume x86_64 */
	...

versus

if (arch == AUDIT_ARCH_X86_64)
	...
else /* assume i386 */
	...

quoted

- For kernels supporting only one arch, the check can be optimised away,
 by not installing unsupported arch filters at all.

Somewhat.  Without having a dedicated arch helper, you'd have to guess
that arches only support one or two arches (if compat is supported),
but I don't know if that is a safe assumption to make.

Well, if you want to optimise all checks away, then you obviously need
arch helpers. Without it, you have to install all filters, even the ones
you'll never run.

quoted

It's more secure, faster and simpler for the filters.

If something like this is implemented it's fine to expose the arch info
in the syscall data too, and have a way to install filters for all archs,
for the few cases where that might be useful, although I can't think of
any reason why people would like to do unnecessary work in the filters.

It seems to just add complexity to support both. I think we'll
probably end up with it in the filters for better or worse.  Possibly
JITing will be useful since at least a 32-bit load and je is pretty
cheap in native instructions.

Yeah, except that you can't easily do that because you don't have direct
access to the arch.

quoted

All that's needed is an extra argument to the prctl() call. I propose
0 for the current arch, -1 for all archs and anything else to specify
the arch. Installing a filter for an unsupported arch could return
ENOEXEC.

Without adding a per-arch call, there is no way to know all the
supported arches at install time.  Current arch, at least, can be
determined with a call to syscall_get_arch().

True.

As is, I'm not sure it makes sense to try to reserve two extra input
types: 0 and -1.  0 would be sane to treat as either a wildcard or
current because it is unlikely to be used by AUDIT_ARCH_* ever since
EM_NONE is assigned to 0.  However, I have no such insight into
whether it will ever be possible to compose 0xffffffff as an
AUDIT_ARCH_.

That seems impossible.

quoted

As far as the implementation goes, either have a list per supported arch
or store the arch per filter and check that before running the filter.

You can't do it per arch without adding even more per-arch
dependencies.  Keeping them annotated in the same list is the clearest
way I've seen so far, but it comes with its own burdens.

You could have a list per installed arch, so there is no need to know all
supported archs, if you don't have the per arch helpers.

I don't see how keeping the arch in the filter itself comes with a burden,
that's what you were basically doing with the compat flag anyway.

But keeping the check within the filter is the simplest solution it seems,
so ignore my objections and just let's live with the extra hassle at the
user space side.

quoted

What use is the instruction pointer considering it tells nothing about
the call path?

My fear of exposing the IP is that people will erroneously assume that it
says anything about the call path, and hence write insecure security code
that's easily bypassed by just jumping to the right instruction address.
And if the vDSO is used, the IP will always be the same.

So what use is knowing the IP?

quoted

Wouldn't it make sense to allow going from mode 2 to 1?
After all, the filter could have blocked it if it didn't
want to permit it, and mode 1 is more restrictive than
mode 2.

Nope - that might allow a downgrade that bypasses write/read
restrictions.  E.g., a filter could only allow a read to a certain buf
or of a certain size.  Allowing a downgrade would allow bypassing
those filters, whether they are the most sane things or not :)

But now you enforce that decision while the filter could make that
choice itself instead. If the filter doesn't allow read() and write(),
I really doubt it would allow prctl().

quoted

Out of curiosity, did you measure the kernel size differences before and
after these patches? Would be sad if sharing it with the networking code
didn't reduce the actual kernel size.

Oh yeah - it was a serious reduction.  Initially, seccomp_filter.o
added 8kb by itself.  With the merged seccomp.o, continued code
trimming (as suggested), and all the SECCOMP_RET_* variations, the
total kernel growth is 2972 bytes for the same kernel config.  This is
shared across ~2000 bytes in seccomp.o and ~800 bytes in filter.o.

Looks good, though the 800 extra bytes for filter.o seems high, it
used to be 292 bytes according to your email from January the 30th.
You said the run filter function added 861 bytes, so sharing doesn't
seem to reduce the kernel size any more?

Greetings,

Indan

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help