[kernel-hardening] Re: [PATCH v9 1/4] syscalls: Verify address limit before returning to user-mode
From: Kees Cook <hidden>
Date: 2017-05-09 16:30:07
Also in:
linux-api, linux-s390, lkml
On Mon, May 8, 2017 at 11:56 PM, Ingo Molnar [off-list ref] wrote:
* Kees Cook [off-list ref] wrote:quoted
quoted
There's the option of using GCC plugins now that the infrastructure was upstreamed from grsecurity. It can be used as part of the regular build process and as long as the analysis is pretty simple it shouldn't hurt compile time much.Well, and that the situation may arise due to memory corruption, not from poorly-matched set_fs() calls, which static analysis won't help solve. We need to catch this bad kernel state because it is a very bad state to run in.
[attempting some thread-merging]
Ok, so that's CVE-2010-4258, where an oops with KERNEL_DS set was used to escalate privileges, due to the kernel's oops handler not cleaning up the KERNEL_DS. The exploit used another bug, a crash in a network protocol handler, to execute the oops handler with KERNEL_DS set.
Right, I didn't mean to suggest that vulnerability would be fixed by this solution. I was trying to show how there can be some pretty complex interaction with exceptions/interrupts/etc that would make pure static analysis still miss things.
If memory corruption corrupted the task state into having addr_limit set to KERNEL_DS then there's already a fair chance that it's game over: it could also have set *uid to 0, or changed a sensitive PF_ flag, or a number of other things... Furthermore, think about it: there's literally an infinite amount of corrupted task states that could be a security problem and that could be checked after every system call. Do we want to check every one of them?
Right, but this "slippery slope" argument isn't the best way to reject security changes. Let me take a step back and describe the threat, and where we should likely spend time: The primary threat with addr_limit getting changed is that a narrowly-scoped attack (traditionally stack exhaustion or adjacent-stack large-index writes) could be leveraged into opening the entire kernel to writes (by allowing all syscalls with a copy_to_user() call to suddenly be able to write to kernel memory). So, really, the flaw is having addr_limit at all. Removing set_fs() should, I think, allow this to become a const (or at least should get us a lot closer). The main path to corrupting addr_limit has been via stack corruption. On architectures with CONFIG_THREAD_INFO_IN_TASK, this risk is greatly reduced already, but it's not universally available yet. (And as long as we're talking about stack attacks, CONFIG_VMAP_STACK makes cross-stack overflows go away, and cross-stack indexing harder, but that's not really about addr_limit since currently nothing with VMAP_STACK doesn't already have THREAD_INFO_IN_TASK.) So, left with a still exploitable target in memory that allows such an expansion of attack method, I still think it's worth keeping this patch series, but if we can drop set_fs() I could probably be convinced the benefit of the series doesn't exceed the cost on THREAD_INFO_IN_TASK-architectures (x86, arm64, s390). But that means at least currently keeping it on arm, for example. If we can make addr_limit const, well, we don't need the series at all. -Kees -- Kees Cook Pixel Security