Re: [PATCH v2 0/8] Fork brute force attack mitigation

From: Kees Cook <hidden>
Date: 2020-11-11 00:11:00
Also in: linux-doc, lkml

On Sun, Oct 25, 2020 at 02:45:32PM +0100, John Wood wrote:

Attacks against vulnerable userspace applications that aim to break
ASLR or bypass canaries traditionally use some level of brute force with the
help of the fork system call. This is possible because a new process created
with fork has the same memory contents as its parent (the process that
called fork). So, the attacker can probe the memory an unlimited number of
times to find the correct memory values or the correct memory addresses
without worrying about crashing the application.

Based on the above scenario it would be nice to have this detected and
mitigated, and this is the goal of this patch series.

Thanks for preparing the v2! I spent some time looking at this today,
and I really like how it has been rearranged into an LSM. This feels
much more natural.

Various notes:

The locking isn't right; it'll trip with CONFIG_PROVE_LOCKING=y.
Here's the giant splat:


[    8.205146] brute: Fork brute force attack detected
[    8.206821] 
[    8.207317] =====================================================
[    8.209392] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
[    8.211852] 5.10.0-rc3 #2 Not tainted
[    8.213215] -----------------------------------------------------
[    8.215213] runc/2505 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[    8.217387] ffffffff97206098 (tasklist_lock){.+.+}-{2:2}, at: brute_task_fatal_signal.cold+0x49/0xe0
[    8.219554] 
[    8.219554] and this task is already holding:
[    8.220891] ffff9a6a10eb8318 (&stats->lock){+.-.}-{2:2}, at: brute_task_fatal_signal+0x97/0x210
[    8.222856] which would create a new lock dependency:
[    8.223943]  (&stats->lock){+.-.}-{2:2} -> (tasklist_lock){.+.+}-{2:2}
[    8.225360] 
[    8.225360] but this new dependency connects a SOFTIRQ-irq-safe lock:
[    8.227081]  (&stats->lock){+.-.}-{2:2}
[    8.227084] 
[    8.227084] ... which became SOFTIRQ-irq-safe at:
[    8.228732]   lock_acquire+0x13f/0x3f0
[    8.229381]   _raw_spin_lock+0x2c/0x40
[    8.230024]   brute_task_free+0x22/0x90
[    8.230619]   security_task_free+0x22/0x50
[    8.231260]   __put_task_struct+0x58/0x140
[    8.231924]   rcu_core+0x2bb/0x620
[    8.232492]   __do_softirq+0x156/0x4d9
[    8.233080]   asm_call_irq_on_stack+0x12/0x20
[    8.233759]   do_softirq_own_stack+0x5b/0x70
[    8.234425]   irq_exit_rcu+0x9f/0xe0
[    8.235221]   sysvec_apic_timer_interrupt+0x43/0xa0
[    8.235976]   asm_sysvec_apic_timer_interrupt+0x12/0x20
[    8.236794]   _raw_write_unlock_irq+0x2c/0x40
[    8.237465]   copy_process+0x15e6/0x1cb0
[    8.237966]   kernel_clone+0x9b/0x3f0
[    8.238413]   kernel_thread+0x55/0x70
[    8.238861]   call_usermodehelper_exec_work+0x77/0xb0
[    8.239466]   process_one_work+0x23e/0x580
[    8.239969]   worker_thread+0x55/0x3c0
[    8.240425]   kthread+0x141/0x160
[    8.240833]   ret_from_fork+0x22/0x30
[    8.241276] 
[    8.241276] to a SOFTIRQ-irq-unsafe lock:
[    8.241939]  (tasklist_lock){.+.+}-{2:2}
[    8.241940] 
[    8.241940] ... which became SOFTIRQ-irq-unsafe at:
[    8.243180] ...
[    8.243182]   lock_acquire+0x13f/0x3f0
[    8.243853]   _raw_read_lock+0x5d/0x70
[    8.244307]   do_wait+0xd2/0x2e0
[    8.244706]   kernel_wait+0x49/0x90
[    8.245126]   call_usermodehelper_exec_work+0x61/0xb0
[    8.245748]   process_one_work+0x23e/0x580
[    8.246238]   worker_thread+0x55/0x3c0
[    8.246685]   kthread+0x141/0x160
[    8.247086]   ret_from_fork+0x22/0x30
[    8.247522] 
[    8.247522] other info that might help us debug this:
[    8.247522] 
[    8.248480]  Possible interrupt unsafe locking scenario:
[    8.248480] 
[    8.249289]        CPU0                    CPU1
[    8.249839]        ----                    ----
[    8.250386]   lock(tasklist_lock);
[    8.250811]                                local_irq_disable();
[    8.251525]                                lock(&stats->lock);
[    8.252227]                                lock(tasklist_lock);
[    8.252938]   <Interrupt>
[    8.253249]     lock(&stats->lock);
[    8.253663] 
[    8.253663]  *** DEADLOCK ***
[    8.253663] 
[    8.254362] 1 lock held by runc/2505:
[    8.254800]  #0: ffff9a6a10eb8318 (&stats->lock){+.-.}-{2:2}, at: brute_task_fatal_signal+0x97/0x210
[    8.255872] 
[    8.255872] the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
[    8.256931] -> (&stats->lock){+.-.}-{2:2} {
[    8.257432]    HARDIRQ-ON-W at:
[    8.257809]                     lock_acquire+0x13f/0x3f0
[    8.258442]                     _raw_spin_lock+0x2c/0x40
[    8.259069]                     brute_task_free+0x22/0x90
[    8.259710]                     security_task_free+0x22/0x50
[    8.260384]                     __put_task_struct+0x58/0x140
[    8.261078]                     rcu_core+0x2bb/0x620
[    8.261677]                     __do_softirq+0x156/0x4d9
[    8.262304]                     asm_call_irq_on_stack+0x12/0x20
[    8.263010]                     do_softirq_own_stack+0x5b/0x70
[    8.263699]                     irq_exit_rcu+0x9f/0xe0
[    8.264315]                     sysvec_apic_timer_interrupt+0x43/0xa0
[    8.265085]                     asm_sysvec_apic_timer_interrupt+0x12/0x20
[    8.265886]                     _raw_write_unlock_irq+0x2c/0x40
[    8.266580]                     copy_process+0x15e6/0x1cb0
[    8.267218]                     kernel_clone+0x9b/0x3f0
[    8.267832]                     kernel_thread+0x55/0x70
[    8.268444]                     call_usermodehelper_exec_work+0x77/0xb0
[    8.269234]                     process_one_work+0x23e/0x580
[    8.269912]                     worker_thread+0x55/0x3c0
[    8.270546]                     kthread+0x141/0x160
[    8.271144]                     ret_from_fork+0x22/0x30
[    8.271771]    IN-SOFTIRQ-W at:
[    8.272146]                     lock_acquire+0x13f/0x3f0
[    8.272780]                     _raw_spin_lock+0x2c/0x40
[    8.273398]                     brute_task_free+0x22/0x90
[    8.274022]                     security_task_free+0x22/0x50
[    8.274690]                     __put_task_struct+0x58/0x140
[    8.275358]                     rcu_core+0x2bb/0x620
[    8.275937]                     __do_softirq+0x156/0x4d9
[    8.276559]                     asm_call_irq_on_stack+0x12/0x20
[    8.277262]                     do_softirq_own_stack+0x5b/0x70
[    8.277948]                     irq_exit_rcu+0x9f/0xe0
[    8.278561]                     sysvec_apic_timer_interrupt+0x43/0xa0
[    8.279324]                     asm_sysvec_apic_timer_interrupt+0x12/0x20
[    8.280121]                     _raw_write_unlock_irq+0x2c/0x40
[    8.280845]                     copy_process+0x15e6/0x1cb0
[    8.281491]                     kernel_clone+0x9b/0x3f0
[    8.282117]                     kernel_thread+0x55/0x70
[    8.282734]                     call_usermodehelper_exec_work+0x77/0xb0
[    8.283514]                     process_one_work+0x23e/0x580
[    8.284181]                     worker_thread+0x55/0x3c0
[    8.284802]                     kthread+0x141/0x160
[    8.285381]                     ret_from_fork+0x22/0x30
[    8.285998]    INITIAL USE at:
[    8.286365]                    lock_acquire+0x13f/0x3f0
[    8.286975]                    _raw_spin_lock_irqsave+0x3b/0x60
[    8.287665]                    brute_share_stats+0x17/0x70
[    8.288311]                    brute_task_alloc+0x65/0x70
[    8.288948]                    security_task_alloc+0x44/0xd0
[    8.289605]                    copy_process+0x789/0x1cb0
[    8.290223]                    kernel_clone+0x9b/0x3f0
[    8.290819]                    kernel_thread+0x55/0x70
[    8.291421]                    rest_init+0x21/0x258
[    8.291995]                    start_kernel+0x566/0x587
[    8.292611]                    secondary_startup_64_no_verify+0xc2/0xcb
[    8.293384]  }
[    8.293586]  ... key      at: [<ffffffff984ce3a0>] __key.0+0x0/0x10
[    8.294324]  ... acquired at:
[    8.294679]    lock_acquire+0x13f/0x3f0
[    8.295133]    _raw_read_lock+0x5d/0x70
[    8.295589]    brute_task_fatal_signal.cold+0x49/0xe0
[    8.296185]    security_task_fatal_signal+0x22/0x30
[    8.296772]    get_signal+0x3e0/0xcf0
[    8.297212]    arch_do_signal+0x30/0x880
[    8.297682]    exit_to_user_mode_prepare+0xfc/0x170
[    8.298263]    syscall_exit_to_user_mode+0x38/0x240
[    8.298844]    entry_SYSCALL_64_after_hwframe+0x44/0xa9
[    8.299464] 
[    8.299648] 
[    8.299648] the dependencies between the lock to be acquired
[    8.299648]  and SOFTIRQ-irq-unsafe lock:
[    8.300969] -> (tasklist_lock){.+.+}-{2:2} {
[    8.301481]    HARDIRQ-ON-R at:
[    8.301857]                     lock_acquire+0x13f/0x3f0
[    8.302482]                     _raw_read_lock+0x5d/0x70
[    8.303110]                     do_wait+0xd2/0x2e0
[    8.303677]                     kernel_wait+0x49/0x90
[    8.304266]                     call_usermodehelper_exec_work+0x61/0xb0
[    8.305038]                     process_one_work+0x23e/0x580
[    8.305700]                     worker_thread+0x55/0x3c0
[    8.306320]                     kthread+0x141/0x160
[    8.306895]                     ret_from_fork+0x22/0x30
[    8.307506]    SOFTIRQ-ON-R at:
[    8.307880]                     lock_acquire+0x13f/0x3f0
[    8.308505]                     _raw_read_lock+0x5d/0x70
[    8.309125]                     do_wait+0xd2/0x2e0
[    8.309689]                     kernel_wait+0x49/0x90
[    8.310272]                     call_usermodehelper_exec_work+0x61/0xb0
[    8.311044]                     process_one_work+0x23e/0x580
[    8.311709]                     worker_thread+0x55/0x3c0
[    8.312335]                     kthread+0x141/0x160
[    8.312910]                     ret_from_fork+0x22/0x30
[    8.313529]    INITIAL USE at:
[    8.313897]                    lock_acquire+0x13f/0x3f0
[    8.314515]                    _raw_write_lock_irq+0x34/0x50
[    8.315187]                    copy_process+0x11d5/0x1cb0
[    8.315834]                    kernel_clone+0x9b/0x3f0
[    8.316446]                    kernel_thread+0x55/0x70
[    8.317057]                    rest_init+0x21/0x258
[    8.317636]                    start_kernel+0x566/0x587
[    8.318250]                    secondary_startup_64_no_verify+0xc2/0xcb
[    8.319033]    INITIAL READ USE at:
[    8.319452]                         lock_acquire+0x13f/0x3f0
[    8.320117]                         _raw_read_lock+0x5d/0x70
[    8.320777]                         do_wait+0xd2/0x2e0
[    8.321387]                         kernel_wait+0x49/0x90
[    8.322031]                         call_usermodehelper_exec_work+0x61/0xb0
[    8.322852]                         process_one_work+0x23e/0x580
[    8.323557]                         worker_thread+0x55/0x3c0
[    8.324220]                         kthread+0x141/0x160
[    8.324837]                         ret_from_fork+0x22/0x30
[    8.325490]  }
[    8.325694]  ... key      at: [<ffffffff97206098>] tasklist_lock+0x18/0x40
[    8.326509]  ... acquired at:
[    8.326871]    lock_acquire+0x13f/0x3f0
[    8.327330]    _raw_read_lock+0x5d/0x70
[    8.327790]    brute_task_fatal_signal.cold+0x49/0xe0
[    8.328392]    security_task_fatal_signal+0x22/0x30
[    8.328958]    get_signal+0x3e0/0xcf0
[    8.329388]    arch_do_signal+0x30/0x880
[    8.329855]    exit_to_user_mode_prepare+0xfc/0x170
[    8.330434]    syscall_exit_to_user_mode+0x38/0x240
[    8.331010]    entry_SYSCALL_64_after_hwframe+0x44/0xa9
[    8.331622] 
[    8.331801] 
[    8.331801] stack backtrace:
[    8.332322] CPU: 2 PID: 2505 Comm: runc Not tainted 5.10.0-rc3 #2
[    8.333037] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
[    8.334088] Call Trace:
[    8.334385]  dump_stack+0x77/0x97
[    8.334779]  check_irq_usage.cold+0x279/0x2ec
[    8.335292]  ? check_noncircular+0x75/0x110
[    8.335783]  __lock_acquire+0x12e0/0x2590
[    8.336256]  lock_acquire+0x13f/0x3f0
[    8.336696]  ? brute_task_fatal_signal.cold+0x49/0xe0
[    8.337284]  _raw_read_lock+0x5d/0x70
[    8.337721]  ? brute_task_fatal_signal.cold+0x49/0xe0
[    8.338315]  brute_task_fatal_signal.cold+0x49/0xe0
[    8.338885]  security_task_fatal_signal+0x22/0x30
[    8.339435]  get_signal+0x3e0/0xcf0
[    8.339849]  arch_do_signal+0x30/0x880
[    8.340290]  ? rcu_read_lock_sched_held+0x3f/0x70
[    8.340853]  ? kfree+0x25d/0x2c0
[    8.341241]  exit_to_user_mode_prepare+0xfc/0x170
[    8.341796]  syscall_exit_to_user_mode+0x38/0x240
[    8.342350]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[    8.342945] RIP: 0033:0x461923
[    8.343310] Code: 24 20 c3 cc cc cc cc 48 8b 7c 24 08 8b 74 24 10 8b 54 24 14 4c 8b 54 24 18 4c 8b 44 24 20 44 8b 4c 24 28 b8 ca 00 00 00 0f 05 <89> 44 24 30 c3 cc cc cc cc cc cc cc cc 8b 7c 24 08 48 8b 74 24 10
[    8.345488] RSP: 002b:00007fd62affcda8 EFLAGS: 00000286 ORIG_RAX: 00000000000000ca
[    8.346385] RAX: fffffffffffffe00 RBX: 000000c000032e00 RCX: 0000000000461923
[    8.347225] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 0000000000b3b898
[    8.348071] RBP: 00007fd62affcdf0 R08: 0000000000000000 R09: 0000000000000000
[    8.349006] R10: 0000000000000000 R11: 0000000000000286 R12: 000000c000001680
[    8.349838] R13: 00007ffd85b25fcf R14: 00007ffd85b260d0 R15: 00007fd62affcfc0

I think it should be possible to use existing task locking semantics
to manage the statistics, but I'll need to take a closer look.

Other implementations
---------------------

In summary, the public version of grsecurity is based on the idea of
delaying the fork system call if a child died due to a fatal error. This
has some issues:

1.- Bad practices: Adding delays to the kernel is, in general, a bad idea.

2.- Weak points: This protection can be bypassed using two different
    methods since it acts only when the fork is called after a child has
    crashed.

    2.1.- Bypass 1: So, it would still be possible for an attacker to fork
          a large number of children (in the order of thousands), then
          probe all of them, and finally wait out the protection time
          before repeating the steps.

    2.2.- Bypass 2: This method is based on the idea that the protection
          doesn't act if the parent crashes. So, it would still be possible
          for an attacker to fork a process and probe itself. Then, fork
          the child process and probe itself again. This way, these steps
          can be repeated indefinitely without any mitigation.

It's good to clarify what the expected behaviors should be; however,
while working with the resulting system, it wasn't clear what the threat
model was for this defense. I think we need two things: clear
descriptions of what is expected to be detected (and what is not), and a
set of self-tests that can be used to validate those expectations.

Also, what, specifically, does "fatal error" cover? Is it strictly fatal
signals? (i.e. "error" might refer to exit code, for example.)

This implementation
-------------------

The main idea behind this implementation is to improve on the existing ones
by focusing on the weak points noted above. The solution for the first
bypass method is to detect a fast crash rate instead of only one simple
crash. For the second bypass method the solution is to detect both the
crash of parent and child processes. Moreover, as a mitigation method it is
better to kill all the offending tasks involved in the attack instead of
using delays.

So, the solution to the two bypass methods described above is to use
some statistical data shared across all the processes that can have the
same memory contents. Or in other words, statistical data shared between
all the processes in the fork hierarchy after an execve system call.

Hm, is this already tracked in some other way? i.e. the family hierarchy
of the mm struct? They're only shared on clone, but get totally copied on
fork(). Should that be the place where this is tracked instead? (i.e. I
could fork and totally rearrange my VMAs.)

The purpose of these statistics is to compute the application crash period
in order to detect an attack. This crash period is the time between the
execve system call and the first fault or the time between two consecutive
faults, but this has a drawback. If an application crashes once quickly
after the execve system call or crashes twice in a short period of time for
some reason, a false positive will be triggered. To avoid this scenario the
shared statistical data holds a list of the timestamps of the last i
crashes and the application crash period is computed as follows:

crash_period = (n_last_timestamp - n_minus_i_timestamp) / i;

This way, the size of the crash timestamps list allows fine tuning of the
detection sensitivity.
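
As a rough sketch of the computation above (a userspace model in C, not
code from the series): the struct, the function names and the fixed list
size are made up for illustration. The divisor is the number of intervals
between the stored timestamps, which matches the formula when the list
holds i + 1 entries. Requiring a full list before flagging an attack is an
assumption.

```c
#include <stddef.h>
#include <stdint.h>

#define BRUTE_MAX_TIMESTAMPS 8	/* hypothetical list size */

/* Hypothetical model of the shared statistics; not the patch's struct. */
struct crash_stats {
	uint64_t ts_ms[BRUTE_MAX_TIMESTAMPS];	/* ring buffer of crash times */
	size_t head;				/* next slot to overwrite */
	size_t count;				/* valid entries in ts_ms */
};

/* Record a crash at now_ms, overwriting the oldest entry when full. */
static void crash_stats_record(struct crash_stats *s, uint64_t now_ms)
{
	s->ts_ms[s->head] = now_ms;
	s->head = (s->head + 1) % BRUTE_MAX_TIMESTAMPS;
	if (s->count < BRUTE_MAX_TIMESTAMPS)
		s->count++;
}

/*
 * Average time between the recorded crashes:
 * (newest - oldest) / number of intervals.
 * Returns UINT64_MAX while fewer than two crashes are recorded.
 */
static uint64_t crash_stats_period(const struct crash_stats *s)
{
	size_t newest, oldest;

	if (s->count < 2)
		return UINT64_MAX;

	newest = (s->head + BRUTE_MAX_TIMESTAMPS - 1) % BRUTE_MAX_TIMESTAMPS;
	oldest = (s->head + BRUTE_MAX_TIMESTAMPS - s->count) %
		 BRUTE_MAX_TIMESTAMPS;

	return (s->ts_ms[newest] - s->ts_ms[oldest]) / (s->count - 1);
}

/* Assumption: only a full list with a short period counts as an attack. */
static int crash_stats_is_attack(const struct crash_stats *s,
				 uint64_t threshold_ms)
{
	return s->count == BRUTE_MAX_TIMESTAMPS &&
	       crash_stats_period(s) < threshold_ms;
}
```

With this model, one or two quick crashes never fill the list, so they
cannot trigger the detection on their own.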

Instead of a list, can't the rate just be calculated on an on-going
basis?
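
One possible reading of "on-going" is an exponentially weighted moving
average of the time between crashes, which needs only the previous
timestamp and one accumulator. The sketch below is a userspace model under
that assumption; the names and the 1/4 weight are hypothetical and nothing
here comes from the series.

```c
#include <stdint.h>

#define BRUTE_EWMA_SHIFT 2	/* hypothetical: a new sample weighs 1/4 */

/* Hypothetical per-hierarchy state; replaces the timestamp list. */
struct crash_rate {
	uint64_t last_ms;	/* time of the previous crash (or of execve) */
	uint64_t avg_period_ms;	/* smoothed time between crashes */
	unsigned int crashes;	/* number of crashes seen so far */
};

static void crash_rate_init(struct crash_rate *r, uint64_t start_ms)
{
	r->last_ms = start_ms;
	r->avg_period_ms = 0;
	r->crashes = 0;
}

/* Fold one crash at now_ms into the average and return the new value. */
static uint64_t crash_rate_update(struct crash_rate *r, uint64_t now_ms)
{
	uint64_t delta = now_ms - r->last_ms;

	if (r->crashes == 0)
		r->avg_period_ms = delta;	/* first sample seeds the average */
	else if (delta >= r->avg_period_ms)
		r->avg_period_ms += (delta - r->avg_period_ms) >> BRUTE_EWMA_SHIFT;
	else
		r->avg_period_ms -= (r->avg_period_ms - delta) >> BRUTE_EWMA_SHIFT;

	r->last_ms = now_ms;
	r->crashes++;
	return r->avg_period_ms;
}
```

The trade-off is that the weight takes over the role of the list size: a
smaller weight for new samples makes the average slower to react to a
burst of crashes.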

When this crash period falls under a certain threshold there is a clear
signal that something malicious is happening. Once detected, the mitigation
only kills the processes that share the same statistical data and so, all
the tasks that can have the same memory contents. This way, an attack is
rejected.

Here's where I think the threat model needs some more work. The above
describes what I think is a less common situation. I expect the common
attack to hold still with a single value, and let the fork/exec spin
until the value lines up. (i.e. a fork is required.)

Here are the threat scenarios that come to mind for me:

1- launching (fork/exec) a setuid process repeatedly until you get a
   desirable memory layout (e.g. what Stack Clash[1] did).

2- connecting to an exec()ing network daemon (e.g. xinetd) repeatedly
   until you get a desirable memory layout (e.g. what CTFs do for simple
   network service[2]).

3- launching processes _without_ exec (e.g. Android Zygote[3]), and
   exposing state to attack a sibling.

4- connecting to a fork()ing network daemon (e.g. apache) repeatedly
   until you expose the previously-shared memory layout of all the other
   children (e.g. kind of related to HeartBleed[4], though that was a
   direct exposure not a crasher).

In each case, a privilege boundary is being crossed (setuid in the first,
priv-changes in the 2nd, and network-to-local in the latter two), so I
suspect that kind of detail will need to play a part in the design to
help avoid false positives.

Regardless, when I tested this series, 1 and 2 aren't detected, since
they pass through an execve(), and I think that needs to be covered as
well.

[1] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[2] https://github.com/BSidesPDX/CTF-2017/blob/master/pwn/200-leek/src/leek.service
[3] https://link.springer.com/article/10.1007/s10207-018-00425-8
[4] https://heartbleed.com/

1.- Per system enabling: This feature can be enabled at build time using
    the CONFIG_SECURITY_FORK_BRUTE option or using the visual config
    application under the following menu:

    Security options ---> Fork brute force attack detection and mitigation

(there is a built-in boot time disabling too, by changing the lsm=
bootparam)

2.- Per process enabling/disabling: To allow specific applications to
    turn off or turn on the detection and mitigation of a fork brute force
    attack when required, there are two new prctls.

    prctl(PR_SECURITY_FORK_BRUTE_ENABLE, 0, 0, 0, 0)
    prctl(PR_SECURITY_FORK_BRUTE_DISABLE, 0, 0, 0, 0)

How do you see this being used?

3.- Fine tuning: To customize the detection's sensitivity there are two new
    sysctl attributes that set the size of the crash timestamps list and
    the application crash period threshold (in milliseconds). They are
    accessible through the following files, respectively.

    /proc/sys/kernel/brute/timestamps_list_size
    /proc/sys/kernel/brute/crash_period_threshold

    The list size helps avoid false positives due to crashes unrelated
    to a real attack. The period threshold sets the time limit to detect
    an attack. And, since a fork brute force attack will be detected if the
    application crash period falls under this threshold, the higher this
    value, the more sensitive the detection will be.
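
As a small illustration of how the two knobs could be read from userspace,
here is a sketch of a helper that parses one unsigned integer from a
sysctl file. The helper name is hypothetical, and the files above exist
only on a kernel with this series applied, so the function reports an
error on any other system.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read a single unsigned decimal value from a sysctl-style file.
 * Returns 0 and stores the value in *out on success, -1 on any error
 * (missing file, empty file, or contents that are not a number).
 */
static int read_sysctl_ulong(const char *path, unsigned long *out)
{
	char buf[32];
	char *end;
	unsigned long v;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	errno = 0;
	v = strtoul(buf, &end, 10);
	if (errno != 0 || end == buf)
		return -1;

	*out = v;
	return 0;
}
```

A caller would pass, for example,
"/proc/sys/kernel/brute/crash_period_threshold" and treat a -1 return as
"Brute is not available on this kernel".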

I wonder if these will be needed as we narrow in on the specific threat
model (i.e. there will be enough additional signal to obviate this
tuning).

And in testing I found another false positive that I haven't fully
diagnosed. I found that at boot, with Docker installed, "runc" would
immediately trip the mitigation. With some debugging added, it looks like
runc had several forked processes that got SIGKILLed in quick succession,
and then the entire group got killed by Brute. I haven't narrowed down
what runc is doing here, but it makes me wonder if there might need
to be an exception for user-space delivered signals, as opposed to
kernel-delivered signals...
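
For illustration only, the user/kernel distinction is visible from
userspace through siginfo: on Linux, si_code is SI_USER (or another value
<= 0) for signals sent with kill(), tgkill() or sigqueue(), and positive
for kernel-generated ones such as SEGV_MAPERR or SI_KERNEL. The sketch
below is a userspace demo with hypothetical helper names; it is not code
from the series, and whether Brute should key on this is an open question.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t last_si_code;
static volatile sig_atomic_t got_signal;

/* SA_SIGINFO handler: remember where the signal came from. */
static void record_origin(int sig, siginfo_t *info, void *ctx)
{
	(void)sig;
	(void)ctx;
	last_si_code = info->si_code;
	got_signal = 1;
}

/*
 * Send SIGUSR1 to ourselves with kill() and classify its origin.
 * Returns 1 if it was sent from user space (si_code <= 0),
 * 0 if it was generated by the kernel, -1 on error.
 */
static int probe_signal_origin(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = record_origin;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGUSR1, &sa, NULL) != 0)
		return -1;

	got_signal = 0;
	if (kill(getpid(), SIGUSR1) != 0)
		return -1;
	/* In a single-threaded process the handler runs before kill() returns. */
	if (!got_signal)
		return -1;

	return last_si_code <= 0;
}
```

A SIGKILL sent by a container runtime to its own children would fall on
the user-space side of this test, while a SIGSEGV from a bad memory access
would not.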

Thanks again for the work! I'm liking the idea of getting a solid
protection for this. It's been a long-standing hole in upstream. :)

-Kees

-- 
Kees Cook
