Re: [RFC] Handle mapcount overflows

From: Jann Horn <jannh@google.com>
Date: 2019-05-01 14:41:40
Also in: lkml

[extremely slow reply]

On Fri, Mar 2, 2018 at 4:26 PM Matthew Wilcox [off-list ref] wrote:

Here's my third effort to handle page->_mapcount overflows.

The idea is to minimise overhead, so we keep a list of users with more
than 5000 mappings.  In order to overflow _mapcount, you have to have
2 billion mappings, so you'd need 400,000 tasks to evade the tracking,
and your sysadmin has probably accused you of forkbombing the system
long before then.  Not to mention the 6GB of RAM you consumed just in
stacks and the 24GB of RAM you consumed in page tables ... but I digress.

Let's assume that the sysadmin has increased the number of processes to
100,000.  You'd need to create 20,000 mappings per process to overflow
_mapcount, and they'd end up on the 'heavy_users' list.  Not everybody
on the heavy_users list is going to be guilty, but if we hit an overflow,
we look at everybody on the heavy_users list and if they've got the page
mapped more than 1000 times, they get a SIGSEGV.

I'm not entirely sure how to forcibly tear down a task's mappings, so
I've just left a comment in there to do that.  Looking for feedback on
this approach.

[...]

quoted hunk ↗ jump to hunk

diff --git a/mm/mmap.c b/mm/mmap.c
index 9efdc021ad22..575766ec02f8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c

[...]

+static void kill_mm(struct task_struct *tsk)
+{
+       /* Tear down the mappings first */
+       do_send_sig_info(SIGKILL, SEND_SIG_FORCED, tsk, true);
+}

The mapping teardown could maybe be something like
unmap_mapping_range_vma()? That doesn't remove the VMA, but it gets
rid of the PTEs; and it has the advantage of working without taking
the mmap_sem. And then it isn't even necessarily required to actually
kill the abuser; instead, the abuser would just take a minor fault on
the next access, and the abusers would take away each other's
references, slowing each other down.

+static void kill_abuser(struct mm_struct *mm)
+{
+       struct task_struct *tsk;
+
+       for_each_process(tsk)
+               if (tsk->mm == mm)
+                       break;

(There can be multiple processes sharing the ->mm.)

+       if (down_write_trylock(&mm->mmap_sem)) {
+               kill_mm(tsk);
+               up_write(&mm->mmap_sem);
+       } else {
+               do_send_sig_info(SIGKILL, SEND_SIG_FORCED, tsk, true);
+       }

Hmm. Having to fall back if the lock is taken here is kind of bad, I
think. __get_user_pages_locked() with locked==NULL can keep the
mmap_sem blocked arbitrarily long, meaning that an attacker could
force the fallback path, right? For example, __access_remote_vm() uses
get_user_pages_remote() with locked==NULL. And IIRC you can avoid
getting killed by a SIGKILL by being stuck in unkillable disk sleep,
which I think FUSE can create by not responding to a request.

+}
+
+void mm_mapcount_overflow(struct page *page)
+{
+       struct mm_struct *entry = current->mm;
+       unsigned int id;
+       struct vm_area_struct *vma;
+       struct address_space *mapping = page_mapping(page);
+       unsigned long pgoff = page_to_pgoff(page);
+       unsigned int count = 0;
+
+       vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff + 1) {

I think this needs the i_mmap_rwsem?

+               if (vma->vm_mm == entry)
+                       count++;
+               if (count > 1000)
+                       kill_mm(current);
+       }
+
+       rcu_read_lock();
+       idr_for_each_entry(&heavy_users, entry, id) {
+               count = 0;
+
+               vma_interval_tree_foreach(vma, &mapping->i_mmap,
+                               pgoff, pgoff + 1) {
+                       if (vma->vm_mm == entry)
+                               count++;
+                       if (count > 1000) {
+                               kill_abuser(entry);
+                               goto out;

Even if someone has 1000 mappings of the range in question, that
doesn't necessarily mean that there are actually any non-zero PTEs in
the abuser. This probably needs to get some feedback from
kill_abuser() to figure out whether at least one reference has been
reclaimed.

+                       }
+               }
+       }
+       if (!entry)
+               panic("No abusers found but mapcount exceeded\n");
+out:
+       rcu_read_unlock();
+}

[...]

quoted hunk ↗ jump to hunk

@@ -1357,6 +1466,8 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
        /* Too many mappings? */
        if (mm->map_count > sysctl_max_map_count)
                return -ENOMEM;
+       if (mm->map_count > mm_track_threshold)
+               mmap_track_user(mm, mm_track_threshold);

I think this check would have to be copied to a few other places;
AFAIK you can e.g. use a series of mremap() calls to create multiple
mappings of the same file page. Something like:

char *addr = mmap(0x100000000, 0x1000, PROT_READ, MAP_SHARED, fd, 0);
for (int i=0; i<1000; i++) {
  mremap(addr, 0x1000, 0x2000, 0);
  mremap(addr+0x1000, 0x1000, 0x1000, MREMAP_FIXED|MREMAP_MAYMOVE,
0x200000000 + i * 0x1000);
}

quoted hunk ↗ jump to hunk

        /* Obtain the address to map to. we verify (or select) it and ensure
         * that it represents a valid section of the address space.

@@ -2997,6 +3108,8 @@ void exit_mmap(struct mm_struct *mm)
        /* mm's last user has gone, and its about to be pulled down */
        mmu_notifier_release(mm);

+       mmap_untrack_user(mm);
+
        if (mm->locked_vm) {
                vma = mm->mmap;
                while (vma) {

I'd move that call further down, to reduce the chance that the task
blocks after being untracked but before actually dropping its
references.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help