How should we handle variable address space sizes (Re: [RFC 3/4] x86/mm: define TASK_SIZE as current->mm->task_size)
From: hpa@zytor.com (hpa at zytor.com)
Date: 2017-01-02 23:08:56
Also in:
linux-api, linux-arch, linux-s390, lkml
On January 2, 2017 8:52:41 AM PST, Andy Lutomirski [off-list ref] wrote:
> On Mon, Jan 2, 2017 at 1:49 AM, Kirill A. Shutemov [off-list ref] wrote:
>> On Fri, Dec 30, 2016 at 06:11:05PM -0800, Andy Lutomirski wrote:
>>> On Fri, Dec 30, 2016 at 7:56 AM, Dmitry Safonov [off-list ref] wrote:
>>>> Keep task's virtual address space size as mm_struct field which exists for a long time - it's initialized in setup_new_exec() depending on the new task's personality. This way TASK_SIZE will always be the same as current->mm->task_size.
>>>>
>>>> Previously, there could be an issue about different values of TASK_SIZE and current->mm->task_size: e.g., a 32-bit process can unset the ADDR_LIMIT_3GB personality (with the personality syscall) and so TASK_SIZE will be 4GB, which is larger than mm->task_size = 3GB.
>>>>
>>>> As TASK_SIZE *and* current->mm->task_size are both used frequently in the code, this difference creates subtle situations, for example: one can mmap addresses > 3GB, but they will be hidden in /proc/pid/pagemap as it checks mm->task_size.
>>>>
>>>> I've moved initialization of mm->task_size earlier in setup_new_exec() as arch_pick_mmap_layout() initializes mmap_legacy_base with TASK_UNMAPPED_BASE, which depends on TASK_SIZE.
>>>
>>> I don't like this patch so much because I think that we should figure out how this will all work in the long run first. I've added some more people to the thread because other arches have similar issues and because x86 is about to get considerably more complicated (choices include 3GB, 4GB, 47-bit, and 56-bit (the latter IIRC)).
>>>
>>> Here are a few of my thoughts on the matter. This isn't all that well thought out:
>>>
>>> The address space limit, especially if CRIU is in play, isn't really a hard limit. For example, you could allocate high memory then lower the limit. Similarly, I see no reason that an x32 program should be forbidden from mapping some high addresses or, similarly, that an i386 program can't (if it really wanted to) do a 64-bit mmap() and get a high address.
>>>
>>> On that note, can we just *delete* the task_size check from pagemap? It's been there since the very beginning:
>>>
>>> commit 85863e475e59afb027b0113290e3796ee6020b7d
>>> Author: Matt Mackall [off-list ref]
>>> Date: Mon Feb 4 22:29:04 2008 -0800
>>>
>>>     maps4: add /proc/pid/pagemap interface
>>>
>>> and there's no explanation for why it's needed.
>>>
>>> So maybe we should have a *number* (not a bit) that indicates the maximum address that mmap() will return unless an override is in use.
>>>
>>> Since common practice seems to be to stick this in the personality field, we may need some fancy encoding. Executing a setuid binary needs to reset to the default, and personality handles that.
>>
>> If we want to be able to specify an arbitrary address as the maximum, a fancy encoding would need to claim 51 bits (63 VA - 12 in-page address) on x86 from the persona flag. To me, it's stretching the personality interface too far.
>>
>> Maybe it's easier to reset the rlimit for suid binaries?
>
> I guess I don't see why rlimit makes any sense, though. It's not a resource utilization control, hard vs soft limits make very little sense, requiring capabilities to exceed the hard limit doesn't help anything, and it's only useful to preserve it across execve() to work around bugs.
>
> So if it's going to be a number, let's just make it be a new number with a new API to control it.
>
> --Andy
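
As a rough illustration of the arithmetic behind the 51-bit estimate quoted above, here is a minimal Python sketch. It assumes 4 KiB pages (12 in-page offset bits) and a 63-bit ceiling on virtual addresses, as in the quoted message. The constants and the helpers `encode_limit` and `decode_limit` are hypothetical names used only for this example; they are not kernel interfaces.

```python
# Assumptions (illustration only): 4 KiB pages and a 63-bit ceiling on
# virtual addresses, matching the "63 VA - 12 in-page address" estimate.
PAGE_SHIFT = 12                    # bits of in-page offset
VA_BITS = 63                       # assumed upper bound on address width
FIELD_BITS = VA_BITS - PAGE_SHIFT  # bits needed for a page-granular limit


def encode_limit(max_addr: int) -> int:
    """Pack a page-aligned address limit into FIELD_BITS bits (hypothetical)."""
    if max_addr < 0 or max_addr >= 1 << VA_BITS:
        raise ValueError("limit outside the assumed 63-bit range")
    if max_addr & ((1 << PAGE_SHIFT) - 1):
        raise ValueError("limit must be page-aligned")
    # Dropping the in-page offset bits leaves the page number.
    return max_addr >> PAGE_SHIFT


def decode_limit(field: int) -> int:
    """Recover the address limit from the packed page number (hypothetical)."""
    return field << PAGE_SHIFT
```

Under these assumptions a 3GB limit (0xC0000000) packs to the page number 0xC0000, and the largest representable page-aligned limit needs all 51 bits of the field.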
It's an API that already exists, that's the plus. Hard and soft limits *do* make sense IMO.

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
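
For readers unfamiliar with the existing rlimit interface discussed here, the following is a minimal sketch of its soft/hard semantics using Python's standard `resource` module. It uses `RLIMIT_AS` purely as a stand-in: that limit caps the total size of the address space, not the highest address mmap() may return, and no rlimit for a maximum address is assumed to exist. The helper `set_soft_limit` is a hypothetical name for this example.

```python
import resource


def set_soft_limit(res: int, wanted: int) -> tuple:
    """Move the soft limit of `res` to `wanted`, clamped to the hard limit.

    An unprivileged process may set its soft limit anywhere up to the
    hard limit; the hard limit itself is left untouched here.
    """
    soft, hard = resource.getrlimit(res)
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)
    resource.setrlimit(res, (wanted, hard))
    return resource.getrlimit(res)


if __name__ == "__main__":
    # Ask for a 64 TiB soft cap on total address space size.
    print(set_soft_limit(resource.RLIMIT_AS, 1 << 46))
```

Limits set this way are inherited across fork() and preserved across execve(), which is the behaviour the thread is weighing for setuid binaries.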