Thread: 30 messages, 11 authors, 2024-10-24

Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

From: Steven Price <steven.price@arm.com>
Date: 2024-10-21 13:23:11
Also in: linux-alpha, linux-arch, linux-kselftest, linux-mips, linux-mm, linux-s390, linux-sh, lkml, loongarch, sparclinux

On 09/09/2024 10:46, Kirill A. Shutemov wrote:
On Thu, Sep 05, 2024 at 10:26:52AM -0700, Charlie Jenkins wrote:
On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote:
On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote:
Some applications rely on placing data in the free bits of addresses
allocated by mmap. Various architectures (e.g. x86, arm64, powerpc)
restrict the address returned by mmap to be less than the 48-bit address
space, unless the hint address uses more than 47 bits (the 48th bit is
reserved for the kernel address space).
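
A minimal sketch of the kind of pointer tagging described above, for
illustration only. It assumes a 64-bit build where user addresses fit in
the low 48 bits; ADDR_BITS, tag_ptr(), untag_ptr() and ptr_tag() are
hypothetical names, not taken from any of the applications mentioned in
this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: user addresses fit in the low 48 bits, leaving 16 bits free. */
#define ADDR_BITS 48
#define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1)

/* Pack a 16-bit tag into the unused high bits of a pointer. */
static inline uint64_t tag_ptr(const void *p, uint16_t tag)
{
	uint64_t addr = (uint64_t)(uintptr_t)p;

	/* This is the assumption that breaks if mmap() returns wider addresses. */
	assert((addr & ~ADDR_MASK) == 0);
	return addr | ((uint64_t)tag << ADDR_BITS);
}

/* Recover the original pointer by masking the tag off again. */
static inline void *untag_ptr(uint64_t tagged)
{
	return (void *)(uintptr_t)(tagged & ADDR_MASK);
}

/* Read back the tag stored in the high bits. */
static inline uint16_t ptr_tag(uint64_t tagged)
{
	return (uint16_t)(tagged >> ADDR_BITS);
}
```

With a wider address space such as sv57, an address with bits set above
ADDR_BITS would trip the assertion, or would be silently corrupted in
code that omits the check.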

The riscv architecture needs a way to similarly restrict the virtual
address space. The riscv port of OpenJDK throws an error if it is run
on the 57-bit address space, called sv57 [1]. golang has a comment that
sv57 support is not complete, but there are some workarounds to get it
to mostly work [2].
I also saw libmozjs crashing with 57-bit address space on x86.
These applications work on x86 because x86 implicitly restricts mmap()
addresses to 47 bits when the hint address is less than 48 bits.

Instead of implicitly restricting the address space on riscv (or any
current/future architecture), a flag would allow users to opt-in to this
behavior rather than opt-out as is done on other architectures. This is
desirable because it is a small class of applications that do pointer
masking.
You reiterate the argument about "small class of applications". But it
makes no sense to me.
Sorry to chime in late on this - I had been considering implementing
something like MAP_BELOW_HINT and found this thread.

While the examples of applications that want to use high VA bits and get
bitten by future upgrades are not very persuasive, it's worth pointing
out that there are a variety of somewhat horrid hacks out there to work
around this feature not existing.

E.g. from my brief research into other code:

  * Box64 seems to have a custom allocator based on reading 
    /proc/self/maps to allocate a block of VA space with a low enough 
    address [1]

  * PHP has code reading /proc/self/maps - I think this is to find a 
    segment which is close enough to the text segment [2]

  * FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit
    addresses [3][4]

  * pmdk has some funky code to find the lowest address that meets 
    certain requirements - this does look like an ASLR alternative and 
    probably couldn't directly use MAP_BELOW_HINT, although maybe this 
    suggests we need a mechanism to map without a VA-range? [5]

  * MIT-Scheme parses /proc/self/maps to find the lowest mapping within 
    a range [6]

  * LuaJIT uses an approach to 'probe' to find a suitable low address 
    for allocation [7]
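
To illustrate the general shape of the 'probe' workaround, here is a
minimal sketch. It is not the code from any of the projects listed
above; map_below(), PROBE_STEP and PROBE_MAX_TRIES are hypothetical
names, and it assumes Linux on a 64-bit target. It passes successively
lower hints to mmap() and keeps the first mapping that ends at or below
the requested limit.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define PROBE_STEP      ((uintptr_t)1 << 30)	/* hypothetical 1 GiB stride */
#define PROBE_MAX_TRIES 64			/* hypothetical retry cap */

/* Try to map len bytes so that the mapping ends at or below limit. */
static void *map_below(uintptr_t limit, size_t len)
{
	if (len == 0 || len > limit)
		return NULL;

	/* Start just under the limit, rounded down to a stride boundary. */
	uintptr_t hint = (limit - len) & ~(PROBE_STEP - 1);

	for (int i = 0; i < PROBE_MAX_TRIES && hint >= PROBE_STEP;
	     i++, hint -= PROBE_STEP) {
		void *p = mmap((void *)hint, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return NULL;
		if ((uintptr_t)p + len <= limit)
			return p;	/* the kernel placed it low enough */
		munmap(p, len);		/* too high: undo and probe lower */
	}
	return NULL;
}
```

The hint is only advisory, so every attempt has to be checked and
possibly undone, and nothing bounds how many attempts are needed. A flag
that makes the kernel enforce the upper bound would replace the loop
with a single call.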

The biggest benefit I see of MAP_BELOW_HINT is that it would allow a
library to get low addresses without causing any problems for the rest
of the application. The use case I'm looking at is in a library and 
therefore a personality mode wouldn't be appropriate (because I don't 
want to affect the rest of the application). Reading /proc/self/maps
is also problematic because other threads could be allocating/freeing
at the same time.

Thanks,
Steve


[1] https://sources.debian.org/src/box64/0.3.0+dfsg-1/src/custommem.c/
[2] https://sources.debian.org/src/php8.2/8.2.24-1/ext/opcache/shared_alloc_mmap.c/#L62
[3] https://github.com/FEX-Emu/FEX/blob/main/FEXCore/Source/Utils/Allocator.cpp
[4] https://github.com/FEX-Emu/FEX/commit/df2f1ad074e5cdfb19a0bd4639b7604f777fb05c
[5] https://sources.debian.org/src/pmdk/1.13.1-1.1/src/common/mmap_posix.c/?hl=29#L29
[6] https://sources.debian.org/src/mit-scheme/12.1-3/src/microcode/ux.c/#L826
[7] https://sources.debian.org/src/luajit/2.1.0+openresty20240815-1/src/lj_alloc.c/
With full address space by default, this small class of applications is
going to be *broken* unless they handle the RISC-V case specifically.

On the other hand, if you limit VA to 128TiB by default (like many
architectures do[1]) everything would work without intervention.
And if an app needs wider address space it would get it with hint opt-in,
because it is required on x86-64 anyway. Again, no RISC-V-specific code.
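
A minimal sketch of the hint opt-in mentioned above, assuming Linux on a
64-bit target; map_full_va() is a hypothetical helper name and the
choice of 1 << 48 as the hint is an assumption made for illustration.
Passing a hint above the 47-bit boundary asks the kernel to consider the
full address space. Whether the returned address actually lies above
47 bits depends on the hardware and kernel configuration, so the sketch
makes no claim about the result beyond success or failure.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Request an anonymous mapping with a hint above the 47-bit boundary.
 * The hint value is an assumption for illustration; on systems without
 * a wider address space the kernel simply falls back to a normal
 * address.
 */
static void *map_full_va(size_t len)
{
	void *hint = (void *)((uintptr_t)1 << 48);

	return mmap(hint, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```

Callers still compare the result against MAP_FAILED as usual. Code that
never passes a high hint keeps getting addresses below the default
limit.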

I see no upside with your approach. Just worse user experience.

[1] See va_high_addr_switch test case in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/mm/Makefile#n115
  