Thread: 58 messages, 12 authors, 2021-11-30

Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB

From: Johannes Weiner <hannes@cmpxchg.org>
Date: 2021-11-17 22:26:42
Also in: io-uring, linux-mm, lkml

On Mon, Nov 15, 2021 at 08:35:30PM -0800, Andrew Morton wrote:
On Sat, 6 Nov 2021 14:12:45 +0700 Ammar Faizi [off-list ref] wrote:
On 11/6/21 2:05 PM, Drew DeVault wrote:
Should I send a v2 or is this email sufficient:

Signed-off-by: Drew DeVault <redacted>
Oops, I missed akpm from the CC list. Added Andrew.

Cc: Andrew Morton <akpm@linux-foundation.org>
Ref: https://lore.kernel.org/io-uring/CFII8LNSW5XH.3OTIVFYX8P65Y@taiga/
Let's cc linux-mm as well.


Unfortunately I didn't know about this until Nov 4, which was formally
too late for 5.16.  I guess I could try to sneak it past Linus if
someone were to send me some sufficiently convincing words explaining
the urgency.

I'd also be interested in seeing feedback from the MM developers.

And a question: rather than messing around with a constant which will
need to be increased again in a couple of years, can we solve this once
and for all?  For example, permit root to set the system-wide
per-process max mlock size and depend upon initscripts to do this
appropriately.
My take is that as long as the kernel sets some limit by default on
this at all, it should be one that works for common workloads. Today
this isn't the case.

We've recently switched our initscripts at FB to set the default to
0.1% of total RAM. The impetus for this was a subtle but widespread
issue where we failed to mmap the PERF_COUNT_SW_TASK_CLOCK event
counter (perf event mmap also uses RLIMIT_MEMLOCK!) and silently fell
back to the much less efficient clock_gettime() syscall.

Because the failure mode was subtle and annoying, we didn't just want
to raise the limit, but raise it so that no reasonable application
would run into it, and only buggy or malicious ones would.

And IMO, that's really what rlimits should be doing: catching clearly
bogus requests, not trying to do fine-grained resource control. For
more reasonable overuse that ends up causing memory pressure, the OOM
killer will do the right thing since the pages still belong to tasks.

So 0.1% of the machine seemed like a good default formula for
that. And it would be a bit more future proof too.

On my 32G desktop machine, that would be 32M. For comparison, the
default process rlimit on that machine is ~120k, which comes out to
~2G worth of kernel stack, which also isn't reclaimable without OOM...
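
As a rough sketch of the arithmetic above: the helper names are hypothetical, and the 16 KiB kernel stack per task is an assumption (typical for x86-64) that the message does not state.

```python
import os

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def memlock_default(total_ram_bytes):
    """Return 0.1% of total RAM in bytes, using integer arithmetic."""
    return total_ram_bytes // 1000


def total_ram_bytes():
    """Physical RAM as reported by sysconf (available on Linux)."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")


def kernel_stack_bytes(nproc_limit, stack_size=16 * KIB):
    """Worst-case kernel stack memory if nproc_limit tasks exist.

    stack_size is an assumption (16 KiB per task), not a value
    taken from the thread.
    """
    return nproc_limit * stack_size


if __name__ == "__main__":
    # A 32 GiB machine: 0.1% comes out to a little under 33 MiB.
    print(memlock_default(32 * GIB) / MIB)
    # ~120k tasks at 16 KiB each: a little under 2 GiB.
    print(kernel_stack_bytes(120_000) / GIB)
    # The same formula applied to the machine this runs on.
    print(memlock_default(total_ram_bytes()) / MIB)
```

An initscript following this scheme would compute the value once at boot and apply it as the default RLIMIT_MEMLOCK; how that is wired up is distribution-specific and not shown here.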
From: Drew DeVault <redacted>
Subject: Increase default MLOCK_LIMIT to 8 MiB

This limit has not been updated since 2008, when it was increased to 64
KiB at the request of GnuPG.  Until recently, the main use-cases for this
feature were (1) preventing sensitive memory from being swapped, as in
GnuPG's use-case; and (2) real-time use-cases.  In the first case, little
memory is called for, and in the second case, the user is generally in a
position to increase it if they need more.

The introduction of IORING_REGISTER_BUFFERS adds a third use-case:
preparing fixed buffers for high-performance I/O.  This use-case will take
as much of this memory as it can get, but is still limited to 64 KiB by
default, which is very little.  This increases the limit to 8 MiB, which
was chosen fairly arbitrarily as a more generous, but still conservative,
default value.
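
To put the two limits in perspective, a small sketch of how many pages each allows to be locked. The 4 KiB page size and the helper name are assumptions for illustration only.

```python
KIB = 1024
MIB = 1024 * KIB


def pages_under_limit(limit_bytes, page_size=4 * KIB):
    """Number of whole pages that fit under a locked-memory limit.

    page_size defaults to 4 KiB, which is an assumption; other
    architectures and configurations use larger pages.
    """
    return limit_bytes // page_size


if __name__ == "__main__":
    print(pages_under_limit(64 * KIB))  # old default
    print(pages_under_limit(8 * MIB))   # proposed default
```

With 4 KiB pages the old default covers 16 pages and the proposed one 2048, so a single application registering a handful of modest I/O buffers can exhaust the old limit.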

It is also possible to raise this limit in userspace.  This is easily
done, for example, in the use-case of a network daemon: systemd, for
instance, provides for this via LimitMEMLOCK in the service file; OpenRC
via the rc_ulimit variables.  However, there is no established userspace
facility for configuring this outside of daemons: end-user applications do
not presently have access to a convenient means of raising their limits.
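
For illustration, this is roughly what an unprivileged application can do on its own: raise its soft limit, but only as far as the hard limit it inherited. The sketch uses Python's standard `resource` module; the function names are hypothetical.

```python
import resource


def clamp_soft(wanted, soft, hard, infinity=resource.RLIM_INFINITY):
    """Pick a new soft limit: never below the current one, never above hard."""
    if soft == infinity:
        return soft                      # already unlimited
    if hard == infinity:
        return max(soft, wanted)         # no ceiling to respect
    return max(soft, min(wanted, hard))  # capped by the hard limit


def raise_memlock_soft(wanted):
    """Try to raise the RLIMIT_MEMLOCK soft limit to `wanted` bytes.

    Returns the soft limit actually in effect afterwards, which may
    be smaller than `wanted` if the hard limit is lower.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    new_soft = clamp_soft(wanted, soft, hard)
    if new_soft != soft:
        resource.setrlimit(resource.RLIMIT_MEMLOCK, (new_soft, hard))
    return new_soft


if __name__ == "__main__":
    granted = raise_memlock_soft(8 * 1024 * 1024)
    print("soft RLIMIT_MEMLOCK now:", granted)
```

If the hard limit is also at the 64 KiB default, the call changes nothing, and raising the hard limit requires privilege or configuration outside the application.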

The buck, as it were, stops with the kernel.  It's much easier to address
it here than it is to bring it to hundreds of distributions, and it can
only realistically be relied upon to be high enough by end-user software
if it is more or less ubiquitous.  Most distros don't change this
particular rlimit from the kernel-supplied default value, so a change here
will easily provide that ubiquity.

Link: https://lkml.kernel.org/r/20211028080813.15966-1-sir@cmpwn.com
Signed-off-by: Drew DeVault <redacted>
Acked-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Cyril Hrubis <chrubis@suse.cz>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>

As per above, I think basing it off of RAM size would be better, but
this increase is overdue given all the new users beyond mlock(), and
8M is much better than the current value.