Re: [PATCH v1] proc: Implement /proc/self/meminfo

From: Shakeel Butt <hidden>
Date: 2021-06-18 23:39:04
Also in: cgroups, linux-fsdevel, lkml

On Wed, Jun 16, 2021 at 9:17 AM Eric W. Biederman [off-list ref] wrote:

Shakeel Butt [off-list ref] writes:

On Tue, Jun 15, 2021 at 5:47 AM Alexey Gladkov [off-list ref] wrote:

[...]

I made the second version of the patch [1], but then I had a conversation
with Eric W. Biederman offlist. He convinced me that it is a bad idea to
change all the values in meminfo to accommodate cgroups. But we agreed
that MemAvailable in /proc/meminfo should respect cgroups limits. This
field was created to hide implementation details when calculating
available memory. You can see that it is quite widely used [2].
So I want to try to move in that direction.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/legion/linux.git/log/?h=patchset/meminfo/v2.0
[2] https://codesearch.debian.net/search?q=MemAvailable%3A
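As a rough illustration of what a cgroup-aware MemAvailable could mean
for a reader in userspace, here is a minimal sketch. It assumes cgroup v2
and takes the contents of /proc/meminfo, memory.max and memory.current as
strings. The function names are hypothetical, and the min() rule is a
naive assumption, not what the patch does: it ignores memory that could be
reclaimed inside the cgroup.

```python
def parse_meminfo_kib(text, field="MemAvailable"):
    """Return one /proc/meminfo field in KiB, or None if it is absent."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == field:
            return int(rest.split()[0])
    return None


def cgroup_headroom_kib(memory_max, memory_current):
    """Headroom under a cgroup v2 limit in KiB; None means no limit ("max")."""
    if memory_max.strip() == "max":
        return None
    # memory.max and memory.current are reported in bytes.
    return max(0, int(memory_max) - int(memory_current)) // 1024


def capped_available_kib(meminfo_text, memory_max, memory_current):
    """Naive assumption: available = min(system MemAvailable, cgroup headroom)."""
    system = parse_meminfo_kib(meminfo_text)
    headroom = cgroup_headroom_kib(memory_max, memory_current)
    candidates = [v for v in (system, headroom) if v is not None]
    return min(candidates) if candidates else None
```

With a 1 GiB limit and 512 MiB in use, this reports the 512 MiB of
headroom even if the system-wide MemAvailable is far larger.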

Please see the following two links for the previous discussion about
having a per-memcg MemAvailable stat.

[1] https://lore.kernel.org/linux-mm/alpine.DEB.2.22.394.2006281445210.855265@chino.kir.corp.google.com/ (local)
[2] https://lore.kernel.org/linux-mm/alpine.DEB.2.23.453.2007142018150.2667860@chino.kir.corp.google.com/ (local)

MemAvailable itself is an imprecise metric, and involving memcg makes
it even weirder. The difference in swap accounting semantics between
v1 and v2 is one source of this weirdness (I have not checked whether
your patch handles it). Lazyfree and deferred split pages are another
source.

So, I am not sure if complicating an already imprecise metric will
make it more useful.

Making a good guess at how much memory can be allocated without
triggering swapping or otherwise stressing the system is something that
requires understanding our mm internals.

To be able to continue changing the mm or even mm policy without
introducing regressions in userspace we need to export values that
userspace can use.

The issue is the dependence of such exported values on mm internals.
MM internal code and policy changes will change these values, so there
is potential for userspace regressions.

At a first approximation that seems to look like MemAvailable.

MemAvailable seems to have a good definition.  Roughly the amount of
memory that can be allocated without triggering swapping.

Nowadays, I don't think MemAvailable giving "amount of memory that can
be allocated without triggering swapping" is even roughly accurate.
Actually, IMO "without triggering swap" is not something an application
should concern itself with, given that refaults from some swap types
(zswap/swap-on-zram) are much faster than refaults from disk.

Updated to include not triggering memory cgroup based swapping, and it
sounds good.

I don't know if it will work in practice but I think it is worth
exploring.

I agree.

I do know that hiding the implementation details and providing userspace
with information it can directly use seems like the programming model
that needs to be explored.  Most programs should not care if they are in
a memory cgroup, etc.  Programs, load management systems, and even
balloon drivers have a legitimate interest in how much additional load
can be placed on a system's memory.

How much additional load can be placed on a system *until what*. I
think we should focus more on the "until" part to make the problem
more tractable.

A version of this that I remember working fairly well is free space
on compressed filesystems.  As I recall compressed filesystems report
the amount of uncompressed space that is available (an underestimate).
This results in the amount of space consumed going up faster than the
free space goes down.

We can't do exactly the same thing with our memory usability estimate,
but having our estimate be a reliable underestimate might be enough
to avoid problems with reporting too much memory as available to
userspace.

I know that MemAvailable already does that /2 so maybe it is already
aiming at being an underestimate.  Perhaps we need some additional
accounting to help create a useful metric for userspace as well.

The real challenge here is that we are not 100% sure if a page is
reclaimable until we try to reclaim it. For example, we might have file
lrus filled with lazyfree pages which might have been accessed.
MemAvailable will show half the size of the file lrus, but once we try
to reclaim those pages we have to move them back to the anon lru, and
MemAvailable drops drastically.
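To make the lazyfree example concrete, here is a simplified model of the
estimate. It is a sketch based on my reading of si_mem_available(), not a
copy of it; the function and parameter names are mine, all values are in
pages, and the numbers in the usage note are made up.

```python
def estimate_available(free, totalreserve, active_file, inactive_file,
                       reclaimable_slab, wmark_low):
    """Simplified model of the system-wide MemAvailable estimate (pages)."""
    # Free memory that is not held back as reserves.
    available = free - totalreserve

    # Count the file LRUs, minus a share assumed to be needed to avoid
    # thrashing: at most half of them, capped at the low watermark.
    pagecache = active_file + inactive_file
    pagecache -= min(pagecache // 2, wmark_low)
    available += pagecache

    # Reclaimable slab gets the same treatment.
    available += reclaimable_slab - min(reclaimable_slab // 2, wmark_low)

    return max(available, 0)
```

With free=1000, totalreserve=100, active_file=400, inactive_file=600,
reclaimable_slab=200 and wmark_low=50 the model gives 2000 pages. If 500
of the inactive file pages are lazyfree pages that turn out to have been
accessed and go back to the anon lru (inactive_file=100), it gives 1500.
The estimate falls by the full 500 pages only after reclaim has looked
at them.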

I don't know the final answer.  I do know that not designing an
interface that userspace can use to deal with its legitimate concerns
is sticking our collective heads in the sand and wishing the problem
will go away.

I am a bit skeptical that a single interface would be enough, but first
we should formalize what exactly the application wants with some
concrete use-cases. More specifically, are the applications interested
in avoiding swapping, OOM, or stalls?

Second, is the reactive approach acceptable? Instead of an upfront
number representing the room for growth, how about just growing and
backing off when some event (OOM or stall) that we want to avoid is
about to happen? This is achievable today for OOM and stall with PSI
and memory.high, and it avoids the hard problem of reliably estimating
the reclaimable memory.
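A minimal sketch of the grow-and-back-off idea, assuming the application
can read PSI text in the usual "some"/"full" line format (for example from
/proc/pressure/memory or a cgroup's memory.pressure file). The function
names, the step size and the avg10 threshold are hypothetical. A real
policy would need tuning and would probably also use memory.high.

```python
def parse_psi(text):
    """Parse PSI text into {"some": {...}, "full": {...}} with float values."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, fields = parts[0], parts[1:]
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def next_target(current, step, psi_text, some_avg10_limit=5.0):
    """Grow by `step` while pressure is low, shrink by `step` otherwise.

    `some_avg10_limit` is an illustrative threshold (percent of time that
    some task stalled on memory over the last 10 seconds), not a
    recommended value.
    """
    pressure = parse_psi(psi_text).get("some", {}).get("avg10", 0.0)
    if pressure > some_avg10_limit:
        return max(current - step, 0)
    return current + step
```

The caller would invoke next_target() periodically with fresh PSI text and
resize its cache or working set towards the returned value.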
