Re: [RFC Proposal] Deterministic memcg charging for shared memory

From: Mina Almasry <hidden>
Date: 2021-10-18 14:32:12
Also in: linux-fsdevel, linux-mm

On Mon, Oct 18, 2021 at 6:33 AM Michal Hocko [off-list ref] wrote:

On Wed 13-10-21 12:23:19, Mina Almasry wrote:

Below is a proposal for deterministic charging of shared memory.
Please take a look and let me know if there are any major concerns:

Problem:
Currently shared memory is charged to the memcg of the allocating
process. This makes memory usage of processes accessing shared memory
a bit unpredictable since whichever process accesses the memory first
will get charged. We have a number of use cases where our userspace
would like deterministic charging of shared memory:

1. System services allocating memory for client jobs:
We have services (namely a network access service[1]) that provide
functionality for clients running on the machine and allocate memory
to carry out these services. The memory usage of these services
depends on the number of jobs running on the machine and the nature of
the requests made to the service, which makes the memory usage of
these services hard to predict and thus hard to limit via memory.max.
These system services would like a way to allocate memory and instruct
the kernel to charge this memory to the client’s memcg.

2. Shared filesystem between subtasks of a large job
Our infrastructure has large meta jobs such as kubernetes which spawn
multiple subtasks which share a tmpfs mount. These jobs and their
subtasks use that tmpfs mount for various purposes such as sharing
data or persisting data across subtask restarts. In kubernetes
terminology, the meta job is similar to a pod and the subtasks are
containers under the pod. We want the shared memory to be
deterministically charged to the kubernetes pod, independent of
the lifetime of the containers under the pod.

3. Shared libraries and language runtimes shared between independent jobs.
We’d like to optimize memory usage on the machine by sharing libraries
and language runtimes of many of the processes running on our machines
in separate memcgs. This produces a side effect that one job may be
unlucky to be the first to access many of the libraries and may get
oom killed as all the cached files get charged to it.

Design:
My rough proposal to solve this problem is to simply add a
‘memcg=/path/to/memcg’ mount option for filesystems (namely tmpfs),
directing all the memory of the file system to be ‘remote charged’ to
the cgroup provided by that memcg= option.

Could you be more specific about how this matches the above mentioned
usecases?

For the use cases I've listed respectively:
1. Our network service would mount a tmpfs with 'memcg=<path to
client's memcg>'. Any memory the service is allocating on behalf of
the client, the service will allocate inside of this tmpfs mount, thus
charging it to the client's memcg without risk of hitting the
service's limit.
2. The large job (kubernetes pod) would mount a tmpfs with
'memcg=<path to large job's memcg>'. It will then share this tmpfs
mount with the subtasks (containers in the pod). The subtasks can then
allocate memory in the tmpfs, having it charged to the kubernetes job,
without risk of hitting the container's limit.
3. We would need to extend this functionality to other file systems on
persistent disk, then mount that file system with 'memcg=<dedicated
shared library memcg>'. Jobs can then use the shared library and any
memory allocated due to loading the shared library is charged to a
dedicated memcg, and not charged to the job using the shared library.
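
As a rough illustration of the mount step in the use cases above, here
is a minimal userspace sketch. It assumes the 'memcg=' option from this
proposal, which is not an existing tmpfs option, so the mount(2) call
would fail on a kernel without the feature. The helper names and any
cgroup path passed in are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Build the tmpfs option string "memcg=<path>".
 * "memcg=" is the option proposed in this thread, not an existing one.
 * Returns 0 on success, -1 if the path is empty or does not fit in buf.
 */
int build_tmpfs_opts(char *buf, size_t len, const char *memcg_path)
{
    assert(buf != NULL && memcg_path != NULL);
    if (strlen(memcg_path) == 0)
        return -1;
    int n = snprintf(buf, len, "memcg=%s", memcg_path);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Mount a tmpfs at target whose memory would be remote charged to
 * memcg_path. Returns the mount(2) result, or -1 if the options
 * could not be built.
 */
int mount_remote_charged_tmpfs(const char *target, const char *memcg_path)
{
    char opts[4096];

    if (build_tmpfs_opts(opts, sizeof(opts), memcg_path) != 0)
        return -1;
    return mount("tmpfs", target, "tmpfs", 0, opts);
}
```

In use case 1 the service would call mount_remote_charged_tmpfs() once
per client with that client's memcg path; in use case 2 the pod would
call it once and share the mount point with its containers.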

What would/should happen if the target memcg doesn't exist, or stops
existing under the remote charger's feet?

My thinking is that the tmpfs acts as a charge target to the memcg and
blocks the memcg from being removed until the tmpfs mount is
unmounted, similarly to when a user tries to rmdir a memcg with some
processes still attached to it. But I don't feel strongly about this,
and I'm happy to go with another approach if you have a strong opinion
about this.
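
The pinning idea above can be modelled in a few lines. This is a toy
userspace model, not kernel code: the struct and function names are
made up for illustration, and the only behavior it encodes is that
removal fails with EBUSY while either tasks or remote-charging mounts
still reference the memcg.

```c
#include <assert.h>
#include <errno.h>

/* Toy model of a memcg that can be pinned by tmpfs mounts. */
struct memcg_model {
    int nr_tasks;          /* processes attached to the memcg */
    int nr_charge_mounts;  /* mounts using this memcg as charge target */
    int removed;           /* set once rmdir has succeeded */
};

/* A new memcg= mount takes a reference on its charge target. */
int memcg_model_mount(struct memcg_model *cg)
{
    if (cg->removed)
        return -ENOENT;
    cg->nr_charge_mounts++;
    return 0;
}

/* Unmounting drops the reference. */
void memcg_model_umount(struct memcg_model *cg)
{
    assert(cg->nr_charge_mounts > 0);
    cg->nr_charge_mounts--;
}

/* rmdir is refused while tasks or charging mounts remain. */
int memcg_model_rmdir(struct memcg_model *cg)
{
    if (cg->nr_tasks > 0 || cg->nr_charge_mounts > 0)
        return -EBUSY;
    cg->removed = 1;
    return 0;
}
```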

Caveats:
1. One complication to address is the behavior when the target memcg
hits its memory.max limit because of remote charging. In this case the
oom-killer will be invoked, but the oom-killer may not find anything
to kill in the target memcg being charged. In this case, I propose
simply failing the remote charge which will cause the process
executing the remote charge to get an ENOMEM. This will be documented
behavior of remote charging.

Say you are in a page fault (#PF) path. If you just return ENOMEM then
you will get a system wide OOM killer via pagefault_out_of_memory. This
is very likely not something you want, right? Even if we remove this
behavior, which is another story, the #PF would have no option other
than to keep retrying, which doesn't really look great either.

The only "reasonable" way I can see right now is to kill the remote
charging task. That might result in some other problems though.

Yes! That's exactly what I was thinking, and from discussions with
userspace folks interested in this it doesn't seem like a problem.
We'd kill the remote charging task and make it clear in the
documentation that this is the behavior and the userspace is
responsible for working around that.
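
To make the agreed behavior concrete, here is a small decision model of
what happens when a remote charge would push the target over its limit.
It is only a sketch under simplifying assumptions: reclaim is ignored,
the names are hypothetical, and "killing" is reduced to a return code.

```c
#include <assert.h>

enum charge_result {
    CHARGE_OK,                 /* charge fits under the limit */
    CHARGE_OOM_KILL_IN_TARGET, /* oom-killer has a victim in the target */
    CHARGE_KILL_CHARGER        /* nothing to kill: kill the remote charger */
};

/* Toy model of the memcg being remote charged. */
struct target_model {
    unsigned long usage;  /* pages currently charged */
    unsigned long max;    /* memory.max expressed in pages */
    int nr_tasks;         /* killable tasks inside the target memcg */
};

/* Decide the outcome of remote charging nr_pages to t. */
enum charge_result remote_charge(struct target_model *t, unsigned long nr_pages)
{
    assert(t->usage <= t->max);
    if (nr_pages <= t->max - t->usage) {
        t->usage += nr_pages;
        return CHARGE_OK;
    }
    if (t->nr_tasks > 0)
        return CHARGE_OOM_KILL_IN_TARGET;
    return CHARGE_KILL_CHARGER;
}
```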

Worthy of mention is that if processes A and B are sharing memory via
a tmpfs, they can set memcg=<common ancestor memcg of A and B>. Thus
the memory is charged to a common ancestor of memcgs A and B and if
the common ancestor hits its limit the oom-killer will get invoked and
should always find something to kill. This will also be documented and
the userspace can choose to go this route if they don't want to risk
being killed on pagefault.
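
For userspace that wants to go the common-ancestor route, picking the
memcg= value could look roughly like the helper below. It is a sketch
that works purely on path strings and assumes both inputs are absolute,
normalized cgroup paths with no trailing slash; the function name and
the example paths are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Write the deepest common ancestor of cgroup paths a and b into out.
 * Only whole path components are compared, so "/job" and "/jobs"
 * share just the root. Returns 0 on success, -1 if out is too small.
 */
int common_ancestor(const char *a, const char *b, char *out, size_t len)
{
    size_t i = 0;
    size_t last = 0;  /* length of the common prefix at a component boundary */

    assert(a != NULL && b != NULL && out != NULL);
    assert(a[0] == '/' && b[0] == '/');

    for (;;) {
        int a_end = (a[i] == '\0' || a[i] == '/');
        int b_end = (b[i] == '\0' || b[i] == '/');

        if (a_end && b_end && i > 0)
            last = i;
        if (a[i] == '\0' || b[i] == '\0' || a[i] != b[i])
            break;
        i++;
    }

    if (last == 0) {  /* nothing in common below the root */
        if (len < 2)
            return -1;
        strcpy(out, "/");
        return 0;
    }
    if (len <= last)
        return -1;
    memcpy(out, a, last);
    out[last] = '\0';
    return 0;
}
```

With processes in "/job/A" and "/job/B" this yields "/job", which is
the memcg whose limit would then trigger an ordinary oom-kill.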

2. I would like to provide an initial implementation that adds this
support for tmpfs, while leaving the implementation generic enough for
myself or others to extend to more filesystems where they find the
feature useful.

How do you envision other filesystems would implement that? Should the
information be persisted in some way?

Yes, my initial implementation has a struct memcg* hanging off the
super block that is the memcg to charge, but I can move it if there is
somewhere else you feel is appropriate once I send out the patches.
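
The shape of that is roughly the following. This is a userspace
sketch, not the actual patch: the struct names, the s_memcg field name
and the helpers are placeholders, and the real kernel types differ.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's memcg object. */
struct memcg_sketch {
    const char *name;
    unsigned long charged_pages;
};

/* Stand-in for a super block carrying an optional charge target. */
struct sb_sketch {
    struct memcg_sketch *s_memcg;  /* NULL when mounted without memcg= */
};

/* Pick who pays: the mount's memcg if set, else the current task's. */
struct memcg_sketch *charge_target(const struct sb_sketch *sb,
                                   struct memcg_sketch *current_memcg)
{
    assert(sb != NULL && current_memcg != NULL);
    return sb->s_memcg ? sb->s_memcg : current_memcg;
}

/* Account nr pages of a file on sb to the chosen memcg. */
void charge_pages(const struct sb_sketch *sb,
                  struct memcg_sketch *current_memcg, unsigned long nr)
{
    charge_target(sb, current_memcg)->charged_pages += nr;
}
```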

I didn't have time to give this a lot of thought and more questions will
likely come. My initial reaction is that this will open a lot of
interesting corner cases which will be hard to deal with.

Thank you very much for your review so far and please let me know if
you think of any more issues. My feeling is that hitting the remote
memcg limit and the oom-killing behavior surrounding that is by far
the most contentious issue. You don't seem completely revolted by what
I'm proposing there so I'm somewhat optimistic we can deal with the
rest of the corner cases :-)

--
Michal Hocko
SUSE Labs
