Re: [PATCH v3 2/4] mm/oom: handle remote ooms

From: Mina Almasry <hidden>
Date: 2021-11-16 21:27:48
Also in: cgroups, linux-fsdevel

On Tue, Nov 16, 2021 at 3:29 AM Michal Hocko [off-list ref] wrote:

On Tue 16-11-21 02:17:09, Mina Almasry wrote:

quoted

On Tue, Nov 16, 2021 at 1:28 AM Michal Hocko [off-list ref] wrote:

quoted

On Mon 15-11-21 16:58:19, Mina Almasry wrote:

quoted

On Mon, Nov 15, 2021 at 2:58 AM Michal Hocko [off-list ref] wrote:

quoted

On Fri 12-11-21 09:59:22, Mina Almasry wrote:

quoted

On Fri, Nov 12, 2021 at 12:36 AM Michal Hocko [off-list ref] wrote:

quoted

On Fri 12-11-21 00:12:52, Mina Almasry wrote:

quoted

On Thu, Nov 11, 2021 at 11:52 PM Michal Hocko [off-list ref] wrote:

quoted

On Thu 11-11-21 15:42:01, Mina Almasry wrote:

quoted

On remote ooms (OOMs due to remote charging), the oom-killer will attempt
to find a task to kill in the memcg under oom. If the oom-killer
is unable to find one, it should simply return ENOMEM to the
allocating process.

This really begs for some justification.

I'm thinking (and I can add to the commit message in v4) that we have
2 reasonable options when the oom-killer gets invoked and finds
nothing to kill: (1) return ENOMEM, (2) kill the allocating task. I'm
thinking returning ENOMEM allows the application to gracefully handle
the failure to remote charge and continue operation.

For example, in the network service use case that I mentioned in the
RFC proposal, it's beneficial for the network service to get an ENOMEM
and continue to service network requests for other clients running on
the machine, rather than get oom-killed when hitting the remote memcg
limit. But, this is not a hard requirement, the network service could
fork a process that does the remote charging to guard against the
remote charge bringing down the entire process.

This all belongs in the changelog so that we can discuss all potential
implications and do not rely on any implicit assumptions.

Understood. Maybe I'll wait to collect more feedback and upload v4
with a thorough explanation of the thought process.

quoted

E.g. why does
it even make sense to kill a task in the origin cgroup?

The behavior I saw returning ENOMEM for this edge case was that the
code was forever looping the pagefault, and I was (seemingly
incorrectly) under the impression that a suggestion to forever loop
the pagefault would be completely fundamentally unacceptable.

Well, I have to say I am not entirely sure what is the best way to
handle this situation. Another option would be to treat this similarly to
an ENOSPC situation. This would result in SIGBUS IIRC.

The main problem with the OOM killer is that it will not resolve the
underlying problem in most situations. Shmem files would likely stay
lying around and their charge along with them. Killing the allocating
task has problems of its own because this could be just a DoS vector by
other unrelated tasks sharing the shmem mount point without a graceful
fallback. Retrying the page fault is hard to detect. SIGBUS might be
something that helps with the latter. The question is how to communicate
this requirement down to the memcg code so it knows that memory reclaim
should happen (Should it? How hard should we try?) but that the oom
killer should not be invoked. The more I think about this the nastier this is.

So actually I thought the ENOSPC suggestion was interesting so I took
the liberty to prototype it. The changes required:

1. In out_of_memory() we return false if !oc->chosen &&
is_remote_oom(). This gets bubbled up to try_charge_memcg() as
mem_cgroup_oom() returning OOM_FAILED.
2. In try_charge_memcg(), if we get an OOM_FAILED we again check
is_remote_oom(). If it is a remote oom, return ENOSPC.
3. The calling code would return ENOSPC to the user in the no-fault
path, and SIGBUS the user in the fault path with no changes.
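
The three steps above can be sketched as a small userspace model. This is not kernel code: struct charge_model, model_out_of_memory(), model_try_charge() and model_fault() are hypothetical names made up for illustration, and the model assumes that a successful kill always frees enough memory for the charge to succeed. It also assumes the fault path turns -ENOMEM into an OOM result and any other error into SIGBUS.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: every name below is hypothetical, not kernel code. */
struct charge_model {
    bool remote;      /* charge targets a memcg other than the caller's */
    bool has_victim;  /* the target memcg contains a killable task */
};

enum fault_result { FAULT_OK, FAULT_OOM, FAULT_SIGBUS };

/*
 * Step 1: report failure when there is nothing to kill. The steps above
 * only describe the remote case; treating the local case the same way
 * is an assumption of this model.
 */
static bool model_out_of_memory(const struct charge_model *m)
{
    assert(m != NULL);
    return m->has_victim;
}

/* Step 2: a failed OOM on a remote charge becomes -ENOSPC. */
static int model_try_charge(const struct charge_model *m)
{
    if (model_out_of_memory(m))
        return 0; /* assumption: the kill freed enough memory */
    return m->remote ? -ENOSPC : -ENOMEM;
}

/* Step 3, fault path: -ENOMEM maps to an OOM result, anything else to SIGBUS. */
static enum fault_result model_fault(const struct charge_model *m)
{
    int err = model_try_charge(m);

    if (err == 0)
        return FAULT_OK;
    return err == -ENOMEM ? FAULT_OOM : FAULT_SIGBUS;
}
```

In this model a remote charge with no victim yields -ENOSPC in the no-fault path and FAULT_SIGBUS in the fault path, while a local charge keeps the existing -ENOMEM behavior.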

I think this should be implemented at the caller side rather than
somehow hacked into the memcg core. It is the caller to know what to do.
The caller can use gfp flags to control the reclaim behavior.

Hmm, I'm struggling a bit to envision this. So would it be acceptable
at the call sites where we're doing a remote charge, such as
shmem_add_to_page_cache(), if we get ENOMEM from
mem_cgroup_charge(), and we know we're doing a remote charge (because
current's memcg != the super block memcg), then we return ENOSPC from
shmem_add_to_page_cache()? I believe that will return ENOSPC to
userspace in the non-pagefault path and SIGBUS in the pagefault path.
Or did you have something else in mind?

Yes, exactly. I meant that all this special casing would be done at the
shmem layer as it knows how to communicate this use case.
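
A minimal userspace sketch of that caller-side special casing might look like the following. struct memcg_model and model_shmem_charge_result() are hypothetical names, not existing kernel symbols, and a NULL sb_memcg is assumed to stand for a mount that was not given memcg=. The memcg core is left untouched: only the caller decides that ENOMEM on a remote charge should be reported as ENOSPC.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg; only pointer identity matters here. */
struct memcg_model {
    int id;
};

/*
 * Caller-side translation. charge_err is whatever the charge attempt
 * returned: 0 on success or a negative errno.
 */
static int model_shmem_charge_result(int charge_err,
                                     const struct memcg_model *current_memcg,
                                     const struct memcg_model *sb_memcg)
{
    assert(charge_err <= 0);

    /* Assumption: sb_memcg == NULL models a mount without memcg=. */
    bool remote = sb_memcg != NULL && sb_memcg != current_memcg;

    /* Only ENOMEM on a remote charge is rewritten; all else passes through. */
    if (charge_err == -ENOMEM && remote)
        return -ENOSPC;
    return charge_err;
}
```

Keeping the check in one helper like this means a local charge failure still surfaces as ENOMEM, so existing callers that are not remote chargers would see no change.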

Awesome. The more I think of it I think the ENOSPC handling is perfect
for this use case, because it gives all users of the shared memory and
remote chargers a chance to gracefully handle the ENOSPC or the SIGBUS
when we hit the nothing to kill case. The only issue is finding a
clean implementation, and if the implementation I just proposed sounds
good to you then I see no issues and I'm happy to submit this in the
next version. Shakeel and others I would love to know what you think
either now or when I post the next version.
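
To illustrate what "gracefully handle the ENOSPC" could look like for a remote charger in the non-pagefault path, here is a hedged userspace sketch. enum store_result, classify_write_errno() and store_for_client() are hypothetical names. The idea is that a service writing into the shared tmpfs treats ENOSPC (or ENOMEM) as "reject this one request" rather than as a fatal error, so it can keep serving other clients. The pagefault path would additionally need a SIGBUS handler, which this sketch does not cover.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

enum store_result { STORE_OK, STORE_REJECTED, STORE_FATAL };

/* Decide whether a failed write is a per-request or a fatal failure. */
static enum store_result classify_write_errno(int err)
{
    switch (err) {
    case ENOSPC: /* e.g. the remote memcg hit its limit with nothing to kill */
    case ENOMEM:
        return STORE_REJECTED;
    default:
        return STORE_FATAL;
    }
}

/* Write the whole buffer to fd, retrying on EINTR and on short writes. */
static enum store_result store_for_client(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return classify_write_errno(errno);
        }
        if (n == 0)
            return STORE_FATAL; /* no progress; avoid looping forever */
        p += n;
        len -= (size_t)n;
    }
    return STORE_OK;
}
```

A caller would drop or defer the request on STORE_REJECTED and only tear down the service on STORE_FATAL.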

[...]

quoted

And just a small clarification. Tmpfs is fundamentally problematic from
the OOM handling POV. The nuance here is that the OOM happens in a
different memcg and thus a different resource domain. If you kill a task
in the target memcg then you effectively DoS that workload. If you kill
the allocating task then it is DoSed by anybody allowed to write to that
shmem. All that without a graceful fallback.

I don't know if this addresses your concern, but I'm limiting the
memcg= use to processes that can enter that memcg. Therefore they
would be able to allocate memory in that memcg anyway by entering it.
So if they wanted to intentionally DoS that memcg they can already do
it without this feature.

Can you elaborate some more? How do you enforce that the mount point
cannot be accessed by anybody outside of that constraint?

So if I'm a bad actor that wants to intentionally DoS random memcgs on
the system I can:

mount -t tmpfs -o memcg=/sys/fs/cgroup/unified/memcg-to-dos tmpfs /mnt/tmpfs
cat /dev/random > /mnt/tmpfs/file

That will reliably DoS the container. But we only allow you to mount
with memcg=/sys/fs/cgroup/unified/memcg-to-dos if you're able to enter
that memcg, so I can just do:

echo $$ > /sys/fs/cgroup/unified/memcg-to-dos/cgroup.procs
allocate_infinite_memory()

So we haven't added an attack vector really. More reasonably a sys
admin will set up a tmpfs mount with
memcg=/sys/fs/cgroup/unified/shared-memory-owner, and set the limit of
the shared-memory-owner to be big enough to handle the tasks running
in that memcg _and_ all the shared memory. The sys admin can also
limit the tmpfs with the size= option to limit the total size of the
shared memory. I think the sys admin could also set permissions on the
mount so only the users that share the memory can read/write, etc.

I'm sorry if this wasn't clear before and I'll take a good look at the
commit messages I'm writing and put as much info as possible in each.

As always thank you very much for your review and feedback.

--
Michal Hocko
SUSE Labs
