Re: [PATCH v3 net-next 12/12] net-memcg: Decouple controlled memcg from... | netdev

Re: [PATCH v3 net-next 12/12] net-memcg: Decouple controlled memcg from global protocol memory accounting.

From: Shakeel Butt <shakeel.butt@linux.dev>
Date: 2025-08-13 20:54:12
Also in: bpf, cgroups, linux-mm, mptcp

On Wed, Aug 13, 2025 at 11:19:31AM -0700, Kuniyuki Iwashima wrote:

On Wed, Aug 13, 2025 at 12:11 AM Shakeel Butt [off-list ref] wrote:

quoted

On Tue, Aug 12, 2025 at 05:58:30PM +0000, Kuniyuki Iwashima wrote:

quoted

Some protocols (e.g., TCP, UDP) implement memory accounting for socket
buffers and charge memory to per-protocol global counters pointed to by
sk->sk_proto->memory_allocated.

When running under a non-root cgroup, this memory is also charged to the
memcg as "sock" in memory.stat.

Even when a memcg controls memory usage, sockets of such protocols are
still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).

This makes it difficult to accurately estimate and configure appropriate
global limits, especially in multi-tenant environments.

If all workloads were guaranteed to be controlled under memcg, the issue
could be worked around by setting tcp_mem[0~2] to UINT_MAX.

In reality, this assumption does not always hold, and processes that
belong to the root cgroup or opt out of memcg can consume memory up to
the global limit, becoming a noisy neighbour.

Processes running in root memcg (I am not sure what does 'opt out of
memcg means')

Sorry, I should've clarified memory.max==max (and same
up to all ancestors as you pointed out below) as opt-out,
where memcg works but has no effect.

quoted

means admin has intentionally allowed scenarios where

Not really intentionally, but rather reluctantly because the admin
cannot guarantee memory.max solely without tcp_mem=UINT_MAX.
We should not disregard the cause that the two mem accounting are
coupled now.

quoted

noisy neighbour situation can happen, so I am not really following your
argument here.

So basically here I meant with tcp_mem=UINT_MAX any process
can be noisy neighbour unnecessarily.

Only if there are processes in cgroups with unlimited memory limits.

I think you are still missing the point. So, let me be very clear:

Please stop using the "processes in cgroup with memory.max==max can be
source of isolation issues" argument. Having unlimited limit means you
don't want isolation. More importantly you don't really need this
argument for your work. It is clear (to me at least) that we want global
TCP memory accounting to be decoupled from memcg accounting. Using the
flawed argument is just making your series controversial.

[...]

quoted

Why not start with just two global options (maybe start with boot
parameter)?

Option 1: Existing behavior where memcg and global TCP accounting are
coupled.

Option 2: Completely decouple memcg and global TCP accounting i.e. use
mem_cgroup_sockets_enabled to either do global TCP accounting or memcg
accounting.

Keep the option 1 default.

I assume you want third option where a mix of these options can happen
i.e. some sockets are only accounted to a memcg and some are accounted
to both memcg and global TCP.

Yes because usually not all memcg have memory.max configured
and we do not want to allow unlimited TCP memory for them.

Option 2 works for processes in the root cgroup but doesn't for
processes in non-root cgroup with memory.max == max.

A good example is system processes managed by systemd where
we do not want to specify memory.max but want a global seatbelt.

Note this is how it works _now_, and we want to _preserve_ the case.
Does this make sense  ? > why decouple only for some

I hope I am very clear to stop using the memory.max == max argument.

quoted

I would recommend to make that a followup
patch series. Keep this series simple and non-controversial.

I can separate the series, but I'd like to make sure the
Option 2 is a must for you or Meta configured memory.max
for all cgroups ?  I didn't think it's likely but if there's a real
use case, I'm happy to add a boot param.

The only diff would be boot param addition and the condition
change in patch 11 so simplicity won't change.

I am not sure if option 2 will be used by Meta or someone else, so no
objection from me to not pursue it. However I don't want some possibly
userspace policy to opt-in in one or the other accounting mechanism in
the kernel.

What I think is the right approach is to have BPF struct ops based
approach with possible callback 'is this socket under pressure' or maybe
'is this socket isolated' and then you can do whatever you want in those
callbacks. In this way your can follow the same approach of caching the
result in kernel (lower bits of sk->sk_memcg).

I am CCing bpf list to get some suggestions or concerns on this
approach.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help