Re: [PATCH v3 net-next 12/12] net-memcg: Decouple controlled memcg from global protocol memory accounting.
From: Kuniyuki Iwashima <kuniyu@google.com>
Date: 2025-08-13 18:19:47
Also in:
cgroups, linux-mm, mptcp
On Wed, Aug 13, 2025 at 12:11 AM Shakeel Butt [off-list ref] wrote:
> On Tue, Aug 12, 2025 at 05:58:30PM +0000, Kuniyuki Iwashima wrote:
> > Some protocols (e.g., TCP, UDP) implement memory accounting for socket buffers and charge memory to per-protocol global counters pointed to by sk->sk_proto->memory_allocated.
> >
> > When running under a non-root cgroup, this memory is also charged to the memcg as "sock" in memory.stat.
> >
> > Even when a memcg controls memory usage, sockets of such protocols are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> >
> > This makes it difficult to accurately estimate and configure appropriate global limits, especially in multi-tenant environments.
> >
> > If all workloads were guaranteed to be controlled under memcg, the issue could be worked around by setting tcp_mem[0~2] to UINT_MAX.
> >
> > In reality, this assumption does not always hold, and processes that belong to the root cgroup or opt out of memcg can consume memory up to the global limit, becoming a noisy neighbour.
>
> Processes running in root memcg (I am not sure what 'opt out of memcg' means)
Sorry, I should have clarified: by opt-out I meant memory.max == max (and the same for all ancestors, as you pointed out below), where memcg accounting works but imposes no limit.
> means admin has intentionally allowed scenarios where
Not really intentionally, but rather reluctantly, because the admin cannot rely on memory.max alone without setting tcp_mem to UINT_MAX. We should not disregard the root cause: the two memory accounting mechanisms are currently coupled.
> noisy neighbour situation can happen, so I am not really following your argument here.
So basically, what I meant here is that with tcp_mem=UINT_MAX, any process can become a noisy neighbour unnecessarily.
> > Let's decouple memcg from the global per-protocol memory accounting if it has a finite memory.max (!= "max").
>
> Why decouple only for some? (Also if you really want to check memcg limits, you need to check limits for all ancestors and not just the given memcg).
Oh, I assumed memory.max would be inherited by descendants.
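For reference, the ancestor point can be illustrated with a small userspace model. This is only a sketch, not kernel code: struct toy_memcg, toy_effective_max() and toy_is_bounded() are hypothetical names invented for the example. It assumes the cgroup v2 behaviour that a child's memory.max file is not copied from its parent, while the child's usage is still constrained by every ancestor's limit, so the effective bound is the smallest memory.max on the path to the root.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; this is NOT the kernel's struct mem_cgroup. */
#define TOY_MEMORY_MAX_UNLIMITED UINT64_MAX	/* stands for memory.max == "max" */

struct toy_memcg {
	const struct toy_memcg *parent;	/* NULL for the root cgroup */
	uint64_t memory_max;		/* bytes, or TOY_MEMORY_MAX_UNLIMITED */
};

/* Smallest memory.max found on the path from cg up to the root. */
uint64_t toy_effective_max(const struct toy_memcg *cg)
{
	uint64_t limit = TOY_MEMORY_MAX_UNLIMITED;

	for (; cg != NULL; cg = cg->parent)
		if (cg->memory_max < limit)
			limit = cg->memory_max;

	return limit;
}

/* True if the cgroup itself or any ancestor has a finite memory.max. */
bool toy_is_bounded(const struct toy_memcg *cg)
{
	return toy_effective_max(cg) != TOY_MEMORY_MAX_UNLIMITED;
}
```

In this model, a child whose own memory.max is "max" but whose parent has a finite limit is still reported as bounded, which is the case a check on the given memcg alone would miss.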
> Why not start with just two global options (maybe start with boot parameter)?
>
> Option 1: Existing behavior where memcg and global TCP accounting are coupled.
>
> Option 2: Completely decouple memcg and global TCP accounting i.e. use mem_cgroup_sockets_enabled to either do global TCP accounting or memcg accounting.
>
> Keep the option 1 default.
>
> I assume you want third option where a mix of these options can happen i.e. some sockets are only accounted to a memcg and some are accounted to both memcg and global TCP.
Yes, because usually not all memcgs have memory.max configured, and we do not want to allow unlimited TCP memory for them.

Option 2 works for processes in the root cgroup, but not for processes in a non-root cgroup with memory.max == max.

A good example is system processes managed by systemd, where we do not want to specify memory.max but do want a global seatbelt.

Note that this is how it works _now_, and we want to _preserve_ that case. Does this answer "why decouple only for some"?
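The three behaviours discussed here can be summarised in a toy decision function. This is a sketch of the policy as described in this thread, not the code in patch 11: enum toy_mode, struct toy_sock and toy_charge_targets() are hypothetical names, and memcg_bounded stands for "finite memory.max" however that ends up being determined (given memcg only, or including ancestors).

```c
#include <stdbool.h>

/* Hypothetical names modelling the options in this thread; not kernel code. */
enum toy_mode {
	TOY_COUPLED,			/* Option 1: existing behaviour */
	TOY_DECOUPLE_ALL,		/* Option 2: memcg or global, never both */
	TOY_DECOUPLE_IF_BOUNDED,	/* mixed: decouple only with finite memory.max */
};

struct toy_sock {
	bool in_nonroot_memcg;	/* socket belongs to a non-root cgroup */
	bool memcg_bounded;	/* assumed: memory.max != "max" */
};

struct toy_charge {
	bool memcg;	/* charged to the memcg ("sock" in memory.stat) */
	bool global;	/* charged to the per-protocol counter (tcp_mem limits) */
};

struct toy_charge toy_charge_targets(enum toy_mode mode, struct toy_sock sk)
{
	struct toy_charge c = { .memcg = sk.in_nonroot_memcg, .global = true };

	/* Root cgroup sockets: global accounting only, in every mode. */
	if (!sk.in_nonroot_memcg)
		return c;

	if (mode == TOY_DECOUPLE_ALL)
		c.global = false;
	else if (mode == TOY_DECOUPLE_IF_BOUNDED && sk.memcg_bounded)
		c.global = false;

	return c;
}
```

With these definitions, a systemd-style socket (non-root cgroup, memory.max == max) keeps the global seatbelt under TOY_DECOUPLE_IF_BOUNDED but loses it under TOY_DECOUPLE_ALL, which is the difference described above.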
> I would recommend to make that a followup patch series. Keep this series simple and non-controversial.
I can separate the series, but I'd like to check whether Option 2 is a must for you, or whether Meta has configured memory.max for all cgroups. I didn't think that was likely, but if there's a real use case, I'm happy to add a boot param. The only difference would be the boot param addition and the condition change in patch 11, so the simplicity won't change.