Re: [PATCH] mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges
From: Shakeel Butt <hidden>
Date: 2019-10-23 17:38:53
Also in:
cgroups, linux-mm, lkml
On Wed, Oct 23, 2019 at 8:46 AM Johannes Weiner [off-list ref] wrote:
On Wed, Oct 23, 2019 at 08:40:12AM +0200, Michal Hocko wrote:quoted
On Tue 22-10-19 19:37:08, Johannes Weiner wrote:quoted
While upgrading from 4.16 to 5.2, we noticed these allocation errors in the log of the new kernel: [ 8642.253395] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC) [ 8642.269170] cache: tw_sock_TCPv6(960:helper-logs), object size: 232, buffer size: 240, default order: 1, min order: 0 [ 8642.293009] node 0: slabs: 5, objs: 170, free: 0 slab_out_of_memory+1 ___slab_alloc+969 __slab_alloc+14 kmem_cache_alloc+346 inet_twsk_alloc+60 tcp_time_wait+46 tcp_fin+206 tcp_data_queue+2034 tcp_rcv_state_process+784 tcp_v6_do_rcv+405 __release_sock+118 tcp_close+385 inet_release+46 __sock_release+55 sock_close+17 __fput+170 task_work_run+127 exit_to_usermode_loop+191 do_syscall_64+212 entry_SYSCALL_64_after_hwframe+68 accompanied by an increase in machines going completely radio silent under memory pressure.This is really worrying because that suggests that something depends on GFP_ATOMIC allocation which is fragile and broken.I don't think that is true. You cannot rely on a *single instance* of atomic allocations to succeed. But you have to be able to rely on that failure is temporary and there is a chance of succeeding eventually. Network is a good example. It retries transmits, but within reason. If you aren't able to process incoming packets for minutes, you might as well be dead.quoted
quoted
One thing that changed since 4.16 is e699e2c6a654 ("net, mm: account sock objects to kmemcg"), which made these slab caches subject to cgroup memory accounting and control. The problem with that is that cgroups, unlike the page allocator, do not maintain dedicated atomic reserves. As a cgroup's usage hovers at its limit, atomic allocations - such as done during network rx - can fail consistently for extended periods of time. The kernel is not able to operate under these conditions. We don't want to revert the culprit patch, because it indeed tracks a potentially substantial amount of memory used by a cgroup. We also don't want to implement dedicated atomic reserves for cgroups. There is no point in keeping a fixed margin of unused bytes in the cgroup's memory budget to accomodate a consumer that is impossible to predict - we'd be wasting memory and get into configuration headaches, not unlike what we have going with min_free_kbytes. We do this for physical mem because we have to, but cgroups are an accounting game. Instead, account these privileged allocations to the cgroup, but let them bypass the configured limit if they have to. This way, we get the benefits of accounting the consumed memory and have it exert pressure on the rest of the cgroup, but like with the page allocator, we shift the burden of reclaimining on behalf of atomic allocations onto the regular allocations that can block.On the other hand this would allow to break the isolation by an unpredictable amount. Should we put a simple cap on how much we can go over the limit. If the memcg limit reclaim is not able to keep up with those overflows then even __GFP_ATOMIC allocations have to fail. What do you think?I don't expect a big overrun in practice, and it appears that Google has been letting even NOWAIT allocations pass through without isolation issues.
We have been overcharging for __GFP_HIGH allocations for couple of years and see no isolation issues in the production.
Likewise, we have been force-charging the skmem for a while now and it hasn't been an issue for reclaim to keep up. My experience from production is that it's a whole lot easier to debug something like a memory.max overrun than it is to debug a machine that won't respond to networking. So that's the side I would err on.