Re: [PATCH] net/sock: move memory_allocated over to percpu_counter variables
From: Olof Johansson <hidden>
Date: 2018-09-08 17:02:53
Also in:
linux-crypto, linux-sctp, lkml
Hi,

On Fri, Sep 7, 2018 at 12:21 AM, Eric Dumazet [off-list ref] wrote:
> On Fri, Sep 7, 2018 at 12:03 AM Eric Dumazet [off-list ref] wrote:
>> Problem is : we have platforms with more than 100 cpus, and
>> sk_memory_allocated() cost will be too expensive, especially if the
>> host is under memory pressure, since all cpus will touch their
>> private counter.
>>
>> per cpu variables do not really scale, they were ok 10 years ago
>> when no more than 16 cpus were the norm.
>>
>> I would prefer change TCP to not aggressively call
>> __sk_mem_reduce_allocated() from tcp_write_timer()
>>
>> Ideally only tcp_retransmit_timer() should attempt to reduce forward
>> allocations, after recurring timeout.
>>
>> Note that after 20c64d5cd5a2bdcdc8982a06cb05e5e1bd851a3d ("net: avoid
>> sk_forward_alloc overflows") we have better control over sockets
>> having huge forward allocations.
>>
>> Something like :
>
> Or something less risky :

I gave both of these patches a run, and neither does as well on the
system that has slower atomics. :(
The percpu version:
  8.05%  workload  [kernel.vmlinux]  [k] __do_softirq
  7.04%  swapper   [kernel.vmlinux]  [k] cpuidle_enter_state
  5.54%  workload  [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
  1.66%  swapper   [kernel.vmlinux]  [k] __do_softirq
  1.55%  workload  [kernel.vmlinux]  [k] finish_task_switch
  1.24%  swapper   [kernel.vmlinux]  [k] finish_task_switch
  1.07%  workload  [kernel.vmlinux]  [k] net_rx_action
The first patch from you still has a significant amount of time spent
in the atomics paths (non-inlined versions used):
  7.87%  workload  [kernel.vmlinux]  [k] __ll_sc_atomic64_sub
  7.48%  workload  [kernel.vmlinux]  [k] __do_softirq
  5.05%  workload  [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
  2.42%  workload  [kernel.vmlinux]  [k] __ll_sc_atomic64_add_return
  1.49%  swapper   [kernel.vmlinux]  [k] cpuidle_enter_state
  1.31%  workload  [kernel.vmlinux]  [k] finish_task_switch
  1.09%  workload  [kernel.vmlinux]  [k] tcp_sendmsg_locked
  1.08%  workload  [kernel.vmlinux]  [k] __arch_copy_from_user
  1.02%  workload  [kernel.vmlinux]  [k] net_rx_action
I think a lot of the overhead from the percpu approach can be
alleviated if we can use percpu_counter_read() instead of _sum()
(i.e. no need to iterate over every cpu's local, not-yet-folded
delta). I don't know the TCP stack well enough to tell where it's OK
to tolerate a bit of slack in the numbers, though. By default the
count will be off by at most 32 * online cpus, which might not be a
significant number in reality.
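
To illustrate the read vs. sum trade-off, here is a minimal userspace
sketch of a batched per-cpu counter. It is not the kernel
implementation: the sim_* names, the fixed 4-cpu array and the batch
of 32 are assumptions made for illustration, and there is no locking
or real per-cpu storage. It only shows why a cheap read can lag the
exact sum by up to roughly batch * cpus.

```c
#include <stdint.h>

#define SIM_NR_CPUS 4   /* hypothetical cpu count for the simulation */
#define SIM_BATCH   32  /* assumed batch size, matching the "32" above */

struct sim_percpu_counter {
	int64_t count;               /* shared, approximate total */
	int32_t local[SIM_NR_CPUS];  /* per-cpu deltas not yet folded in */
};

/* Accumulate locally; fold into the shared count once the local
 * delta reaches the batch size in either direction. */
static void sim_counter_add(struct sim_percpu_counter *c, int cpu,
			    int32_t amount)
{
	int32_t delta = c->local[cpu] + amount;

	if (delta >= SIM_BATCH || delta <= -SIM_BATCH) {
		c->count += delta;
		c->local[cpu] = 0;
	} else {
		c->local[cpu] = delta;
	}
}

/* Cheap read: shared count only, may miss unfolded local deltas. */
static int64_t sim_counter_read(const struct sim_percpu_counter *c)
{
	return c->count;
}

/* Exact sum: walks every cpu's local delta, so cost grows with cpus. */
static int64_t sim_counter_sum(const struct sim_percpu_counter *c)
{
	int64_t sum = c->count;

	for (int cpu = 0; cpu < SIM_NR_CPUS; cpu++)
		sum += c->local[cpu];
	return sum;
}
```

In this sketch, adding 31 on each of the 4 simulated cpus leaves the
cheap read at 0 while the exact sum is 124, i.e. the slack stays below
SIM_BATCH * SIM_NR_CPUS.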
-Olof