Re: [PATCH] net/sock: move memory_allocated over to percpu_counter variables

From: Olof Johansson <hidden>
Date: 2018-09-08 17:02:53
Also in: linux-crypto, linux-sctp, lkml

Hi,

On Fri, Sep 7, 2018 at 12:21 AM, Eric Dumazet [off-list ref] wrote:

On Fri, Sep 7, 2018 at 12:03 AM Eric Dumazet [off-list ref] wrote:

Problem is : we have platforms with more than 100 cpus, and
sk_memory_allocated() cost will be too expensive,
especially if the host is under memory pressure, since all cpus will
touch their private counter.

per cpu variables do not really scale, they were ok 10 years ago when
no more than 16 cpus were the norm.

I would prefer change TCP to not aggressively call
__sk_mem_reduce_allocated() from tcp_write_timer()

Ideally only tcp_retransmit_timer() should attempt to reduce forward
allocations, after recurring timeout.

Note that after 20c64d5cd5a2bdcdc8982a06cb05e5e1bd851a3d ("net: avoid
sk_forward_alloc overflows")
we have better control over sockets having huge forward allocations.

Something like :

Or something less risky :

I gave both of these patches a run, and neither does as well on the
system that has slower atomics. :(

The percpu version:

     8.05%  workload         [kernel.vmlinux]  [k] __do_softirq
     7.04%  swapper          [kernel.vmlinux]  [k] cpuidle_enter_state
     5.54%  workload         [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
     1.66%  swapper          [kernel.vmlinux]  [k] __do_softirq
     1.55%  workload         [kernel.vmlinux]  [k] finish_task_switch
     1.24%  swapper          [kernel.vmlinux]  [k] finish_task_switch
     1.07%  workload         [kernel.vmlinux]  [k] net_rx_action

The first patch from you still has a significant amount of time spent in
the atomics paths (non-inlined versions used):

     7.87%  workload         [kernel.vmlinux]  [k] __ll_sc_atomic64_sub
     7.48%  workload         [kernel.vmlinux]  [k] __do_softirq
     5.05%  workload         [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
     2.42%  workload         [kernel.vmlinux]  [k] __ll_sc_atomic64_add_return
     1.49%  swapper          [kernel.vmlinux]  [k] cpuidle_enter_state
     1.31%  workload         [kernel.vmlinux]  [k] finish_task_switch
     1.09%  workload         [kernel.vmlinux]  [k] tcp_sendmsg_locked
     1.08%  workload         [kernel.vmlinux]  [k] __arch_copy_from_user
     1.02%  workload         [kernel.vmlinux]  [k] net_rx_action

I think a lot of the overhead from the percpu approach can be alleviated
if we can use percpu_counter_read() instead of percpu_counter_sum()
(i.e. no need to iterate through each CPU's recent local delta). I don't
know the TCP stack well enough to tell where it's OK to use a bit of
slack in the numbers though -- by default the count will be off by at
most 32 * online CPUs. Might not be a significant number in reality.
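
To make the read-vs-sum tradeoff concrete, here is a rough user-space
sketch of a batched counter. It is not the kernel implementation: the
sim_* names, NR_CPUS = 4 and the single-threaded, lock-free structure are
all assumptions made for illustration. Only the batch of 32 comes from
the default mentioned above.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4   /* assumption: small CPU count for the sketch */
#define BATCH   32  /* default batch size referred to above */

/* Hypothetical stand-in for a batched per-CPU counter. */
struct sim_counter {
	int64_t count;           /* shared value, updated only on a fold */
	int32_t local[NR_CPUS];  /* per-CPU deltas not yet folded in */
};

/*
 * Add 'amount' on behalf of 'cpu'. The shared count is touched only
 * when the CPU's pending delta reaches the batch size in either
 * direction; otherwise the update stays CPU-local.
 */
static void sim_add(struct sim_counter *c, int cpu, int32_t amount)
{
	assert(cpu >= 0 && cpu < NR_CPUS);

	int32_t pending = c->local[cpu] + amount;

	if (pending >= BATCH || pending <= -BATCH) {
		c->count += pending;   /* fold into the shared value */
		c->local[cpu] = 0;
	} else {
		c->local[cpu] = pending;
	}
}

/* Cheap, approximate: looks at the shared value only. */
static int64_t sim_read(const struct sim_counter *c)
{
	return c->count;
}

/* Exact, but has to visit every CPU's pending delta. */
static int64_t sim_sum(const struct sim_counter *c)
{
	int64_t total = c->count;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		total += c->local[cpu];
	return total;
}
```

In this sketch each CPU can hold back strictly less than BATCH in either
direction, so sim_read() differs from sim_sum() by less than
BATCH * NR_CPUS. Whether that much slack is acceptable for the memory
pressure checks is the open question above.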


-Olof
