Re: SO_REUSEPORT - can it be done in kernel?
From: Eric Dumazet <hidden>
Date: 2011-02-28 14:55:15
On Monday 28 February 2011 at 14:32 +0100, Eric Dumazet wrote:
> On Monday 28 February 2011 at 19:36 +0800, Herbert Xu wrote:
> > On Sun, Feb 27, 2011 at 07:06:14PM +0800, Herbert Xu wrote:
> > > I'm working on this right now.
> >
> > OK I think I was definitely on the right track. With the send
> > patch made lockless I now get numbers which are even better than
> > those obtained with running named with multiple sockets. That's
> > right, a single socket is now faster than what multiple sockets
> > were without the patch (of course, multiple sockets may still be
> > faster with the patch vs. a single socket for obvious reasons, but
> > I couldn't measure any significant difference).
> >
> > Also worthy of note is that prior to the patch all CPUs showed
> > idleness (lazy bastards!), with the patch they're all maxed out.
> >
> > In retrospect, the idleness was simply the result of the socket
> > lock scheduling away and was an indication of lock contention.
>
> Now, input path can run without finding socket locked by xmit path, so
> skb are queued into receive queue, not backlog one.
>
> > Here are the patches I used. Please don't apply them yet as I intend
> > to clean them up quite a bit. But please do test them heavily,
> > especially if you have an AMD NUMA machine as that's where
> > scalability problems really show up. Intel tends to be a lot more
> > forgiving. My last AMD machine blew up years ago :)
>
> I am going to test them, thanks !

First "sending only" tests on my 2x4x2 machine (two E5540@2.53GHz, quad
core, hyper threaded, NUMA kernel)
16 threads, each one sending 100,000 UDP frames using a _shared_ socket.
I use the same destination IP, so the test suffers a bit of dst refcount
contention.
(Frames go to the dummy0 device to avoid contention on the qdisc and device.)
# ip ro get 10.2.2.21
10.2.2.21 dev dummy0 src 10.2.2.2
cache
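
For readers without udpflood at hand, the shape of the test can be sketched
roughly as below. This is not the udpflood source, only a minimal
illustration of "N threads sending on one shared UDP socket". The names
run_flood and flood, the 64-byte payload and the port number are
assumptions made for the example, and a Python sender will not reach the
frame rates of the C tool.

```python
import socket
import threading


def flood(sock, dest, frames, payload, sent, index):
    """Send `frames` datagrams on the shared socket and record the count."""
    count = 0
    for _ in range(frames):
        sock.sendto(payload, dest)
        count += 1
    sent[index] = count


def run_flood(dest, threads=16, frames_per_thread=100000, payload=b"x" * 64):
    """Start `threads` senders that all use the SAME UDP socket.

    Returns the total number of datagrams handed to sendto().
    The payload size is an assumption; the original mail does not state it.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sent = [0] * threads
    try:
        workers = [
            threading.Thread(
                target=flood,
                args=(sock, dest, frames_per_thread, payload, sent, i),
            )
            for i in range(threads)
        ]
        for worker in workers:
            worker.start()
        for worker in workers:
            worker.join()
    finally:
        sock.close()
    return sum(sent)
```

Something like run_flood(("10.2.2.21", 9)) would mirror the 16 x 100,000
frame run, with port 9 being an arbitrary choice here.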
LOCKDEP enabled kernel
Before :
time ./udpflood -f -t 16 -l 100000 10.2.2.21
real 0m42.749s
user 0m1.010s
sys 1m38.039s
After :
time ./udpflood -f -t 16 -l 100000 10.2.2.21
real 0m1.167s
user 0m0.488s
sys 0m17.373s
With one thread only and 16*100000 frames :
# time ./udpflood -f -l 1600000 10.2.2.21
real 0m9.318s
user 0m0.238s
sys 0m9.052s
(We have some false sharing on atomic fields in struct file and socket,
but nothing to worry about.)
With LOCKDEP OFF :
16 threads :
# time ./udpflood -f -t 16 -l 100000 10.2.2.21
real 0m0.718s
user 0m0.376s
sys 0m10.963s
1 thread :
# time ./udpflood -f -l 1600000 10.2.2.21
real 0m1.514s
user 0m0.153s
sys 0m1.357s
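
To put the timings above in perspective, here is a small calculation of
speedup and frame rate from the "real" values. The figures are copied from
the runs above; the helper names (speedup, frames_per_second) and the dict
keys are only labels chosen for this sketch.

```python
# Every run above sends the same total: 16 threads * 100000 frames,
# or 1 thread * 1600000 frames.
TOTAL_FRAMES = 16 * 100000

# Wall-clock ("real") seconds copied from the measurements above.
REAL_SECONDS = {
    "lockdep, 16 threads, before": 42.749,
    "lockdep, 16 threads, after": 1.167,
    "lockdep, 1 thread, after": 9.318,
    "no lockdep, 16 threads, after": 0.718,
    "no lockdep, 1 thread, after": 1.514,
}


def speedup(before, after):
    """How many times faster the `after` run finished."""
    return before / after


def frames_per_second(real_seconds, frames=TOTAL_FRAMES):
    """Average send rate over the whole run."""
    return frames / real_seconds


patch_gain = speedup(
    REAL_SECONDS["lockdep, 16 threads, before"],
    REAL_SECONDS["lockdep, 16 threads, after"],
)
thread_gain = speedup(
    REAL_SECONDS["no lockdep, 1 thread, after"],
    REAL_SECONDS["no lockdep, 16 threads, after"],
)

print(f"patch, 16 threads, lockdep on: {patch_gain:.1f}x faster")
print(f"16 threads vs 1 thread, lockdep off: {thread_gain:.1f}x faster")
for name, seconds in REAL_SECONDS.items():
    print(f"{name}: {frames_per_second(seconds):,.0f} frames/s")
```

These are wall-clock ratios only; the "sys" column shows that the 16 thread
runs still burn far more total CPU time than the single thread runs.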
"perf record/report" results for the 16 threads case (no lockdep)
# Events: 389K cpu-clock-msecs
#
# Overhead   Command  Shared Object      Symbol
# ........  ........  .................  ............................
#
     9.03%  udpflood  [kernel.kallsyms]  [k] sock_wfree
     8.58%  udpflood  [kernel.kallsyms]  [k] __ip_route_output_key
     8.52%  udpflood  [kernel.kallsyms]  [k] sock_alloc_send_pskb
     7.46%  udpflood  [kernel.kallsyms]  [k] sock_def_write_space
     6.76%  udpflood  [kernel.kallsyms]  [k] __xfrm_lookup
     6.18%  swapper   [kernel.kallsyms]  [k] acpi_idle_enter_bm
     5.66%  udpflood  [kernel.kallsyms]  [k] dst_release
     4.96%  udpflood  [kernel.kallsyms]  [k] udp_sendmsg
     3.48%  udpflood  [kernel.kallsyms]  [k] fget_light
     2.75%  udpflood  [kernel.kallsyms]  [k] sock_tx_timestamp
     2.40%  udpflood  [kernel.kallsyms]  [k] __ip_make_skb
     2.36%  udpflood  [kernel.kallsyms]  [k] fput
     1.87%  swapper   [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     1.81%  udpflood  [kernel.kallsyms]  [k] inet_sendmsg
     1.53%  udpflood  [kernel.kallsyms]  [k] sys_sendto
     1.50%  udpflood  [kernel.kallsyms]  [k] ip_finish_output
     1.31%  udpflood  [kernel.kallsyms]  [k] csum_partial_copy_generic
     1.30%  udpflood  udpflood           [.] do_thread
     1.28%  udpflood  [kernel.kallsyms]  [k] __ip_append_data
     1.08%  udpflood  [kernel.kallsyms]  [k] __memset
     1.05%  udpflood  [kernel.kallsyms]  [k] ip_route_output_flow
     0.91%  udpflood  [kernel.kallsyms]  [k] kfree
     0.88%  udpflood  [vdso]             [.] 0xffffe430
     0.83%  udpflood  [kernel.kallsyms]  [k] copy_user_generic_string
     0.77%  udpflood  [kernel.kallsyms]  [k] ia32_sysenter_target
     0.78%  udpflood  libc-2.3.4.so      [.] __GI_memcpy

What would you suggest for running a bind-based test?