Re: epoll_wait() performance
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: 2019-11-27 19:48:54
Also in:
lkml
On Wed, Nov 27, 2019 at 11:04 AM David Laight [off-list ref] wrote:
From: Jesper Dangaard Brouer
Sent: 27 November 2019 15:48
On Wed, 27 Nov 2019 10:39:44 +0000 David Laight [off-list ref] wrote:
...
While using recvmmsg() to read multiple messages might seem a good idea, it is much slower than recv() when there is only one message (even recvmsg() is a lot slower). (I'm not sure why the code paths are so slow, I suspect it is all the copy_from_user() and faffing with the user iov[].)

So using poll() we repoll the fd after calling recv() to find if there is a second message. However the second poll has a significant performance cost (but less than using recvmmsg()).

That sounds wrong. Single recvmmsg(), even when receiving only a single message, should be faster than two syscalls - recv() and poll().

My suspicion is the extra two copy_from_user() needed for each recvmsg are a significant overhead, most likely due to the crappy code that tries to stop the kernel buffer being overrun. I need to run the tests on a system with a 'home built' kernel to see how much difference this makes (by seeing how much slower duplicating the copy makes it).

The system call cost of poll() gets factored over a reasonable number of sockets. So doing poll() on a socket with no data is a lot faster than the setup for recvmsg even allowing for looking up the fd.

This could be fixed by an extra flag to recvmmsg() to indicate that you only really expect one message and to call the poll() function before each subsequent receive.

There is also the 'reschedule' that Eric added to the loop in recvmmsg. I don't know how much that actually costs. In this case the process is likely to be running at a RT priority and pinned to a cpu. In some cases the cpu is also reserved (at boot time) so that 'random' other code can't use it.

We really do want to receive all these UDP packets in a timely manner. Although very low latency isn't itself an issue. The data is telephony audio with (typically) one packet every 20ms.
The code only looks for packets every 10ms - that helps no end since, in principle, only a single poll()/epoll_wait() call (on all the sockets) is needed every 10ms.

I have a simple udp_sink tool[1] that cycles through the different receive socket system calls. I gave it a quick spin on a F31 kernel 5.3.12-300.fc31.x86_64 on a mlx5 100G interface, and I'm very surprised to see a significant regression/slowdown for recvMmsg.

$ sudo ./udp_sink --port 9 --repeat 1 --count $((10**7))
             run     count     ns/pkt   pps         cycles  payload
recvMmsg/32  run: 0  10000000  1461.41  684270.96   5261    18  demux:1
recvmsg      run: 0  10000000   889.82  1123824.84  3203    18  demux:1
read         run: 0  10000000   974.81  1025841.68  3509    18  demux:1
recvfrom     run: 0  10000000  1056.51  946513.44   3803    18  demux:1

Normal recvmsg has almost double the performance of recvmmsg.
recvMmsg/32 = 684,270 pps
recvmsg     = 1,123,824 pps

Can you test recv() as well? I think it might be faster than read().

...
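For readers unfamiliar with the batched call benchmarked above as recvMmsg/32, here is a minimal sketch of a non-blocking recvmmsg() read of up to 32 datagrams. It is not taken from udp_sink; the helper name recv_batch and the PKT_MAX buffer size are assumptions made purely for illustration.

```c
#define _GNU_SOURCE            /* recvmmsg() and struct mmsghdr are GNU extensions */
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH   32             /* same batch size as the recvMmsg/32 test */
#define PKT_MAX 2048           /* assumed upper bound on datagram size */

/* Hypothetical helper: read up to BATCH queued datagrams in one syscall.
 * lens[i] receives the length of datagram i. Returns the number of
 * datagrams read, or -1 with errno set (EAGAIN if nothing was queued). */
static int recv_batch(int fd, char bufs[BATCH][PKT_MAX], unsigned int lens[BATCH])
{
    struct mmsghdr msgs[BATCH];
    struct iovec iov[BATCH];

    assert(fd >= 0);
    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH; i++) {
        /* One iovec per message: this per-message setup is part of the
         * cost when only a single datagram is actually waiting. */
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = PKT_MAX;
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    /* MSG_DONTWAIT: return whatever is queued right now rather than
     * blocking until all BATCH slots are filled. */
    int n = recvmmsg(fd, msgs, BATCH, MSG_DONTWAIT, NULL);
    for (int i = 0; i < n; i++)
        lens[i] = msgs[i].msg_len;
    return n;
}
```

A caller would invoke recv_batch() once per readable socket and then walk bufs[0..n-1]. Whether this beats a plain recv() for the common one-datagram case is exactly what the numbers in this thread call into question.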
Found some old results (approx v4.10-rc1):

[brouer@skylake src]$ sudo taskset -c 2 ./udp_sink --count $((10**7)) --port 9 --connect
recvMmsg/32  run: 0  10000000  537.89  1859106.74  2155  21559353816
recvmsg      run: 0  10000000  552.69  1809344.44  2215  22152468673
read         run: 0  10000000  476.65  2097970.76  1910  19104864199
recvfrom     run: 0  10000000  450.76  2218492.60  1806  18066972794

That is probably nearer what I am seeing on a 4.15 Ubuntu 18.04 kernel. recvmmsg() and recvmsg() are similar - but both a lot slower than recv().
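The recv()-then-repoll pattern described in the quoted text can be sketched roughly as follows. This is an illustration, not the actual application code: the helper name drain_with_repoll and the max_msgs cap are hypothetical, and it assumes the main poll()/epoll_wait() pass has already reported the fd as readable.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/* Hypothetical helper: read one datagram with recv(), then re-poll the
 * same fd with a zero timeout to see whether another one is queued.
 * Returns the number of datagrams read, or -1 on a hard error. */
static int drain_with_repoll(int fd, char *buf, size_t len, int max_msgs)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = 0;

    assert(max_msgs > 0);
    do {
        /* MSG_DONTWAIT guards against blocking if the caller's
         * readiness information turned out to be stale. */
        if (recv(fd, buf, len, MSG_DONTWAIT) < 0)
            return (errno == EAGAIN || errno == EWOULDBLOCK) ? n : -1;
        n++;
        /* A real caller would process buf here before it is reused. */
        pfd.revents = 0;
        /* Timeout 0: poll() never sleeps, it only reports readiness. */
    } while (n < max_msgs && poll(&pfd, 1, 0) == 1 && (pfd.revents & POLLIN));

    return n;
}
```

The zero-timeout poll() after each recv() is the extra per-socket syscall whose cost is being weighed against recvmmsg() in this thread.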
Indeed, surprising that recv(from) would be less efficient than recvmsg.

Are the latest numbers with CONFIG_HARDENED_USERCOPY?

I assume that the poll() after recv() is non-blocking. If using recvmsg, that extra syscall could be avoided by implementing a cmsg inq hint for udp sockets analogous to TCP_CM_INQ/tcp_inq_hint.

More outlandish would be to abuse the mmsghdr->msg_len field to pass file descriptors and amortize the kernel page-table isolation cost across sockets. Blocking semantics would be weird, for starters.
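For context, this is roughly how the existing TCP hint is consumed from userspace. It is a minimal sketch for TCP sockets only; the UDP analogue suggested above would need kernel support and is not shown. The helper names enable_inq_hint and recv_with_inq are hypothetical, and the fallback #define values assume the constants from linux/tcp.h for libc headers that predate them.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef TCP_INQ
#define TCP_INQ 36             /* assumed from linux/tcp.h for older libc headers */
#endif
#ifndef TCP_CM_INQ
#define TCP_CM_INQ TCP_INQ
#endif

/* Ask the kernel to attach a TCP_CM_INQ cmsg to each recvmsg() on fd.
 * Fails on kernels or socket types without TCP_INQ support. */
static int enable_inq_hint(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_INQ, &one, sizeof(one));
}

/* Hypothetical helper: recvmsg() that also reports how many bytes remain
 * queued after this read. *inq is left at -1 if no hint was returned. */
static ssize_t recv_with_inq(int fd, void *buf, size_t len, int *inq)
{
    /* The union keeps the control buffer aligned for struct cmsghdr. */
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf),
    };

    assert(inq != NULL);
    *inq = -1;
    ssize_t n = recvmsg(fd, &msg, 0);
    if (n < 0)
        return n;

    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == IPPROTO_TCP && c->cmsg_type == TCP_CM_INQ)
            memcpy(inq, CMSG_DATA(c), sizeof(*inq));
    }
    return n;
}
```

With such a hint, a caller can skip the follow-up readiness check whenever the reported value is 0, which is the syscall saving the paragraph above is after.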