Re: epoll_wait() performance

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: 2019-11-27 19:48:54
Also in: lkml

On Wed, Nov 27, 2019 at 11:04 AM David Laight [off-list ref] wrote:

From: Jesper Dangaard Brouer

Sent: 27 November 2019 15:48
On Wed, 27 Nov 2019 10:39:44 +0000 David Laight [off-list ref] wrote:

...

While using recvmmsg() to read multiple messages might seem a good idea, it is much
slower than recv() when there is only one message (even recvmsg() is a lot slower).
(I'm not sure why the code paths are so slow, I suspect it is all the copy_from_user()
and faffing with the user iov[].)

So using poll() we repoll the fd after calling recv() to find if there is a second message.
However the second poll has a significant performance cost (but less than using recvmmsg()).
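
For readers following along, here is a minimal sketch of the recv()-then-repoll sequence described above. The function name drain_after_pollin and the buffer handling are illustrative assumptions, not code from this thread. It assumes the caller's main poll() has already reported POLLIN on fd.

```c
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: the caller has already seen POLLIN on fd.
 * Read one datagram with recv(), then use a zero-timeout poll() to
 * check whether another datagram is queued, and repeat while it is.
 * Returns the number of datagrams read, or -1 on error.
 */
static int drain_after_pollin(int fd, char *buf, size_t len)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = 0;

    do {
        if (recv(fd, buf, len, 0) < 0)
            return -1;
        n++;                        /* a real caller would process buf here */

        pfd.revents = 0;
        if (poll(&pfd, 1, 0) < 0)   /* timeout 0: returns immediately */
            return -1;
    } while (pfd.revents & POLLIN);

    return n;
}
```

With one datagram queued this costs two syscalls (recv() plus the poll() that finds nothing), which is the extra cost being discussed.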

That sounds wrong. Single recvmmsg(), even when receiving only a
single message, should be faster than two syscalls - recv() and
poll().
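
For comparison, a hedged sketch of the single-syscall alternative: one recvmmsg() call with MSG_DONTWAIT that returns however many datagrams are queued, up to the batch size. BATCH, PKT_MAX and drain_with_recvmmsg are made-up names for illustration. Whether this is actually faster than recv() plus poll() is exactly what the thread disputes.

```c
#define _GNU_SOURCE             /* recvmmsg() is a Linux-specific extension */
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

#define BATCH   8               /* illustrative batch size */
#define PKT_MAX 2048            /* illustrative per-datagram buffer size */

/*
 * Hypothetical helper: receive up to BATCH datagrams in one syscall.
 * Returns the number received and stores each length in lens[].
 * Returns -1 (errno EAGAIN) if nothing is queued.
 */
static int drain_with_recvmmsg(int fd, char bufs[BATCH][PKT_MAX],
                               unsigned int lens[BATCH])
{
    struct mmsghdr msgs[BATCH];
    struct iovec iov[BATCH];
    int i, n;

    /* Each message header needs its own iovec; this per-call setup is
     * part of the overhead discussed in the thread. */
    memset(msgs, 0, sizeof(msgs));
    for (i = 0; i < BATCH; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = PKT_MAX;
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    n = recvmmsg(fd, msgs, BATCH, MSG_DONTWAIT, NULL);
    for (i = 0; i < n; i++)
        lens[i] = msgs[i].msg_len;  /* bytes received for datagram i */

    return n;
}
```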

My suspicion is the extra two copy_from_user() needed for each recvmsg are a
significant overhead, most likely due to the crappy code that tries to stop
the kernel buffer being overrun.

I need to run the tests on a system with a 'home built' kernel to see how much
difference this makes (by seeing how much slower duplicating the copy makes it).

The system call cost of poll() gets factored over a reasonable number of sockets.
So doing poll() on a socket with no data is a lot faster than the setup for recvmsg
even allowing for looking up the fd.

This could be fixed by an extra flag to recvmmsg() to indicate that you only really
expect one message and to call the poll() function before each subsequent receive.

There is also the 'reschedule' that Eric added to the loop in recvmmsg.
I don't know how much that actually costs.
In this case the process is likely to be running at a RT priority and pinned to a cpu.
In some cases the cpu is also reserved (at boot time) so that 'random' other code can't use it.

We really do want to receive all these UDP packets in a timely manner.
Although very low latency isn't itself an issue.
The data is telephony audio with (typically) one packet every 20ms.
The code only looks for packets every 10ms - that helps no end since, in principle,
only a single poll()/epoll_wait() call (on all the sockets) is needed every 10ms.
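
A rough sketch of what one such 10ms tick could look like with epoll: a single zero-timeout epoll_wait() over all sockets, followed by non-blocking reads on the ready ones. This is an assumption about the shape of the loop, not the actual application code. service_tick and MAX_EVENTS are hypothetical names, and draining with MSG_DONTWAIT until the read fails is just one possible choice (it also costs one failing syscall per ready socket).

```c
#include <stddef.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>

#define MAX_EVENTS 64           /* illustrative limit per epoll_wait() call */

/*
 * Hypothetical per-tick handler. Assumes each socket was registered
 * with EPOLLIN and with its descriptor stored in data.fd.
 * Returns the number of datagrams read, or -1 if epoll_wait() fails.
 */
static int service_tick(int epfd, char *buf, size_t len)
{
    struct epoll_event evs[MAX_EVENTS];
    int i, total = 0;
    int n = epoll_wait(epfd, evs, MAX_EVENTS, 0);   /* timeout 0: no wait */

    if (n < 0)
        return -1;

    for (i = 0; i < n; i++) {
        if (!(evs[i].events & EPOLLIN))
            continue;
        /* Read until the queue is empty; the final recv() fails with
         * EAGAIN instead of blocking. */
        while (recv(evs[i].data.fd, buf, len, MSG_DONTWAIT) >= 0)
            total++;
    }
    return total;
}
```

The caller would invoke this once per 10ms period and sleep in between.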

I have a simple udp_sink tool[1] that cycles through the different
receive socket system calls.  I gave it a quick spin on a F31 kernel
5.3.12-300.fc31.x86_64 on a mlx5 100G interface, and I'm very surprised
to see a significant regression/slowdown for recvMmsg.

$ sudo ./udp_sink --port 9 --repeat 1 --count $((10**7))
              run      count          ns/pkt  pps             cycles  payload
recvMmsg/32   run:  0 10000000        1461.41 684270.96       5261    18       demux:1
recvmsg       run:  0 10000000        889.82  1123824.84      3203    18       demux:1
read          run:  0 10000000        974.81  1025841.68      3509    18       demux:1
recvfrom      run:  0 10000000        1056.51 946513.44       3803    18       demux:1

Normal recvmsg has almost double the performance of recvmmsg.
 recvMmsg/32 = 684,270 pps
 recvmsg     = 1,123,824 pps

Can you test recv() as well?
I think it might be faster than read().

...

Found some old results (approx v4.10-rc1):

[brouer@skylake src]$ sudo taskset -c 2 ./udp_sink --count $((10**7)) --port 9 --connect
 recvMmsg/32    run: 0 10000000 537.89  1859106.74      2155    21559353816
 recvmsg        run: 0 10000000 552.69  1809344.44      2215    22152468673
 read           run: 0 10000000 476.65  2097970.76      1910    19104864199
 recvfrom       run: 0 10000000 450.76  2218492.60      1806    18066972794

That is probably nearer what I am seeing on a 4.15 Ubuntu 18.04 kernel.
recvmmsg() and recvmsg() are similar - but both a lot slower than recv().

Indeed, surprising that recv(from) would be less efficient than recvmsg.

Are the latest numbers with CONFIG_HARDENED_USERCOPY?

I assume that the poll() after recv() is non-blocking. If using
recvmsg, that extra syscall could be avoided by implementing a cmsg
inq hint for udp sockets analogous to TCP_CM_INQ/tcp_inq_hint.
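
For reference, this is roughly how the existing TCP mechanism is used; the suggestion above is for a UDP analogue, which this sketch does not show. The helper names enable_inq and recv_with_inq are hypothetical. The fallback value 36 for TCP_INQ is taken from linux/tcp.h and is only used if the libc headers lack the constant.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

#ifndef TCP_INQ
#define TCP_INQ 36              /* from linux/tcp.h, for older libc headers */
#endif
#ifndef TCP_CM_INQ
#define TCP_CM_INQ TCP_INQ
#endif

/* Ask the kernel to attach an "inq" cmsg to every recvmsg() on fd. */
static int enable_inq(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_INQ, &one, sizeof(one));
}

/*
 * recvmsg() wrapper that also reports, via *inq, how many bytes remain
 * queued after this read (-1 if the kernel sent no hint). A caller can
 * skip the follow-up poll() when *inq is 0.
 */
static ssize_t recv_with_inq(int fd, void *buf, size_t len, int *inq)
{
    union {                     /* union guarantees cmsghdr alignment */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg;
    struct cmsghdr *cm;
    ssize_t n;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    *inq = -1;
    n = recvmsg(fd, &msg, 0);
    if (n < 0)
        return n;

    for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm))
        if (cm->cmsg_level == IPPROTO_TCP && cm->cmsg_type == TCP_CM_INQ)
            memcpy(inq, CMSG_DATA(cm), sizeof(*inq));

    return n;
}
```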

More outlandish would be to abuse the mmsghdr->msg_len field to pass
file descriptors and amortize the kernel page-table isolation cost
across sockets. Blocking semantics would be weird, for starters.
