Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
From: Eric Dumazet <hidden>
Date: 2015-10-23 13:02:23
On Fri, 2015-10-23 at 11:52 +0200, Casper.Dik@oracle.com wrote:
> > Ho-hum... It could even be made lockless in fast path; the problems
> > I see are
> > * descriptor-to-file lookup becomes unsafe in a lot of locking
> > conditions. Sure, most of that happens on the entry to some syscall,
> > with very light locking environment, but... auditing every sodding
> > ioctl that might be doing such lookups is an interesting exercise,
> > and then there are ->mount() instances doing the same thing. And
> > procfs accesses. Probably nothing impossible to deal with, but
> > nothing pleasant either.
>
> In the Solaris kernel code, the ioctl code is generally not handed a
> file descriptor but instead a file pointer (i.e., the lookup is done
> early in the system call).
>
> In those specific cases where a system call needs to convert a file
> descriptor to a file pointer, there is only one routine which can be
> used.
> > * memory footprint. In case of Linux on amd64 or sparc64,
> >
> > #include <unistd.h>
> >
> > int main(void)
> > {
> >         int i;
> >         for (i = 0; i < 1<<24; dup2(0, i++))  // 16M descriptors
> >                 ;
> > }
> >
> > will chew 132MB of kernel data (16M pointers + 32Mbit, assuming
> > sufficient ulimit -n, of course). How much will Solaris eat on the
> > same?
>
> Yeah, that is a large amount of memory. Of course, the table is only
> sized when it is extended and there is a reason why there is a limit
> on file descriptors. But we're using more data per file descriptor
> entry.
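The estimate of 132 megabytes quoted above can be reproduced with a few
lines of C. This is only a back-of-the-envelope sketch: it assumes
8-byte pointers (amd64 or sparc64) and reads "32Mbit" as two bitmap bits
per descriptor. fdtable_bytes() is a made-up name, not a kernel
function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: bytes needed for a descriptor
 * table with nr_fds slots, assuming one pointer of ptr_size bytes and
 * bits_per_fd bitmap bits per slot.
 */
static uint64_t fdtable_bytes(uint64_t nr_fds, uint64_t ptr_size,
                              uint64_t bits_per_fd)
{
        /* Guard against overflow in the multiplications below. */
        assert(ptr_size > 0 && nr_fds <= UINT64_MAX / ptr_size);
        assert(bits_per_fd == 0 || nr_fds <= UINT64_MAX / bits_per_fd);

        uint64_t pointers = nr_fds * ptr_size;        /* e.g. 16M * 8 bytes */
        uint64_t bitmaps = nr_fds * bits_per_fd / 8;  /* e.g. 32Mbit / 8 */

        return pointers + bitmaps;
}
```

With nr_fds = 1<<24, ptr_size = 8 and bits_per_fd = 2, this gives 128MB
of pointers plus 4MB of bitmaps.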
> > * related to the above - how much cacheline sharing will that
> > involve? These per-descriptor use counts are bitch to pack, and
> > giving each a cacheline of its own... <shudder>
>
> As I said, we do actually use a lock and yes that means that you
> really want to have a single cache line for each and every entry. It
> does make it easy to have non-racy file description updates. You
> certainly do not want false sharing when there is a lot of contention.
>
> Other data is used to make sure that it only takes O(log(n)) to find
> the lowest available file descriptor entry. (Where n, I think, is the
> returned descriptor)
Yet another POSIX deficiency.

When a server deals with 10,000,000+ sockets, we absolutely do not care
about this requirement (returning the lowest available descriptor).

O(log(n)) is still crazy if it involves O(log(n)) cache misses.
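To make the cache-miss concern concrete, here is a sketch of a
lowest-free-descriptor search. It is neither the Solaris structure
described above (its details are not given in this thread) nor the Linux
fdtable code. It is a plain two-level bitmap with made-up names
(struct fd_bitmap, fd_alloc_lowest, fd_free) and an arbitrary limit of
4096 descriptors. It shows that the lowest-descriptor rule forces a
search on every allocation, costing about one memory access per level of
the structure.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_FDS   4096                   /* illustrative limit only */
#define WORD_BITS 64
#define NR_WORDS  (MAX_FDS / WORD_BITS)  /* 64 words, one summary bit each */

struct fd_bitmap {
        uint64_t open[NR_WORDS];  /* bit set => descriptor in use */
        uint64_t full;            /* bit w set => open[w] is all ones */
};

static void fd_bitmap_init(struct fd_bitmap *b)
{
        memset(b, 0, sizeof(*b));
}

/* Index of the lowest zero bit in w; w must not be all ones. */
static int lowest_zero(uint64_t w)
{
        int i = 0;

        while (w & 1) {
                w >>= 1;
                i++;
        }
        return i;
}

/* Allocate and return the lowest free descriptor, or -1 if none is left. */
static int fd_alloc_lowest(struct fd_bitmap *b)
{
        if (b->full == UINT64_MAX)
                return -1;

        int w = lowest_zero(b->full);       /* level 1: first non-full word */
        int bit = lowest_zero(b->open[w]);  /* level 2: first free bit in it */

        b->open[w] |= (uint64_t)1 << bit;
        if (b->open[w] == UINT64_MAX)
                b->full |= (uint64_t)1 << w;
        return w * WORD_BITS + bit;
}

/* Release descriptor fd so that a later allocation can reuse it. */
static void fd_free(struct fd_bitmap *b, int fd)
{
        assert(fd >= 0 && fd < MAX_FDS);

        int w = fd / WORD_BITS;

        b->open[w] &= ~((uint64_t)1 << (fd % WORD_BITS));
        b->full &= ~((uint64_t)1 << w);
}
```

Here a search touches two words. A deeper structure covering millions of
descriptors adds one potentially cold cache line per level, which is the
cost being objected to.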
> Uncontended locks aren't expensive. And all is done on a single cache
> line.
>
> One question about the Linux implementation: what happens when a
> socket in select is closed? I'm assuming that the kernel waits until
> "shutdown" is given or when a connection comes in?
>
> Is it a problem that you can "hide" your listening socket with a
> thread in accept()? I would think so (it would be visible in netstat
> but you can't easily find out who has it).
Again, netstat -p on a server with 10,000,000 sockets never completes.

Never try this, unless you are desperate and want to avoid a reboot,
maybe.

If you absolutely want to nuke a listener because of untrusted
applications, we had better implement a proper syscall.

Android has such a facility.

An alternative would be to extend netlink (the ss command from the
iproute2 package) to carry one pid per socket.

ss -atnp state listening

-> would not have to readlink(/proc/*/fd/*)