Thread (138 messages), 9 authors, 2015-11-06

Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

From: Eric Dumazet <hidden>
Date: 2015-10-23 13:02:23

On Fri, 2015-10-23 at 11:52 +0200, Casper.Dik@oracle.com wrote:
quoted
Ho-hum...  It could even be made lockless in fast path; the problems I see
are
* descriptor-to-file lookup becomes unsafe in a lot of locking
conditions.  Sure, most of that happens on the entry to some syscall, with
very light locking environment, but... auditing every sodding ioctl that
might be doing such lookups is an interesting exercise, and then there are
->mount() instances doing the same thing.  And procfs accesses.  Probably
nothing impossible to deal with, but nothing pleasant either.
In the Solaris kernel code, the ioctl code is generally not handed a file 
descriptor but instead a file pointer (i.e., the lookup is done early in 
the system call).

In those specific cases where a system call needs to convert a file 
descriptor to a file pointer, there is only one routine which can be used.
quoted
* memory footprint.  In case of Linux on amd64 or sparc64,
#include <unistd.h>

int main(void)
{
	int i;

	/* dup stdin onto every descriptor from 0 to 2^24 - 1 */
	for (i = 0; i < 1<<24; dup2(0, i++))	// 16M descriptors
		;
	return 0;
}
will chew 132Mb of kernel data (16Mpointer + 32Mbit, assuming sufficient ulimit -n,
of course).  How much will Solaris eat on the same?
Yeah, that is a large amount of memory.  Of course, the table is only 
sized when it is extended and there is a reason why there is a limit on 
file descriptors.  But we're using more data per file descriptor entry.
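
The 132Mb figure quoted above can be checked with a few lines of arithmetic. This is only a sketch: the helper name `fd_table_bytes` is made up for illustration, and the inputs (8-byte pointers, 2 bitmap bits per descriptor) are simply the "16Mpointer + 32Mbit" numbers from the message, not values read from any kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for nfds descriptor slots: one pointer per slot plus
 * bits_per_fd bitmap bits per slot. Both parameters are assumptions
 * taken from the figures quoted in the message. */
static uint64_t fd_table_bytes(uint64_t nfds, uint64_t ptr_size,
                               uint64_t bits_per_fd)
{
    assert(ptr_size > 0);
    return nfds * ptr_size + nfds * bits_per_fd / 8;
}
```

With `nfds = 1 << 24`, `ptr_size = 8` and `bits_per_fd = 2`, the pointer array is 128 MiB and the bitmaps add 4 MiB, which gives the 132 quoted.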

quoted
* related to the above - how much cacheline sharing will that involve?
These per-descriptor use counts are bitch to pack, and giving each a cacheline
of its own...  <shudder>
As I said, we do actually use a lock and yes that means that you really  
want to have a single cache line for each and every entry.  It does make 
it easy to have non-racy file description updates.  You certainly do not 
want false sharing when there is a lot of contention.
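
To make the "single cache line for each and every entry" idea concrete, here is a minimal sketch of such an entry. It is not the Solaris structure: the names `fd_entry` and `padded_table_bytes`, the field layout, and the 64-byte line size are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size; common on amd64 */

/* Hypothetical per-descriptor entry: file pointer, lock and use count
 * share one line, and the alignment keeps neighbouring entries off it,
 * so contention on one descriptor causes no false sharing with others. */
struct fd_entry {
    alignas(CACHE_LINE) void *file;   /* stand-in for the file pointer */
    int lock;                         /* stand-in for a real lock type */
    int refcnt;                       /* per-descriptor use count */
};

static_assert(sizeof(struct fd_entry) == CACHE_LINE, "one entry per line");

/* Memory used by a table of nfds such padded entries. */
static uint64_t padded_table_bytes(uint64_t nfds)
{
    return nfds * sizeof(struct fd_entry);
}
```

Under these assumptions a table of 16M entries would take 1 GiB, which is the cost side of the trade-off being discussed.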

Other data is used to make sure that it only takes O(log(n)) to find the 
lowest available file descriptor entry.  (Where n, I think, is the returned
descriptor)
Yet another POSIX deficiency.

When a server deals with 10,000,000+ sockets, we absolutely do not care
about this requirement.

O(log(n)) is still crazy if it involves O(log(n)) cache misses.
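
For readers who want to see what an O(log(n)) lowest-free-descriptor search can look like, here is a minimal sketch using a binary tree of free counts stored in an array. It is not the Solaris or Linux implementation; `FD_CAP`, `struct fd_tree` and the function names are hypothetical, and the cost here is logarithmic in the table capacity rather than in the returned descriptor.

```c
#include <assert.h>
#include <stddef.h>

#define FD_CAP 1024   /* hypothetical table capacity, a power of two */

/* free_cnt[1] is the root; node i has children 2i and 2i+1. Leaves
 * FD_CAP..2*FD_CAP-1 stand for descriptors 0..FD_CAP-1 and hold 1 when
 * the descriptor is free, 0 when it is in use. */
struct fd_tree {
    unsigned free_cnt[2 * FD_CAP];
};

static void fd_tree_init(struct fd_tree *t)
{
    for (size_t i = FD_CAP; i < 2 * FD_CAP; i++)
        t->free_cnt[i] = 1;                       /* everything free */
    for (size_t i = FD_CAP - 1; i >= 1; i--)
        t->free_cnt[i] = t->free_cnt[2 * i] + t->free_cnt[2 * i + 1];
}

/* Mark fd as used (used != 0) or free, updating every ancestor. */
static void fd_tree_set(struct fd_tree *t, int fd, int used)
{
    assert(fd >= 0 && fd < FD_CAP);
    size_t i = FD_CAP + (size_t)fd;
    if (t->free_cnt[i] == (used ? 0u : 1u))
        return;                                   /* already in that state */
    for (; i >= 1; i /= 2) {
        if (used)
            t->free_cnt[i]--;
        else
            t->free_cnt[i]++;
    }
}

/* Return the lowest free descriptor and mark it used, or -1 if full. */
static int fd_tree_alloc(struct fd_tree *t)
{
    if (t->free_cnt[1] == 0)
        return -1;
    size_t i = 1;
    while (i < FD_CAP)                            /* prefer the left child */
        i = t->free_cnt[2 * i] ? 2 * i : 2 * i + 1;
    int fd = (int)(i - FD_CAP);
    fd_tree_set(t, fd, 1);
    return fd;
}
```

Each level of the descent reads a different part of the array, so on a very large table the walk can touch one cache line per level, which is the cost the O(log(n)) cache-miss remark points at.
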
Uncontended locks aren't expensive.  And all is done on a single cache 
line.

One question about the Linux implementation: what happens when a socket in 
select is closed?  I'm assuming that the kernel waits until "shutdown" is 
given or when a connection comes in?

Is it a problem that you can "hide" your listening socket with a thread in 
accept()?  I would think so (It would be visible in netstat but you can't 
easily find out who has it)
Again, netstat -p on a server with 10,000,000 sockets never completes.

Never try this unless you are desperate and perhaps want to avoid a
reboot.

If you absolutely want to nuke a listener because of untrusted
applications, we had better implement a proper syscall.

Android has such a facility.

Alternative would be to extend netlink (ss command from iproute2
package) to carry one pid per socket.

ss -atnp state listening

-> would not have to readlink (/proc/*/fd/*)