Re: [PATCH] /proc/net/tcp, overhead removed

From: Yakov Lerner <hidden>
Date: 2009-09-29 07:43:08

On Tue, Sep 29, 2009 at 02:24, Stephen Hemminger [off-list ref] wrote:

On Tue, 29 Sep 2009 00:20:07 +0200
Eric Dumazet [off-list ref] wrote:

quoted

Yakov Lerner a écrit :

quoted

On Sun, Sep 27, 2009 at 12:53, Eric Dumazet [off-list ref] wrote:

quoted

Yakov Lerner a écrit :

quoted

/proc/net/tcp does 20,000 sockets in 60-80 milliseconds, with this patch.

The overhead was in tcp_seq_start(). See analysis (3) below.
The patch is against Linus git tree (1). The patch is small.

------------  -----------   ------------------------------------
Before patch  After patch   20,000 sockets (10,000 tw + 10,000 estab)(2)
------------  -----------   ------------------------------------
6 sec          0.06 sec     dd bs=1k if=/proc/net/tcp >/dev/null
1.5 sec        0.06 sec     dd bs=4k if=/proc/net/tcp >/dev/null

1.9 sec        0.16 sec     netstat -4ant >/dev/null
------------  -----------   ------------------------------------

This is ~ x25 improvement.
The new time is not dependent on read blockize.
Speed of netstat, naturally, improves, too; both -4 and -6.
/proc/net/tcp6 does 20,000 sockets in 100 millisec.

(1) against git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

(2) Used 'manysock' utility to stress system with large number of sockets:
  "manysock 10000 10000"    - 10,000 tw + 10,000 estab ip4 sockets.
  "manysock -6 10000 10000" - 10,000 tw + 10,000 estab ip6 sockets.
Found at http://ilerner.3b1.org/manysock/manysock.c

(3) Algorithmic analysis.
    Old algorithm.

During 'cat </proc/net/tcp', tcp_seq_start() is called O(numsockets) times (4).
On average, every call to tcp_seq_start() scans half the whole hashtable. Ouch.
This is O(numsockets * hashsize). 95-99% of 'cat </proc/net/tcp' is spent in
tcp_seq_start()->tcp_get_idx. This overhead is eliminated by new algorithm,
which is O(numsockets + hashsize).

    New algorithm.

New algorithms is O(numsockets + hashsize). We jump to the right
hash bucket in tcp_seq_start(), without scanning half the hash.
To jump right to the hash bucket corresponding to *pos in tcp_seq_start(),
we reuse three pieces of state (st->num, st->bucket, st->sbucket)
as follows:
 - we check that requested pos >= last seen pos (st->num), the typical case.
 - if so, we jump to bucket st->bucket
 - to arrive to the right item after beginning of st->bucket, we
keep in st->sbucket the position corresponding to the beginning of
bucket.

(4) Explanation of O( numsockets * hashsize) of old algorithm.

tcp_seq_start() is called once for every ~7 lines of netstat output
if readsize is 1kb, or once for every ~28 lines if readsize >= 4kb.
Since record length of /proc/net/tcp records is 150 bytes, formula for
number of calls to tcp_seq_start() is
            (numsockets * 150 / min(4096,readsize)).
Netstat uses 4kb readsize (newer versions), or 1kb (older versions).
Note that speed of old algorithm does not improve above 4kb blocksize.

Speed of the new algorithm does not depend on blocksize.

Speed of the new algorithm does not perceptibly depend on hashsize (which
depends on ramsize). Speed of old algorithm drops with bigger hashsize.

(5) Reporting order.

Reporting order is exactly same as before if hash does not change underfoot.
When hash elements come and go during report, reporting order will be
same as that of tcpdiag.

Signed-off-by: Yakov Lerner <redacted>

Does the netlink interface used by ss command have the problem?

No. It's  /proc/net/tcp that has fixable problem.

Yakov

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help