Re: [PATCH v3 net-next 3/5] tcp: Access &tcp_hashinfo via net.

From: Kuniyuki Iwashima <hidden>
Date: 2022-09-03 01:13:15

From:   Eric Dumazet <edumazet@google.com>
Date:   Fri, 2 Sep 2022 17:53:18 -0700

On Fri, Sep 2, 2022 at 5:44 PM Kuniyuki Iwashima [off-list ref] wrote:


From:   Kuniyuki Iwashima <redacted>
Date:   Thu, 1 Sep 2022 15:12:16 -0700


From:   Eric Dumazet <edumazet@google.com>
Date:   Thu, 1 Sep 2022 14:30:43 -0700


On Thu, Sep 1, 2022 at 2:25 PM Kuniyuki Iwashima [off-list ref] wrote:


From:   Paolo Abeni <pabeni@redhat.com>


/Me is thinking aloud...

I'm wondering if the above has some measurable negative effect for
large deployments using only the main netns?

Specifically, are net->ipv4.tcp_death_row and
net->ipv4.tcp_death_row->hashinfo already in the working set data for
established sockets?

Would the above increase the WSS by 2 cache-lines?

Currently, the death_row and hashinfo are touched around tw sockets or
connect().  If connections on the deployment are short-lived or frequently
initiated by itself, they would be hot and included in WSS.

If the workload is server and there's no active-close() socket or
connections are long-lived, then it might not be included in WSS.
But I think that is less likely than the former if the deployment is
large enough.

If this change had a large impact, we could revert fbb8295248e1,
which converted net->ipv4.tcp_death_row into a pointer.  That was done
for 0dad4087a86a, which tried to fire a TW timer after the netns is
freed, but 0dad4087a86a has already been reverted.


Concern was fast path.

Each incoming packet does a socket lookup.

Fetching hashinfo (instead of &tcp_hashinfo) with a dereference of a
field in 'struct net' might incur a new cache line miss.

Previously, first cache line of tcp_info was enough to bring a lot of
fields in cpu cache.
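
To make the concern concrete, here is a minimal userspace sketch of the two
access patterns.  The structures are hypothetical, heavily trimmed stand-ins
for the kernel ones (only the fields on the pointer chain are kept), so this
shows the shape of the dependency chain and nothing about real timings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: the real kernel structures have many more fields. */
struct inet_hashinfo { void *placeholder; };

struct inet_timewait_death_row { struct inet_hashinfo *hashinfo; };

struct netns_ipv4 { struct inet_timewait_death_row *tcp_death_row; };

struct net { struct netns_ipv4 ipv4; };

/* Global table: its address is known at link time. */
static struct inet_hashinfo tcp_hashinfo;

/* Before the patch: no memory load is needed to obtain the table address. */
static struct inet_hashinfo *hashinfo_global(void)
{
	return &tcp_hashinfo;
}

/*
 * After the patch: two dependent loads.  The first reads
 * net->ipv4.tcp_death_row, the second reads ->hashinfo from the
 * death row, and each may touch a different cache line.
 */
static struct inet_hashinfo *hashinfo_via_net(const struct net *net)
{
	return net->ipv4.tcp_death_row->hashinfo;
}
```

Both helpers return the same table when the netns points at the global one;
the difference is only in how many loads sit in front of the lookup.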

OK, let me test whether there are any regressions.

I tested tcp_hashinfo vs tcp_death_row->hashinfo with super_netperf
and collected HW cache-related metrics with perf.

After the patch, the number of L1 misses seems to increase, but the
instructions per cycle also increase, and the cache miss rate did not
change.  Also, there was no performance regression for netperf.


Tested:

# cat perf_super_netperf
echo 0 > /proc/sys/kernel/nmi_watchdog
echo 3 > /proc/sys/vm/drop_caches

perf stat -a \
     -e cycles,instructions,cache-references,cache-misses,bus-cycles \
     -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores \
     -e dTLB-loads,dTLB-load-misses \
     -e LLC-loads,LLC-load-misses,LLC-stores \
     ./super_netperf $(($(nproc) * 2)) -H 10.0.0.142 -l 60 -fM

echo 1 > /proc/sys/kernel/nmi_watchdog


Before:

# ./perf_super_netperf
2929.81

 Performance counter stats for 'system wide':

   494,002,600,338      cycles                                                        (23.07%)
   241,230,662,890      instructions              #    0.49  insn per cycle           (30.76%)
     6,303,603,008      cache-references                                              (38.45%)
     1,421,440,332      cache-misses              #   22.550 % of all cache refs      (46.15%)
     4,861,179,308      bus-cycles                                                    (46.15%)
    65,410,735,599      L1-dcache-loads                                               (46.15%)
    12,647,247,339      L1-dcache-load-misses     #   19.34% of all L1-dcache accesses  (30.77%)
    32,912,656,369      L1-dcache-stores                                              (30.77%)
    66,015,779,361      dTLB-loads                                                    (30.77%)
        81,293,994      dTLB-load-misses          #    0.12% of all dTLB cache accesses  (30.77%)
     2,946,386,949      LLC-loads                                                     (30.77%)
       257,223,942      LLC-load-misses           #    8.73% of all LL-cache accesses  (30.77%)
     1,183,820,461      LLC-stores                                                    (15.38%)

      62.132250590 seconds time elapsed
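
For reference, the derived figures in the perf output above are plain ratios
of the raw counters.  A small sketch that recomputes them from the "Before"
numbers follows; counter_ratio() is a helper name made up for this example.
Note that the counters were multiplexed (the percentages in parentheses), so
the raw counts are themselves scaled estimates.

```c
#include <assert.h>

/* Counter values copied from the "Before" run above. */
static const unsigned long long cycles          = 494002600338ULL;
static const unsigned long long instructions    = 241230662890ULL;
static const unsigned long long cache_refs      = 6303603008ULL;
static const unsigned long long cache_misses    = 1421440332ULL;
static const unsigned long long l1d_loads       = 65410735599ULL;
static const unsigned long long l1d_load_misses = 12647247339ULL;

/* Ratio of two counters; returns 0.0 when the denominator is zero. */
static double counter_ratio(unsigned long long num, unsigned long long den)
{
	return den ? (double)num / (double)den : 0.0;
}

/* Instructions per cycle, reported by perf as "insn per cycle". */
static double insn_per_cycle(void)
{
	return counter_ratio(instructions, cycles);
}

/* Fraction of cache references that missed. */
static double cache_miss_rate(void)
{
	return counter_ratio(cache_misses, cache_refs);
}

/* Fraction of L1 data cache loads that missed. */
static double l1d_miss_rate(void)
{
	return counter_ratio(l1d_load_misses, l1d_loads);
}
```

Running the same helpers on a second run's counters gives the before/after
comparison of IPC and miss rates described in the message.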

This test will not be able to see a difference really...

What is needed is to measure the latency when nothing at all is in the caches.

Vast majority of real world TCP traffic is light or moderate.
Packets are received and cpu has to bring X cache lines into L1 in
order to process one packet.

We slowly are increasing X over time :/

pahole is your friend, more than a stress-test.

Here's the pahole result on my local build.  As Paolo said, we
need 2 cachelines for tcp_death_row and the hashinfo?

How about moving hashinfo to be the first member of struct
inet_timewait_death_row and converting it from a pointer to an
embedded struct, so that we need only 1 cache line to read
hashinfo?

$ pahole -C netns_ipv4,inet_timewait_death_row vmlinux
struct netns_ipv4 {
	struct inet_timewait_death_row * tcp_death_row;  /*     0     8 */
	struct ctl_table_header *  forw_hdr;             /*     8     8 */
	struct ctl_table_header *  frags_hdr;            /*    16     8 */
	struct ctl_table_header *  ipv4_hdr;             /*    24     8 */
	struct ctl_table_header *  route_hdr;            /*    32     8 */
	struct ctl_table_header *  xfrm4_hdr;            /*    40     8 */
	struct ipv4_devconf *      devconf_all;          /*    48     8 */
	struct ipv4_devconf *      devconf_dflt;         /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	...
};
struct inet_timewait_death_row {
	refcount_t                 tw_refcount;          /*     0     4 */

	/* XXX 60 bytes hole, try to pack */

	/* --- cacheline 1 boundary (64 bytes) --- */
	struct inet_hashinfo *     hashinfo __attribute__((__aligned__(64))); /*    64     8 */
	int                        sysctl_max_tw_buckets; /*    72     4 */

	/* size: 128, cachelines: 2, members: 3 */
	/* sum members: 16, holes: 1, sum holes: 60 */
	/* padding: 52 */
	/* forced alignments: 1, forced holes: 1, sum forced holes: 60 */
} __attribute__((__aligned__(64)));
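
The layout question above can be checked in userspace with offsetof().  The
sketch below models the current layout from the pahole output and one
hypothetical reordering with hashinfo embedded as the first member.  The
struct names, the int stand-in for refcount_t, and the one-field
inet_hashinfo stub are all assumptions for illustration; this is not the
actual patch.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64	/* assumption: 64-byte lines, as in the pahole output */

/* Hypothetical stub; the real struct inet_hashinfo is much larger. */
struct inet_hashinfo { void *placeholder; };

/* Model of the current layout: hashinfo is a pointer on its own line. */
struct death_row_current {
	int tw_refcount;	/* stand-in for refcount_t (4 bytes) */
	struct inet_hashinfo *hashinfo __attribute__((__aligned__(CACHELINE)));
	int sysctl_max_tw_buckets;
} __attribute__((__aligned__(CACHELINE)));

/*
 * Hypothetical reordering: hashinfo embedded at offset 0, with the
 * frequently written refcount still pushed onto a separate line.
 */
struct death_row_reordered {
	struct inet_hashinfo hashinfo;
	int sysctl_max_tw_buckets;
	int tw_refcount __attribute__((__aligned__(CACHELINE)));
} __attribute__((__aligned__(CACHELINE)));

/* Index of the cache line, relative to the struct start, holding a byte offset. */
static size_t cacheline_of(size_t offset)
{
	return offset / CACHELINE;
}
```

With these definitions, cacheline_of(offsetof(struct death_row_current,
hashinfo)) is 1, matching the pahole output, while the reordered variant
places hashinfo on line 0 of the structure.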
