Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more... | netdev

Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage)

From: Eric Dumazet <hidden>
Date: 2016-01-28 12:54:25

On Thu, 2016-01-28 at 10:52 +0100, Jesper Dangaard Brouer wrote:

> I'm still in flux/undecided how long we should delay the first touching
> of pkt-data, which happens when calling eth_type_trans().  Should it
> stay in the driver or not(?).
>
> In the extreme case, to optimize for RPS sending to remote CPUs, delay
> calling eth_type_trans() as long as possible.
>
> 1. In the driver, only start prefetching data to L2/L3 cache
> 2. Stack calls get_rps_cpu() and assumes skb_get_hash() has the HW hash
> 3. (Bulk) enqueue on remote_cpu->sd->input_pkt_queue
> 4. On remote CPU in process_backlog call eth_type_trans() on sd->input_pkt_queue
>
> On the other hand, if the HW desc can provide skb->proto, and we can
> lazy eval skb->pkt_type, then it is okay to keep that responsibility in
> the driver (as the call to eth_type_trans() basically disappears).
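
The four steps in the quoted proposal can be sketched as a small userspace
model. This is only an illustration under stated assumptions, not kernel
code: struct pkt, struct backlog, driver_rx(), pick_cpu(), enqueue_remote()
and process_backlog_model() are hypothetical names, the CPU choice is a plain
modulo on the hardware hash (the real get_rps_cpu() is more involved), and
the frame is assumed to be untagged Ethernet.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS 4   /* hypothetical CPU count for this model */
#define QLEN   64   /* hypothetical per-CPU backlog depth */

/* Hypothetical, heavily reduced stand-in for struct sk_buff. */
struct pkt {
    const uint8_t *data;  /* points at the Ethernet header */
    uint32_t hw_hash;     /* flow hash taken from the RX descriptor */
    uint16_t proto;       /* EtherType, 0 until the header is parsed */
    int data_touches;     /* how often packet data has been read */
};

/* Models sd->input_pkt_queue of one CPU. */
struct backlog {
    struct pkt *q[QLEN];
    size_t len;
};

static struct backlog backlogs[NR_CPUS];

/* Step 1: the driver records the descriptor hash and only prefetches. */
static void driver_rx(struct pkt *p, const uint8_t *frame, uint32_t hw_hash)
{
    p->data = frame;
    p->hw_hash = hw_hash;
    p->proto = 0;
    p->data_touches = 0;
    __builtin_prefetch(frame);  /* GCC/Clang builtin; does not read the data */
}

/* Step 2: choose the target CPU from the hardware hash alone. */
static unsigned int pick_cpu(const struct pkt *p)
{
    return p->hw_hash % NR_CPUS;
}

/* Step 3: enqueue on the remote CPU's backlog; -1 if the queue is full. */
static int enqueue_remote(unsigned int cpu, struct pkt *p)
{
    struct backlog *b = &backlogs[cpu];

    if (b->len == QLEN)
        return -1;
    b->q[b->len++] = p;
    return 0;
}

/* Step 4: the remote CPU is the first to read the Ethernet header. */
static size_t process_backlog_model(unsigned int cpu)
{
    struct backlog *b = &backlogs[cpu];
    size_t i, n = b->len;

    for (i = 0; i < n; i++) {
        struct pkt *p = b->q[i];

        /* EtherType sits at bytes 12..13 of an untagged Ethernet frame. */
        p->proto = (uint16_t)((p->data[12] << 8) | p->data[13]);
        p->data_touches++;
    }
    b->len = 0;
    return n;
}
```

In this model the receiving CPU never reads packet bytes; the only read of
the header happens in process_backlog_model() on the CPU that will go on to
process the packet.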


Delaying means GRO won't be able to recycle its super hot skb (see
napi_get_frags()).

You might optimize the reception of packets in the router case (poor GRO
aggregation rate), but you'll slow down GRO efficiency when receiving
nice GRO trains.

When we receive a train of 10 MSS, the driver keeps using the same sk_buff,
very hot in its L1.

(This was the original idea of build_skb() to get nice cache locality
for the metadata, since it is 4 cache lines per sk_buff)

Now most drivers have no clue why it is important to allocate the skb
_after_ receiving the ethernet frame and not in advance.

(The lazy drivers allocate ~1024 skbs to prefill their ~1024 slot RX
ring)
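
To put rough numbers on the cache-locality argument, here is a toy userspace
model, loosely inspired by the napi_get_frags() recycling mentioned above. It
is a sketch under assumptions, not the kernel's implementation: struct meta,
struct napi_model, get_frags_model(), rx_segment(), rx_train() and
metadata_footprint() are hypothetical names, cache lines are assumed to be
64 bytes, and every segment after the first is assumed to merge into the
same GRO flow.

```c
#include <stddef.h>

#define CACHE_LINE   64   /* assumption: 64-byte cache lines */
#define SKB_LINES     4   /* from the message: 4 cache lines per sk_buff */
#define RING_SLOTS 1024   /* from the message: ~1024 slot RX ring */

/* Hypothetical stand-in for sk_buff metadata. */
struct meta {
    int used;
};

struct napi_model {
    struct meta *cached;    /* metadata recycled between segments */
    struct meta *gro_head;  /* metadata held by GRO for the current flow */
    struct meta pool[2];    /* this model never needs more than two */
    size_t allocs;          /* distinct metadata objects handed out */
};

/* Return the cached metadata, allocating only when none is cached. */
static struct meta *get_frags_model(struct napi_model *n)
{
    if (!n->cached) {
        if (n->allocs == 2)
            return NULL;    /* pool exhausted; not reached in this model */
        n->cached = &n->pool[n->allocs++];
    }
    return n->cached;
}

/* Receive one segment of a train that belongs to a single flow. */
static void rx_segment(struct napi_model *n)
{
    struct meta *m = get_frags_model(n);

    if (!m)
        return;
    m->used = 1;
    if (!n->gro_head) {
        /* First segment: GRO keeps this metadata as the flow head. */
        n->gro_head = m;
        n->cached = NULL;
    }
    /* Later segments merge into gro_head; m stays cached for reuse. */
}

static void rx_train(struct napi_model *n, size_t segments)
{
    size_t i;

    for (i = 0; i < segments; i++)
        rx_segment(n);
}

/* Bytes of metadata touched when this many distinct objects are in use. */
static size_t metadata_footprint(size_t distinct)
{
    return distinct * SKB_LINES * CACHE_LINE;
}
```

Under these assumptions a train of 10 segments touches two metadata objects,
i.e. 2 * 4 * 64 = 512 bytes, whereas a ring prefilled with 1024 skbs cycles
through 1024 * 4 * 64 = 262144 bytes (256 KiB) of metadata.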
