Thread: 59 messages, 12 authors, 2016-02-02

Re: Optimizing instruction-cache, more packets at each stage

From: Tom Herbert <hidden>
Date: 2016-01-24 20:09:02

On Sun, Jan 24, 2016 at 6:28 AM, Jesper Dangaard Brouer
[off-list ref] wrote:
On Thu, 21 Jan 2016 10:54:01 -0800 (PST)
David Miller [off-list ref] wrote:
From: Jesper Dangaard Brouer <redacted>
Date: Thu, 21 Jan 2016 12:27:30 +0100
eth_type_trans() does two things:

1) determine skb->protocol
2) set up skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST}
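
As a rough userspace illustration of the two steps listed above (not the
kernel's actual eth_type_trans()), the sketch below classifies a frame from
its 14-byte Ethernet header. The names sketch_eth_classify, struct
sketch_result and the SKETCH_* constants are made up for this example, and
802.3 length-field frames (type field below 0x0600) are simply reported as
protocol 0 rather than parsed further.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's PACKET_* packet types. */
enum sketch_pkt_type {
    SKETCH_HOST,
    SKETCH_BROADCAST,
    SKETCH_MULTICAST,
    SKETCH_OTHERHOST
};

struct sketch_result {
    uint16_t protocol;              /* EtherType in host order, 0 if not an EtherType */
    enum sketch_pkt_type pkt_type;
};

/*
 * frame:    at least 14 bytes (destination MAC, source MAC, type field)
 * dev_addr: the receiving device's 6-byte MAC address
 */
static struct sketch_result sketch_eth_classify(const uint8_t *frame,
                                                const uint8_t *dev_addr)
{
    static const uint8_t bcast[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
    struct sketch_result r;
    uint16_t type = (uint16_t)((frame[12] << 8) | frame[13]);

    /* Step 1: protocol. Values below 0x0600 are 802.3 lengths, not EtherTypes. */
    r.protocol = (type >= 0x0600) ? type : 0;

    /* Step 2: packet type, derived from the destination MAC only. */
    if (frame[0] & 0x01) {
        /* Group bit set: broadcast is the all-ones address, anything else is multicast. */
        r.pkt_type = (memcmp(frame, bcast, 6) == 0) ? SKETCH_BROADCAST
                                                    : SKETCH_MULTICAST;
    } else if (memcmp(frame, dev_addr, 6) == 0) {
        r.pkt_type = SKETCH_HOST;
    } else {
        r.pkt_type = SKETCH_OTHERHOST;
    }
    return r;
}
```

Both steps read the first cache line of the packet data, which is the cost
being discussed in this thread.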

Could the HW descriptor deliver the "proto", or perhaps just some bits
for the most common protos?

The skb->pkt_type doesn't need many bits.  And I bet the HW already has
the information.  The BROADCAST and MULTICAST indications are easy.  The
PACKET_OTHERHOST case can be turned around by instead setting a PACKET_HOST
indication if eth->h_dest matches the device's dev->dev_addr (else a
SW compare is required).

Is that doable in hardware?
I feel like we've had this discussion before several years ago.

I think having just the protocol value would be enough.

skb->pkt_type we could deal with by using always an accessor and
evaluating it lazily.  Nothing needs it until we hit ip_rcv() or
similar.
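
A minimal userspace sketch of what such a lazily evaluated accessor could
look like. This is an illustration only: struct lazy_pkt, lazy_pkt_type()
and the LAZY_* constants are hypothetical names, not kernel API, and the
real sk_buff layout is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's PACKET_* packet types. */
enum { LAZY_HOST, LAZY_BROADCAST, LAZY_MULTICAST, LAZY_OTHERHOST };

struct lazy_pkt {
    const uint8_t *dst_mac;   /* destination MAC inside the packet data */
    const uint8_t *dev_addr;  /* receiving device's MAC address */
    bool type_known;          /* false until the type has been computed or preset */
    uint8_t pkt_type;         /* only meaningful when type_known is true */
};

/* Accessor: touches the packet data only on the first call, then returns the cached value. */
static uint8_t lazy_pkt_type(struct lazy_pkt *p)
{
    static const uint8_t bcast[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

    if (!p->type_known) {
        if (p->dst_mac[0] & 0x01)
            p->pkt_type = (memcmp(p->dst_mac, bcast, 6) == 0) ? LAZY_BROADCAST
                                                              : LAZY_MULTICAST;
        else if (memcmp(p->dst_mac, p->dev_addr, 6) == 0)
            p->pkt_type = LAZY_HOST;
        else
            p->pkt_type = LAZY_OTHERHOST;
        p->type_known = true;
    }
    return p->pkt_type;
}
```

If the HW descriptor already carries the information, a driver could fill in
pkt_type and set type_known up front, and the accessor would then never read
the packet data.
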
At first I liked the idea of delaying the evaluation of skb->pkt_type.

BUT then I realized, what if we take this even further.  What if we
actually use this information, for something useful, at this very
early RX stage.

The information I'm interested in, from the HW descriptor, is if this
packet is NOT for local delivery.  If so, we can send the packet on a
"fast-forward" code path.

Think about bridging packets to a guest OS.  Because we know this very
early at RX (from the packet's HW descriptor), we might even avoid
allocating an SKB.  We could just "forward" the packet-page to the guest OS.

Taking Eric's idea, of remote CPUs, we could even send these
packet-pages to a remote CPU (e.g. where the guest OS is running),
without having touched a single cache-line in the packet-data.  I
would still bundle them up first, to amortize the (100-133ns) cost of
transferring something to another CPU.
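
To put rough numbers on the amortization: the fixed cross-CPU transfer cost
is shared by every packet in a bundle. The helper below is only arithmetic;
the function name and the bundle size of 8 used in the note are assumptions
for illustration, while the 100-133ns range is the figure quoted above.

```c
/* Per-packet share of a fixed transfer cost when packets are handed over in bundles. */
static double per_packet_cost_ns(double transfer_cost_ns, unsigned int bundle_size)
{
    if (bundle_size == 0)
        bundle_size = 1;   /* treat an empty bundle as a single transfer */
    return transfer_cost_ns / (double)bundle_size;
}
```

With a bundle of 8 packets, the 100-133ns transfer cost works out to roughly
12.5-16.6ns per packet.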

You mean like RPS/RFS/aRFS/flow_director already do (except for the
zero-touch part)?

The data-cache trick would be to instruct the prefetcher to prefetch only
into L3 or L2 when these packets are destined for a remote
CPU.  At least Intel CPUs have prefetch operations that target only the
L2/L3 cache.
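
A sketch of how that choice might be expressed with the GCC/Clang
__builtin_prefetch() builtin, whose third argument is a temporal-locality
hint from 0 to 3. How a hint maps onto cache levels is up to the compiler
and the CPU; the assumption here is that a lower hint keeps the data out of
the innermost cache on x86. The names pf_for_dest() and enum pf_dest are
hypothetical.

```c
/* Where the packet is expected to be consumed (hypothetical classification). */
enum pf_dest { PF_LOCAL_CPU, PF_REMOTE_CPU };

/*
 * Issue a read prefetch for pkt_data and return the locality hint that was used.
 * The hint must be a compile-time constant, hence the two separate calls.
 */
static int pf_for_dest(const void *pkt_data, enum pf_dest dest)
{
    if (dest == PF_REMOTE_CPU) {
        /* Low temporal locality: this CPU will not touch the data itself. */
        __builtin_prefetch(pkt_data, 0, 1);
        return 1;
    }
    /* High temporal locality: the data is about to be used on this CPU. */
    __builtin_prefetch(pkt_data, 0, 3);
    return 3;
}
```

A prefetch is only a hint, so the code stays correct even if the CPU ignores
it; whether it helps would have to be measured.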


Maybe we need a combined solution.  Lazily evaluate skb->pkt_type for
local delivery, but set the information if it is available from the HW
descriptor.  And the fast page-forward doesn't even need an SKB.

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer