Thread: 59 messages, 12 authors, 2016-02-02

Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage)

From: Jesper Dangaard Brouer <hidden>
Date: 2016-01-25 22:10:25

On Mon, 25 Jan 2016 09:50:16 -0800 John Fastabend [off-list ref] wrote:
On 16-01-25 09:09 AM, Tom Herbert wrote:
On Mon, Jan 25, 2016 at 5:15 AM, Jesper Dangaard Brouer
[off-list ref] wrote:  
[...]
There are two ideas getting mixed up here: (1) bundling from the
RX-ring, and (2) allowing the "packet-page" to be picked up directly.

Bundling (1) is something that seems natural, and which helps us
amortize the cost between layers (and utilizes the icache better).
Let's keep that in another thread.

This (2) direct forwarding of "packet-pages" is a fairly extreme idea,
BUT it has the potential of being a new integration point for
"selective" bypass-solutions and of bringing RAW/af_packet (RX) up to
speed with bypass-solutions.
[...]
Jesper, at least for your (2) case, what are we missing with the
bifurcated/queue splitting work? Are you really after systems
without SR-IOV support, or are you trying to get this on the order
of queues instead of VFs?
I'm not saying something is missing for bifurcated/queue splitting work.
I'm not trying to work-around SR-IOV.

This is an extreme idea, which I got while looking at the lowest RX layer.


Before working any further on this idea/path, I need/want to evaluate
if it makes sense from a performance point of view.  I need to evaluate
if "pulling" out these "packet-pages" is fast enough to compete with
DPDK/netmap.  Else it makes no sense to work on this path.

As a first step to evaluate this lowest RX layer, I'm simply hacking
the drivers (ixgbe and mlx5) to drop/discard packets within-the-driver.
For now, simply replacing napi_gro_receive() with dev_kfree_skb(), and
measuring the "RX-drop" performance.
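
The experiment above can be pictured with a small userspace mock. This
is only a sketch under stated assumptions, not driver code: the struct,
the mock_* functions and the counters are hypothetical stand-ins that
mimic the two call sites (handing the skb to the stack versus freeing it
inside the driver), so the difference between the paths is visible
without a kernel build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in, NOT the kernel's struct sk_buff. */
struct sk_buff {
	size_t len;
};

static unsigned long stack_delivered;	/* packets handed to the mock stack */
static unsigned long driver_dropped;	/* packets freed inside the mock driver */

/* Stand-in for the normal path, where the driver calls napi_gro_receive(). */
static void mock_napi_gro_receive(struct sk_buff *skb)
{
	stack_delivered++;
	free(skb);	/* the mock stack consumes the packet immediately */
}

/* Stand-in for the hacked path, where the driver calls dev_kfree_skb(). */
static void mock_dev_kfree_skb(struct sk_buff *skb)
{
	driver_dropped++;
	free(skb);
}

/*
 * Mock RX poll loop: handles up to 'budget' packets and returns how many
 * were processed. With rx_drop != 0 every packet is discarded in the
 * driver instead of being passed up.
 */
static int mock_rx_poll(int budget, int rx_drop)
{
	int done = 0;

	assert(budget >= 0);
	while (done < budget) {
		struct sk_buff *skb = malloc(sizeof(*skb));

		if (!skb)
			break;
		skb->len = 64;	/* pretend a minimum-size frame arrived */

		if (rx_drop)
			mock_dev_kfree_skb(skb);
		else
			mock_napi_gro_receive(skb);
		done++;
	}
	return done;
}
```

Calling mock_rx_poll(64, 1) leaves stack_delivered untouched and bumps
driver_dropped by 64, which is the "RX-drop" number being measured. Note
that the skb is still allocated and freed on both paths.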

The next step was to avoid the skb alloc+free calls, but doing so is
more complicated than I first anticipated, as the SKB is tied in fairly
heavily.  Thus, right now I'm instead hooking in my bulk alloc+free
API, as that will remove/mitigate most of the overhead of the
kmem_cache/slab-allocators.
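
To illustrate why bulking helps, here is a minimal userspace sketch of a
bulk alloc+free interface. It is an assumption-laden toy, not the
kernel's slab code: struct obj_cache and the cache_* functions are
hypothetical names, and malloc()/free() stand in for the slow path. The
point it shows is that one call moves N objects, so the per-call
overhead is paid once per bundle instead of once per packet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CACHE_CAPACITY 256

/* Hypothetical fixed-size object cache: a stack of free object pointers. */
struct obj_cache {
	size_t obj_size;
	size_t nr_free;
	void *free_objs[CACHE_CAPACITY];
	unsigned long slow_path_calls;	/* times we fell back to malloc/free */
};

static void cache_init(struct obj_cache *c, size_t obj_size)
{
	c->obj_size = obj_size;
	c->nr_free = 0;
	c->slow_path_calls = 0;
}

/* Return nr objects in one call; overflow beyond capacity goes to free(). */
static void cache_free_bulk(struct obj_cache *c, size_t nr, void **objs)
{
	for (size_t i = 0; i < nr; i++) {
		if (c->nr_free < CACHE_CAPACITY) {
			c->free_objs[c->nr_free++] = objs[i];
		} else {
			free(objs[i]);
			c->slow_path_calls++;
		}
	}
}

/* Fill objs[0..nr-1] in one call; returns nr on success, 0 on failure. */
static size_t cache_alloc_bulk(struct obj_cache *c, size_t nr, void **objs)
{
	size_t i = 0;

	/* Fast path: pop already-cached objects. */
	while (i < nr && c->nr_free > 0)
		objs[i++] = c->free_objs[--c->nr_free];

	/* Slow path: get the remainder from the underlying allocator. */
	for (; i < nr; i++) {
		objs[i] = malloc(c->obj_size);
		if (!objs[i]) {
			/* All-or-nothing: hand back what we already took. */
			cache_free_bulk(c, i, objs);
			return 0;
		}
		c->slow_path_calls++;
	}
	return nr;
}

/* Release everything still held by the cache. */
static void cache_destroy(struct obj_cache *c)
{
	while (c->nr_free > 0)
		free(c->free_objs[--c->nr_free]);
}
```

In an RX loop the idea would be to call cache_alloc_bulk() once per
bundle of packets and cache_free_bulk() once when the bundle is done;
after the first round the slow path is no longer hit as long as the
bundle fits in the cache.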

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer