Re: [RFC PATCH 1/2] af_packet: direct dma for packet ineterface

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: 2017-01-31 01:38:50

quoted

V3 header formats added bulk polling via socket calls and timers
used in the polling interface to return every n milliseconds. Currently,
I don't see any way to support this in hardware because we can't
know if the hardware is in the middle of a DMA operation or not
on a slot. So when a timer fires I don't know how to advance the
descriptor ring leaving empty descriptors similar to how the software
ring works. The easiest (best?) route is to simply not support this.

From a performance pov bulking is essential. Systems like netmap, which
also depend on transferring control between kernel and userspace,
report [1] that they need a bulk size of at least 8 to amortize the
overhead.

To introduce interrupt moderation, ixgbe_do_ddma only has to elide the
sk_data_ready, and schedule an hrtimer if one is not scheduled yet.
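To make the idea concrete, here is a rough userspace model of that
moderation logic. It is only a sketch: struct ddma_moderation and the
ddma_* helpers are made-up names, time is simulated with plain
counters rather than a real hrtimer, and a counter stands in for the
sk_data_ready call. It is not code from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-socket moderation state; not taken from the patch. */
struct ddma_moderation {
    bool     timer_armed;  /* true while a wakeup timer is pending */
    uint64_t deadline_ns;  /* simulated time at which the timer fires */
    uint64_t interval_ns;  /* moderation interval */
    unsigned pending;      /* packets received since the last wakeup */
    unsigned wakeups;      /* stand-in for the number of sk_data_ready calls */
};

/* Per-packet receive path: skip the wakeup, arm the timer if idle. */
static void ddma_rx_packet(struct ddma_moderation *m, uint64_t now_ns)
{
    m->pending++;
    if (!m->timer_armed) {
        m->timer_armed = true;
        m->deadline_ns = now_ns + m->interval_ns;
    }
}

/* Timer callback: a single wakeup covers everything that accumulated. */
static void ddma_timer_fire(struct ddma_moderation *m)
{
    if (m->pending)
        m->wakeups++;
    m->pending = 0;
    m->timer_armed = false;
}

/* Advance simulated time and fire the timer once it is due. */
static void ddma_advance(struct ddma_moderation *m, uint64_t now_ns)
{
    if (m->timer_armed && now_ns >= m->deadline_ns)
        ddma_timer_fire(m);
}
```

With an interval of 1000 ns, eight packets arriving back to back
produce one wakeup instead of eight in this model.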

If I understand correctly, the difficulty lies in v3 requiring that the
current block be "closed" when the timer expires. That may not be worth
implementing, indeed.

Hardware interrupt moderation and napi may already give some
moderation, even with a sock_def_readable call for each packet. If
considering a v4 format, I'll again suggest virtio virtqueues. Those
have interrupt suppression built in with EVENT_IDX.
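For reference, the EVENT_IDX check is a small piece of wrap-safe
16-bit arithmetic. The sketch below follows the logic of the
vring_need_event() helper in the virtio ring header; the standalone
name need_event and the parameter names are mine, for illustration.

```c
#include <stdint.h>

/*
 * Return nonzero if the other side asked to be notified at an index
 * that was crossed while moving from old_idx to new_idx, that is, if
 * old_idx <= event_idx < new_idx in modulo-2^16 arithmetic.
 */
static int need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}
```

A consumer that wants to be woken only after a batch can publish an
event index several entries ahead; the producer then skips the
notification until that index is crossed.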

quoted

Likely, but I would like us to take a measurement-based approach.  Let's
benchmark with this V2 header format, and see how far we are from
target, and see what lights up in perf report and if it is something we
can address.

Yep I'm hoping to get to this sometime this week.

Perhaps also without filling in the optional metadata fields
in tpacket and sockaddr_ll.

quoted

E.g. how will you support XDP_TX?  AFAIK you cannot remove/detach a
packet with this solution (and place it on a TX queue and wait for DMA
TX completion).

This is something worth exploring. tpacket_v2 uses a fixed ring with
slots so all the pages are allocated and assigned to the ring at init
time. To xmit a packet in this case the user space application would
be required to leave the packet descriptor on the rx side pinned
until the tx side DMA has completed. Then it can unpin the rx side
and return it to the driver. This works if the TX/RX processing is
fast enough to keep up. For many things this is good enough.
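The pinning scheme described above can be modelled as a small
per-slot state machine. This is a sketch under assumptions: the three
states, the four-slot ring and all function names are hypothetical,
and real tpacket_v2 slots carry a status word rather than this enum.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical slot ownership states for a fixed rx ring. */
enum slot_state { SLOT_KERNEL, SLOT_USER, SLOT_TX_PINNED };

#define RING_SLOTS 4

struct fixed_ring {
    enum slot_state slot[RING_SLOTS];
    size_t rx_head;  /* next slot the driver would fill */
};

/* Driver side: fill the head slot, or fail if it is not kernel-owned. */
static bool ring_rx_fill(struct fixed_ring *r)
{
    if (r->slot[r->rx_head] != SLOT_KERNEL)
        return false;
    r->slot[r->rx_head] = SLOT_USER;
    r->rx_head = (r->rx_head + 1) % RING_SLOTS;
    return true;
}

/* User side: queue slot i for transmit; it stays pinned on the rx ring. */
static bool slot_forward(struct fixed_ring *r, size_t i)
{
    if (r->slot[i] != SLOT_USER)
        return false;
    r->slot[i] = SLOT_TX_PINNED;
    return true;
}

/* TX DMA completion: unpin slot i and return it to the driver. */
static bool slot_tx_complete(struct fixed_ring *r, size_t i)
{
    if (r->slot[i] != SLOT_TX_PINNED)
        return false;
    r->slot[i] = SLOT_KERNEL;
    return true;
}

/* User side: done with a packet that is not being forwarded. */
static bool slot_release(struct fixed_ring *r, size_t i)
{
    if (r->slot[i] != SLOT_USER)
        return false;
    r->slot[i] = SLOT_KERNEL;
    return true;
}
```

In this model a single pinned slot at the head stalls receive even
when every other slot has been released, which is the "fast enough to
keep up" condition.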

For some workloads though this may not be sufficient. In which
case a tpacket_v4 would be useful that can push down a new set
of "slots" every n packets, where n is sufficiently large to keep
the workload running.

Here, too, virtio rings may help.

The extra level of indirection allows out of order completions,
reducing the chance of running out of rx descriptors when redirecting
a subset of packets to a tx ring, as that does not block the entire ring.

And passing explicit descriptors from userspace enables pointing to
new memory regions. On the flip side, they now have to be checked for
safety against region bounds.
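A minimal sketch of such a bounds check, written so that it cannot
overflow. The user_desc and mem_region layouts and the function name
are assumptions for illustration, not an existing kernel interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor as passed in from userspace. */
struct user_desc {
    uint64_t addr;  /* start of the buffer */
    uint32_t len;   /* buffer length in bytes */
};

/* Hypothetical registered memory region. */
struct mem_region {
    uint64_t base;
    uint64_t size;
};

/*
 * Accept the descriptor only if [addr, addr + len) lies entirely
 * inside the region. The check works on offsets, so addr + len is
 * never computed and cannot wrap around.
 */
static bool desc_in_region(const struct user_desc *d,
                           const struct mem_region *r)
{
    uint64_t off;

    if (d->len == 0)
        return false;           /* choice made here: reject empty buffers */
    if (d->addr < r->base)
        return false;
    off = d->addr - r->base;
    if (off > r->size)
        return false;
    return d->len <= r->size - off;
}
```

Comparing len against the remaining space, instead of comparing
addr + len against the region end, is what keeps a descriptor near
the top of the address space from passing the check by wrapping.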

This is similar in many ways to virtio/vhost interaction.

Ah, I only saw this after writing the above :)
