Re: [PATCH net-next RFC 5/5] vhost_net: basic tx virtqueue batched processing
From: Jason Wang <jasowang@redhat.com>
Date: 2017-09-27 02:04:28
On 2017-09-27 03:25, Michael S. Tsirkin wrote:
> On Fri, Sep 22, 2017 at 04:02:35PM +0800, Jason Wang wrote:
>> This patch implements basic batched processing of tx virtqueue by
>> prefetching desc indices and updating used ring in a batch. For
>> non-zerocopy case, vq->heads were used for storing the prefetched
>> indices and updating used ring. It is also a requirement for doing
>> more batching on top. For zerocopy case and for simplicity, batched
>> processing was simply disabled by only fetching and processing one
>> descriptor at a time, this could be optimized in the future.
>>
>> XDP_DROP (without touching skb) on tun (with MoonGen in guest) with
>> zerocopy disabled:
>>
>> Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz:
>> Before: 3.20Mpps
>> After:  3.90Mpps (+22%)
>>
>> No differences were seen with zerocopy enabled.
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>
> So where is the speedup coming from? I'd guess the ring is hot in cache,
> it's faster to access it in one go, then pass many packets to net stack.
> Is that right?
>
> Another possibility is better code cache locality.

Yes, I think the speedup comes from:

- fewer cache misses
- less cache line bouncing when the virtqueue is about to be full (the
  guest is faster than the host, which is the case with MoonGen)
- fewer memory barriers
- possibly faster copies by using copy_to_user() on modern CPUs

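To make the barrier and used-ring points above concrete, here is a minimal userspace sketch. It is a toy model, not the vhost code: struct toy_vq, toy_barrier() and the other toy_* names are hypothetical, the barriers are only counted rather than executed, and the placement of barriers is an assumption made for illustration. It compares publishing the used index once per descriptor with prefetching a batch of heads and publishing once per batch.

```c
#include <stdint.h>

#define RING_SIZE 256 /* hypothetical ring size, power of two */

/* Toy stand-in for a virtqueue; NOT the kernel's struct vhost_virtqueue. */
struct toy_vq {
    uint16_t avail[RING_SIZE];  /* descriptor heads published by the guest */
    uint16_t avail_idx;         /* guest's producer index */
    uint16_t last_avail_idx;    /* host's consumer index */
    uint16_t used[RING_SIZE];   /* heads handed back to the guest */
    uint16_t used_idx;          /* shared index the guest polls */
    unsigned barriers;          /* number of simulated memory barriers */
    unsigned used_idx_writes;   /* number of writes to the shared used index */
};

/* Counts a barrier instead of issuing one, so the cost can be compared. */
static void toy_barrier(struct toy_vq *vq) { vq->barriers++; }

/* Guest side: make n descriptors available, with heads 0..n-1. */
static void toy_fill(struct toy_vq *vq, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        vq->avail[vq->avail_idx % RING_SIZE] = (uint16_t)i;
        vq->avail_idx++;
    }
}

/* Per-descriptor path: fetch one head, "send" it, publish it as used. */
static unsigned toy_tx_one_by_one(struct toy_vq *vq)
{
    unsigned sent = 0;

    while (vq->last_avail_idx != vq->avail_idx) {
        toy_barrier(vq); /* assumed read barrier before reading the ring */
        uint16_t head = vq->avail[vq->last_avail_idx++ % RING_SIZE];
        /* ... the packet described by head would be transmitted here ... */
        vq->used[vq->used_idx % RING_SIZE] = head;
        toy_barrier(vq); /* assumed write barrier before publishing */
        vq->used_idx++;
        vq->used_idx_writes++;
        sent++;
    }
    return sent;
}

/* Batched path: prefetch up to batch heads, then publish them together. */
static unsigned toy_tx_batched(struct toy_vq *vq, unsigned batch)
{
    uint16_t heads[RING_SIZE]; /* plays the role of vq->heads in the patch */
    unsigned sent = 0;

    if (batch == 0 || batch > RING_SIZE)
        batch = RING_SIZE;

    while (vq->last_avail_idx != vq->avail_idx) {
        unsigned n = 0;

        toy_barrier(vq); /* one read barrier for the whole batch */
        while (n < batch && vq->last_avail_idx != vq->avail_idx)
            heads[n++] = vq->avail[vq->last_avail_idx++ % RING_SIZE];

        /* ... all n packets would be transmitted here ... */

        for (unsigned i = 0; i < n; i++)
            vq->used[(uint16_t)(vq->used_idx + i) % RING_SIZE] = heads[i];
        toy_barrier(vq); /* one write barrier for the whole batch */
        vq->used_idx = (uint16_t)(vq->used_idx + n);
        vq->used_idx_writes++;
        sent += n;
    }
    return sent;
}
```

In this model, draining 64 descriptors one by one costs two counted barriers and one used-index write per descriptor, while a batch size of 64 costs two barriers and one write in total. The shared used index is the cache line the guest polls, so fewer writes to it is also what the "cache line bouncing" item refers to. The real numbers in vhost depend on details this sketch leaves out.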
> So how about this patchset is refactored:
>
> 1. use existing APIs just first get packets then
>    transmit them all then use them all

It looks like the current API cannot get packets first; it only supports
getting packets one by one (if you mean vhost_get_vq_desc()). And used
ring updating may get more misses in this case.

> 2. add new APIs and move the loop into vhost core
>    for more speedups

I don't see any advantages; it looks like we would just need some, e.g.,
callbacks in this case.

Thanks