Re: [PATCH v2] vsock: replace virtio_vsock_pkt with sk_buff
From: Bobby Eshleman <hidden>
Date: 2022-10-18 03:42:59
Also in:
kvm, lkml
On Tue, Oct 04, 2022 at 09:55:55PM +0000, Bobby Eshleman wrote:
On Sat, Oct 15, 2022 at 12:49:59PM -0700, Cong Wang wrote:quoted
On Mon, Oct 03, 2022 at 12:11:39AM +0000, Bobby Eshleman wrote:quoted
On Thu, Oct 06, 2022 at 09:34:10AM +0200, Stefano Garzarella wrote:quoted
On Thu, Oct 06, 2022 at 03:08:12AM -0400, Michael S. Tsirkin wrote:quoted
On Wed, Oct 05, 2022 at 06:19:44PM -0700, Bobby Eshleman wrote:quoted
This patch replaces the struct virtio_vsock_pkt with struct sk_buff. Using sk_buff in vsock benefits it by a) allowing vsock to be extended for socket-related features like sockmap, b) vsock may in the future use other sk_buff-dependent kernel capabilities, and c) vsock shares commonality with other socket types. This patch is taken from the original series found here: https://lore.kernel.org/all/cover.1660362668.git.bobby.eshleman@bytedance.com/ (local) Small-sized packet throughput improved by ~5% (from 18.53 Mb/s to 19.51 Mb/s). Tested using uperf, 16B payloads, 64 threads, 100s, averaged from 10 test runs (n=10). This improvement is likely due to packet merging. Large-sized packet throughput decreases ~9% (from 27.25 Gb/s to 25.04 Gb/s). Tested using uperf, 64KB payloads, 64 threads, 100s, averaged from 10 test runs (n=10). Medium-sized packet throughput decreases ~5% (from 4.0 Gb/s to 3.81 Gb/s). Tested using uperf, 4k to 8k payload sizes picked randomly according to normal distribution, 64 threads, 100s, averaged from 10 test runs (n=10).It is surprizing to me that the original vsock code managed to outperform the new one, given that to my knowledge we did not focus on optimizing it.Yeah mee to.Indeed.quoted
From this numbers maybe the allocation cost has been reduced as it performs better with small packets. But with medium to large packets we perform worse, perhaps because previously we were allocating a contiguous buffer up to 64k? Instead alloc_skb() could allocate non-contiguous pages ? (which would solve the problems we saw a few days ago)I think this would be the case with alloc_skb_with_frags(), but internally alloc_skb() uses kmalloc() for the payload and sk_buff_head slab allocations for the sk_buff itself (all the more confusing to me, as the prior allocator also uses two separate allocations per packet).I think it is related to your implementation of virtio_transport_add_to_queue(), where you introduced much more complicated logic than before: - spin_lock_bh(&vsock->send_pkt_list_lock); - list_add_tail(&pkt->list, &vsock->send_pkt_list); - spin_unlock_bh(&vsock->send_pkt_list_lock); - + virtio_transport_add_to_queue(&vsock->send_pkt_queue, skb);I wish it were that easy, but I included this change because it actually boosts performance. For 16B payloads, this change improves throughput from 16 Mb/s to 20Mb/s in my test harness, and reduces the memory usage of the kmalloc-512 and skbuff_head_cache slab caches by ~50MB at cache size peak (total slab cache size from ~540MB to ~390MB), but typically (not at peak) the slab
Edit: from ~590MB to ~540MB. Mixed up numbers in editing the paragraph.
cache size when this merging is used keeps the memory slab caches closer to ~150MB smaller. Tests done using uperf.
For payloads greater than GOOD_COPY_LEN I don't see any any notable difference between the skb code with merging and the skb code without merging in terms of throughput. I assume this is because the skb->len comparison with GOOD_COPY_LEN should short circuit the expression and the other memory operations should not occur.quoted
A simple list_add_tail() is definitely faster than your virtio_transport_skbs_can_merge() check. So, why do you have to merge skb while we don't merge virtio_vsock_pkt?sk_buff is over twice the size of virtio_vsock_pkt (96B vs 232B). It seems wise to reduce the footprint in other ways to try and keep it comparable.quoted
_If_ you are trying to mimic TCP, I think you are doing it wrong, it can be much more efficient if you could do the merge in sendmsg() before skb is even allocated, see tcp_sendmsg_locked().I'll definitely give it a read, merging before allocating an skb sounds better. Best, Bobby