RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel
From: Xin, Xiaohui <hidden>
Date: 2010-09-15 01:51:31
Also in: kvm, lkml
From: Shirley Ma [mailto:mashirle@us.ibm.com]
Sent: Tuesday, September 14, 2010 11:05 PM
To: Avi Kivity
Cc: David Miller; arnd@arndb.de; mst@redhat.com; Xin, Xiaohui; netdev@vger.kernel.org; kvm@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

On Tue, 2010-09-14 at 11:12 +0200, Avi Kivity wrote:
> > > + base = (unsigned long)from->iov_base + offset1;
> > > + size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
> > > + num_pages = get_user_pages_fast(base, size, 0, &page[i]);
> > > + if ((num_pages != size) ||
> > > +     (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
> > > + /* put_page is in skb free */
> > > + return -EFAULT;
> >
> > What keeps the user from writing to these pages in it's address space
> > after the write call returns?
> >
> > A write() return of success means:
> >
> > "I wrote what you gave to me"
> >
> > not
> >
> > "I wrote what you gave to me, oh and BTW don't touch these
> > pages for a while."
> >
> > In fact "a while" isn't even defined in any way, as there is no way
> > for the write() invoker to know when the networking card is done with
> > those pages.
>
> That's what io_submit() is for. Then io_getevents() tells you what
> "a while" actually was.

This macvtap zero copy uses iov buffers from vhost ring, which is
allocated from guest kernel. In host kernel, vhost calls macvtap
sendmsg. macvtap sendmsg calls get_user_pages_fast to pin these buffers'
pages for zero copy.

The patch is relying on how vhost handle these buffers. I need to look
at vhost code (qemu) first for addressing the questions here.

Thanks
Shirley
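
For readers following the quoted hunk, the `size` expression is the usual "how many pages does this buffer touch" arithmetic: in the kernel `~PAGE_MASK` equals `PAGE_SIZE - 1`, so the formula adds the offset within the first page to `len` and rounds up to whole pages. Below is a minimal user-space sketch of that calculation. The 4 KiB page size and the helper name `pages_spanned` are assumptions made for illustration and are not part of the patch.

```c
/* User-space sketch of the page-count arithmetic from the quoted hunk.
 * Assumption: 4 KiB pages. The kernel defines these macros per
 * architecture; they are redefined here only for illustration. */
#define PAGE_SHIFT 12UL
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Hypothetical helper: number of pages touched by [base, base + len). */
static unsigned long pages_spanned(unsigned long base, unsigned long len)
{
    /* (base & ~PAGE_MASK) is the offset into the first page;
     * adding ~PAGE_MASK (PAGE_SIZE - 1) rounds up to whole pages. */
    return ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
}
```

With these assumptions, a page-aligned 4096-byte buffer spans one page, while a 32-byte buffer starting 16 bytes before a page boundary spans two. That is why `num_pages` is compared against `size` rather than against `len`.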
I think what David said is what we considered before in the mp device.
We cannot be sure of the exact time at which the tx buffer has been
written out by the DMA operation, but the deadline is when the tx buffer
is freed. So we only notify vhost about the write when the tx buffer is
freed. That deadline may be too late for good performance, though.

Thanks
Xiaohui
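
The "notify when the tx buffer is freed" idea can be sketched roughly as a reference-counted buffer with a completion hook that fires when the last reference is dropped. This is only an illustration of the ordering described above, not the mp device or vhost code: the struct, the function names and the callback are all hypothetical, and a real implementation would need atomic reference counting.

```c
#include <stddef.h>

/* Hypothetical descriptor for a pinned tx buffer (illustration only). */
struct tx_buf {
    int refs;                            /* users still holding the pages */
    void (*on_free)(struct tx_buf *buf); /* assumed completion hook */
    int completed;                       /* set once the owner is notified */
};

/* Take an extra reference, e.g. while the device may still read the pages. */
static void tx_buf_get(struct tx_buf *buf)
{
    buf->refs++;
}

/* Drop a reference; the owner is notified only when the last one goes away.
 * Returns 1 if this call released the buffer, 0 otherwise.
 * Not thread-safe: a plain int stands in for an atomic counter. */
static int tx_buf_put(struct tx_buf *buf)
{
    if (--buf->refs > 0)
        return 0;
    if (buf->on_free != NULL)
        buf->on_free(buf);
    return 1;
}

/* Example hook: record that the submitter may now reuse the pages. */
static void notify_owner(struct tx_buf *buf)
{
    buf->completed = 1;
}
```

In this sketch the submitter learns that the pages are reusable only after the final `tx_buf_put()`, which is safe but can come later than the actual end of the DMA. That is the performance concern raised above.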