Thread: 38 messages, 4 authors, 2011-05-26

Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice

From: "Michael S. Tsirkin" <mst@redhat.com>
Date: 2011-05-18 16:50:38
Also in: kvm, lkml

On Wed, May 18, 2011 at 01:40:29PM +0200, Michał Mirosław wrote:
On 18 May 2011 at 13:17, Michael S. Tsirkin
[off-list ref] wrote:
quoted
On Wed, May 18, 2011 at 01:10:50PM +0200, Michał Mirosław wrote:
quoted
2011/5/18 Michael S. Tsirkin [off-list ref]:
quoted
On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote:
quoted
On Tue, 2011-05-17 at 23:48 +0200, Michał Mirosław wrote:
quoted
2011/5/17 Shirley Ma [off-list ref]:
quoted
Hello Michael,

It looks like using a new flag requires more time/work. I am wondering
whether we can just use the HIGHDMA flag to enable zero-copy in macvtap,
to avoid the new flag for now, since macvtap uses real NICs as the lower
device?

Is there any other restriction besides requiring driver to not recycle
the skb? Are there any drivers that recycle TX skbs?
Not just recycling skbs, keeping reference to any of the pages in the
skb. Another requirement is to invoke the callback
in a timely fashion.  For example virtio-net doesn't limit the time until
that happens (skbs are only freed when some other packet is
transmitted), so we need to avoid zcopy for such (nested-virt)
scenarios, right?
Hmm. But every hardware driver supporting SG will keep reference to
the pages until the packet is sent (or DMA'd to the device). This can
take a long time if hardware queue happens to stall for some reason.
That's a fundamental property of zero copy transmit.
You can't let the application/guest reuse the memory until
no one looks at it anymore.
quoted
Is it that you mean keeping a reference after all skbs pointing to the
pages are released?
No one should reference the pages after the callback is invoked, yes.
quoted
quoted
quoted
quoted
No other restrictions; skb clone is OK. pskb_expand_head() looks
OK to me from code review.
Hmm. pskb_expand_head calls skb_release_data while keeping
references to pages. How is that OK? What am I missing?
It makes a copy of the skb_shinfo earlier, so the pages' refcount
stays the same.
Exactly. But the callback is invoked, so the guest thinks it's OK to
change this memory. If it does, a corrupted packet will be sent out.
Hmm. I took a quick look at skb_clone(), and it looks like this
sequence will break this scheme:

skb2 = skb_clone(skb...);
kfree_skb(skb) or pskb_expand_head(skb);  /* callback called */
[use skb2, pages still referenced]
kfree_skb(skb); /* callback called again */
This sequence is common in bridge, might be in other places.

Maybe this ubuf thing should just track clones? This will make it work
on all devices then.
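
To illustrate the "track clones" idea, here is a minimal userspace sketch, not kernel code. It assumes the user buffer carries its own reference count: each clone takes a reference, each free drops one, and the completion callback runs only on the last drop. The names ubuf_model, ubuf_init, ubuf_get and ubuf_put are hypothetical and do not correspond to anything in the patch set.

```c
#include <assert.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct ubuf_model {
    int refs;        /* one per skb (original or clone) still pointing at the pages */
    int completions; /* how many times the "buffer is free" callback has run */
};

/* The first skb built on the user pages takes the initial reference. */
static inline void ubuf_init(struct ubuf_model *u)
{
    u->refs = 1;
    u->completions = 0;
}

/* Stand-in for the clone path: the clone shares the pages, so it takes a reference. */
static inline void ubuf_get(struct ubuf_model *u)
{
    assert(u->refs > 0);
    u->refs++;
}

/* Stand-in for the free path: only the final put notifies the buffer owner. */
static inline void ubuf_put(struct ubuf_model *u)
{
    assert(u->refs > 0);
    if (--u->refs == 0)
        u->completions++; /* models invoking the zero-copy completion callback */
}
```

With this model, the clone-then-free sequence above reports completion exactly once, after the last skb that references the pages has gone away, instead of once per freed skb.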

Best Regards,
Michał Mirosław
Well, bridge has the problem that a packet might end up anywhere and it's
really hard to track. Same for tun - it can get queued forever.
veth, loopback are all a problem I think.

IOW we really want to limit this to real physical NICs
which mostly all DTRT. Whitelisting them with a new flag
is likely the most conservative approach, no?
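
As a rough userspace sketch of what whitelisting would mean for the caller (again not kernel code): zero-copy is chosen only when the lower device sets a dedicated opt-in bit, and everything else falls back to copying. MODEL_F_ZEROCOPY_OK, MODEL_F_HIGHDMA, netdev_model and pick_tx_mode are hypothetical names made up for this illustration; the real flag name and value are whatever the patch defines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits for this model only. */
#define MODEL_F_HIGHDMA     (UINT32_C(1) << 0) /* an unrelated capability */
#define MODEL_F_ZEROCOPY_OK (UINT32_C(1) << 1) /* driver opted in to zero-copy TX */

struct netdev_model {
    uint32_t features;
};

enum tx_mode { TX_COPY, TX_ZEROCOPY };

/*
 * Whitelist check: only a device that explicitly sets the opt-in bit gets
 * zero-copy. Any other device, whatever else it advertises, gets a copy.
 */
static inline enum tx_mode pick_tx_mode(const struct netdev_model *lower)
{
    assert(lower != NULL);
    return (lower->features & MODEL_F_ZEROCOPY_OK) ? TX_ZEROCOPY : TX_COPY;
}
```

The point of a separate bit is that a device advertising only some other capability is never treated as safe by accident; the driver has to opt in.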

-- 
MST