Thread (20 messages) 20 messages, 3 authors, 2010-11-04

Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation

From: "Michael S. Tsirkin" <mst@redhat.com>
Date: 2010-11-03 10:48:19
Also in: kvm, lkml

On Mon, Nov 01, 2010 at 01:17:53PM -0700, Shirley Ma wrote:
On Sat, 2010-10-30 at 22:06 +0200, Michael S. Tsirkin wrote:
quoted
On Fri, Oct 29, 2010 at 08:43:08AM -0700, Shirley Ma wrote:
quoted
On Fri, 2010-10-29 at 10:10 +0200, Michael S. Tsirkin wrote:
quoted
Hmm. I don't yet understand. We are still doing copies into the
per-vq
quoted
quoted
buffer, and the data copied is really small.  Is it about cache
line
quoted
quoted
bounces?  Could you try figuring it out?
per-vq buffer is much less expensive than 3 put_copy() call. I will
collect the profiling data to show that.
What about __put_user? Maybe the access checks are the ones
that add the cost here? I attach patches to strip access checks:
they are not needed as we do them on setup time already, anyway.
Can you try them out and see if performance is improved for you
please?
On top of this, we will need to add some scheme to accumulate signals,
but that is a separate issue.
Yes, moving from put_user/get_user to __put_user/__get_user does improve
the performance by removing the checking.
I mean in practice, you see a benefit from this patch?
My concern here is whether checking only in set up would be sufficient
for security?
It better be sufficient because the checks that put_user does
are not effictive when run from the kernel thread, anyway.
Would be there is a case guest could corrupt the ring
later? If not, that's OK.
You mean change the pointer after it's checked?
If you see such a case, please holler.
quoted
quoted
quoted
quoted
quoted
2. How about flushing out queued stuff before we exit
   the handle_tx loop? That would address most of
   the spec issue. 
The performance is almost as same as the previous patch. I will
resubmit
quoted
the modified one, adding vhost_add_used_and_signal_n after
handle_tx
quoted
quoted
quoted
loop for processing pending queue.

This patch was a part of modified macvtap zero copy which I
haven't
quoted
quoted
quoted
submitted yet. I found this helped vhost TX in general. This
pending
quoted
quoted
quoted
queue will be used by DMA done later, so I put it in vq instead
of a
quoted
quoted
quoted
local variable in handle_tx.

Thanks
Shirley
BTW why do we need another array? Isn't heads field exactly what
we
quoted
quoted
need
here?
head field is only for up to 32, the more used buffers add and
signal
quoted
accumulated the better performance is from test results.
I think we should separate the used update and signalling.  Interrupts
are expensive so I can believe accumulating even up to 100 of them
helps. But used head copies are already prety cheap. If we cut the
overhead by x32, that should make them almost free?
I can separate the used update and signaling to see the best
performance.
quoted
quoted
That's was one
of the reason I didn't use heads. The other reason was I used these
buffer for pending dma done in mavctap zero copy patch. It could be
up
quoted
to vq->num in worse case.
We can always increase that, not an issue. 
Good, I will change heads up to vq->num and use it.

Thanks
Shirley
To clarify: the combination of __put_user and separate
signalling is giving the same performance benefit as your
patch?

I am mostly concerned with adding code that seems to help
speed for reasons we don't completely understand, because
then we might break the optimization easily without noticing.

-- 
MST
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help