Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors

From: Avi Kivity <hidden>
Date: 2011-12-06 09:58:34
Also in: kvm, lkml

Possibly related (same subject, not in this thread)

2011-12-09 · Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors · Rusty Russell <hidden>
2011-12-08 · Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors · Sasha Levin <hidden>
2011-12-07 · Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors · Avi Kivity <hidden>
2011-11-29 · Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors · "Michael S. Tsirkin" <mst@redhat.com>
2011-11-29 · Re: [PATCH] virtio-ring: Use threshold for switching to indirect descriptors · Sasha Levin <hidden>

On 12/06/2011 07:07 AM, Rusty Russell wrote:

On Mon, 05 Dec 2011 11:52:54 +0200, Avi Kivity [off-list ref] wrote:

quoted

On 12/05/2011 02:10 AM, Rusty Russell wrote:

quoted

On Sun, 04 Dec 2011 17:16:59 +0200, Avi Kivity [off-list ref] wrote:

quoted

On 12/04/2011 05:11 PM, Michael S. Tsirkin wrote:

quoted

There's also the used ring, but that's a
mistake if you have out of order completion.  We should have used copying.

Seems unrelated... unless you want used to be written into
descriptor ring itself?

The avail/used rings are in addition to the regular ring, no?  If you
copy descriptors, then it goes away.

There were two ideas which drove the current design:

1) The Van-Jacobson style "no two writers to same cacheline makes rings
   fast" idea.  Empirically, this doesn't show any winnage.

Write/write is the same as write/read or read/write.  Both cases have to
send a probe and wait for the result.  What we really need is to
minimize cache line ping ponging, and the descriptor pool fails that
with ooo completion.  I doubt it's measurable though except with the
very fastest storage providers.

The claim was that going exclusive->shared->exclusive was cheaper than
exclusive->invalid->exclusive.  When VJ said it, it seemed convincing :)

IIRC at some point in time certain transitions were forced to go through
main memory; perhaps that was behind this.  I think these days
everything goes through the coherency protocol; not sure though.

quoted

2) Allowing a generic inter-guest copy mechanism, so we could have
   genuinely untrusted driver domains.  Yet noone ever did this so it's
   hardly a killer feature :(

It's still a goal, though not an important one.  But we have to
translate rings anyway, don't, since buffers are in guest physical
addresses, and we're moving into an address space that doesn't map those.

Yes, but the hypervisor/trusted party would simply have to do the copy;
the rings themselves would be shared A would say "copy this to/from B's
ring entry N" and you know that A can't have changed B's entry.

Sorry, I don't follow.  How can the rings be shared?  If A puts a gpa in
A's address space into the ring, there's no way B can do anything with
it, it's an opaque number.  Xen solves this with an extra layer of
indirection (grant table handles) that cost extra hypercalls to map or copy.

quoted

I thought of having a vhost-copy driver that could do ring translation,
using a dma engine for the copy.

As long as we get the length of data written from the vhost-copy driver
(ie. not just the network header).  Otherwise a malicious other guest
can send short packets, and a local process can read uninitialized
memory.  And pre-zeroing the buffers for this corner case sucks.

Right.  It's important to get the metadata visible at the right layer.

quoted

Also, stuff the cookie into len_and_flags as well.

Every driver really wants to put a pointer in there.  We have an array
to map desc. numbers to cookies inside the virtio core.

We really want 64 bits.

With multiqueue, it may be cheaper to do the extra translation locally
than to ship the cookie across cores (or, more likely, it will make no
difference).

However, moving pointers only works if you trust the other side.  That
doesn't work if we manage to share a ring.

quoted

2) Stick with physically-contiguous rings, but use them of size (2^n)-1.
   Makes the indexing harder, but that -1 lets us stash the indices in
   the first entry and makes the ring a nice 2^n size.

Allocate at lease a cache line for those.  The 2^n size is not really
material, a division is never necessary.

We free-run our indices, so we *do* a division (by truncation).  If we
limit indices to ringsize, then we have to handle empty/full confusion.

It's nice for simple OSes if things pack nicely into pages, but it's not
a killer feature IMHO.

Agree.

quoted

16kB worth of descriptors is 1024 entries.  With 4kB buffers, that's 4MB
worth of data, or 4 ms at 10GbE line speed.  With 1500 byte buffers it's
just 1.5 ms.  In any case I think it's sufficient.

Right. So I think that without indirect, we waste about 3 entries
per packet for virtio header and transport etc headers.

That does suck.  Are there issues in increasing the ring size?  Or
making it discontiguous?

Because the qemu implementation is broken.

I was talking about something else, but this is more important.  Every
time we make a simplifying assumption, it turns around and bites us, and
the code becomes twice as complicated as it would have been in the first
place, and the test matrix explodes.

True, though we seem to be improving.  But this is why I don't want
optional features in the spec; I want us always to exercise all of it.

Yes, and I think the main reason we're improving is that we actually
have a spec these days.  Agree about optional features, though of course
we accumulate them by not thinking of everything ahead of time, I don't
see how we can fix that.

quoted

We currently use small rings: the guest can't negotiate so qemu has to
offer a lowest-common-denominator value.  The new virtio-pci layout
fixes this, and lets the guest set the ring size.

Ok good.  Note the figuring out the best ring size needs some info from
the host, but that can be had from other channels.

We have a ringsize field; should the host initialize it as a hint?  Or
as a maximum allowable?

I'm just not sure how the host would even know to hint.

For JBOD storage, a good rule of thumb is (number of spindles) x 3. 
With less, you might leave an idle spindle; with more, you're just
adding latency.  This assumes you're using indirects so ring entry ==
request.  The picture is muddier with massive battery-backed RAID
controllers or flash.

For networking, you want (ring size) * min(expected packet size, page
size) / (link bandwidth) to be something that doesn't get the
bufferbloat people after your blood.

quoted

Can you take a peek at how Xen manages its rings?  They have the same
problems we do.

Yes, I made some mistakes, but I did steal from them in the first
place...

There was a bit of second system syndrome there.

But also some third syndrome: Xen had gone through their great device
rewrite, and things like feature bits were a direct response against
that complexity.  We did well there.

Yes, feature bits really proved themselves.

-- 
error compiling committee.c: too many arguments to function

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help