Re: [Xen-devel] [PATCH 0001/001] xen: multi page ring support for block devices

From: Jan Beulich <hidden>
Date: 2012-03-14 08:35:24
Also in: virtualization

quoted

On 14.03.12 at 07:32, Justin Gibbs [off-list ref] wrote:

There's another problem here that I brought up during the Xen
Hack-a-thon.  The ring macros require that the ring element count
be a power of two.  This doesn't mean that the ring will be a power
of 2 pages in size.  To illustrate this point, I modified the FreeBSD
blkback driver to provide negotiated ring stats via sysctl.

Here's a connection to a Windows VM running the Citrix PV drivers:

    dev.xbbd.2.max_requests: 128
    dev.xbbd.2.max_request_segments: 11
    dev.xbbd.2.max_request_size: 45056
    dev.xbbd.2.ring_elem_size: 108  <= 32bit ABI
    dev.xbbd.2.ring_pages: 4
    dev.xbbd.2.ring_elements: 128
    dev.xbbd.2.ring_waste: 2496

Over half a page is wasted when ring-page-order is 2.  I'm sure you
can see where this is going.  :-)

Here are the limits published by our backend to the XenStore:

    max-ring-pages = "113"
    max-ring-page-order = "7"
    max-requests = "256"
    max-request-segments = "129"
    max-request-size = "524288"

Because we allow so many concurrent, large requests in our product,
the ring wastage really adds up if the front end doesn't support
the "ring-pages" variant of the extension.  However, you only need
a ring-page-order of 3 with this protocol to start seeing pages of
wasted ring space.

You don't really want to negotiate "ring-pages" either.  The backends
often need to support multiple ABIs.  I can easily construct a set
of limits for the FreeBSD blkback driver which will cause the ring
limits to vary by a page between the 32bit and 64bit ABIs.

With all this in mind, the backend must do a dance of rounding up,
taking the max of the ring sizes for the different ABIs, and then
validating the front-end published limits taking its ABI into
account.  The front-end does some of this too.  Its way too messy
and error prone because we don't communicate the ring element limit
directly.

"max-ring-element-order" anyone? :-)

Interesting observation - yes, I think deprecating both pre-existing
methods in favor of something along those lines would be desirable.
(But I'd favor not using the term "order" here as it is - at least in
Linux - usually implied to be used on pages. "max-ringent-log2"
perhaps?)

What you say also implies that all currently floating around Linux
backend patches are flawed in their way of calculating the number
of ring entries, as this number really depends on the protocol the
frontend advertises.

Further, if you're concerned about wasting ring space (and
particularly in the context of your request number/size/segments
extension), shouldn't we bother to define pairs (or larger groups)
of struct blkif_request_segment (as currently a quarter of the space
is mere padding)? Or split grefs from {first,last}_sect altogether?

Finally, while looking at all this again, I stumbled across the use
of blkif_vdev_t in the ring structures: At least Linux'es blkback
completely ignores this field - {xen_,}vbd_translate() simply
overwrites what dispatch_rw_block_io() put there (and with this,
struct phys_req's dev and bdev members seem rather pointless too).
Does anyone recall what the original intention with this request field
was? Allowing I/O on multiple devices over a single ring?

Bottom line - shouldn't we define a blkif2 interface to cleanly
accommodate all the various extensions (and do away with the
protocol variations)?

Jan

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help