Re: [PATCH 5/5] aio: Refactor aio_read_evt, use cmxchg(), fix bug

From: Zach Brown <hidden>
Date: 2012-10-09 22:47:14
Also in: dm-devel, lkml

If libaio is the only thing in userspace looking at the ringbuffer, and
if I'm looking at the latest libaio code this shouldn't break
anything...

We can't assume that libaio is the only thing in userspace using the
mapped buffer -- as scary a thought as that is :).

If we wanted to change the behaviour of the ring we'd give userspace a
way to indicate that it accepts the new semantics.  I'm not immediately
sure how best to do that -- haven't given it much thought.

Sticking the head and tail pointers on the same cacheline is one of my
main complaints. If you could find that comment I'd be interested in
reading it, though.

Yeah, that was on my list.. Heh, found it!

/*
* This structure defines the head of a ring of events that is mapped into user
* space.  It isn't used anymore because it has a number of design flaws.  I'll
* mention them here in the hopes that people read this when considering a
* design for a ring shared between user and kernel space.
*
* - The head and tail pointers are in the same cacheline.  This introduces
* false sharing between producers and consumers which could otherwise operate
* independent of one-another, presuming that the producers know ahead of time
* that room has been reserved for them.
*
* - The head and tail semantics are unconventional.  They're always less than
* the number of events in the ring and their meaning is reversed from the
* usual construct that one sees in, for example, ethernet hardware where the
* ring is a power of two and the indexes wrap but are masked when used to
* dereference elements.
*
* - Because of the weird head and tail semantics one can't distinguish from
* completely empty and full rings.  To work around this the ring is actually
* created with more events than the aio user asked for and that number is
* accounted for separately.  The number of free elements in the ring is
* greater than the number of operations that the kernel aio submission will
* allow.
*
* - Synchronizing head and tail updates between the kernel and user space
* requires a u32 cmpxchg.  futexes have infrastructure for this which requires
* working with user addresses.  Trying to nicely re-use the futex code leads
* to awkward code in aio which sometimes wants to work through kmap()
* addresses and other times want to work through the mmaped user space
* pointers.
*
* - The head and tail pointers are restricted in value to being less than the
* number of elements in the ring.  This means that a cmpxchg can see a index
* which has wrapped and confuse it for the index that it read before trying
* cmpxchg.  This can end up sending a duplicated previous event instead of a
* currently queued event in the worst case.
*
* - User space doesn't explicitly indicate that it wants to work with the ring
* from user space.  Even if user space never touches the ring the kernel still
* has pay the cost of the potential sharing with user space when inserting
* events.
*
* - The kernel magically maps it into the address space.  Users can't put it
* in whatever memory they like, like shared mem or hugetlb pages or tmpfs
* pages that they want to sendfile out of.
*
* - The ring is allocated out of unswappable kernel memory.  It would be
* better to have the ring allocated in userspace and given to the kernel.  In
* the common case the pages will be present and the overhead is minimal and
* complexity is kept out of the kernel.  In the worst case of memory pressure
* the kernel has fewer pinned pages tying its hands.
*/

The submission and completion interface that I threw together for the
acall experiments took all this into account:

 http://lwn.net/Articles/316806/

Man, this all brings back memories :).

- z

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help