Thread (22 messages) 22 messages, 5 authors, 2015-10-08

Re: Next round: revised futex(2) man page for review

From: Darren Hart <hidden>
Date: 2015-10-08 14:36:54
Also in: linux-man, lkml

On Wed, Oct 07, 2015 at 09:30:46AM +0100, Michael Kerrisk (man-pages) wrote:
Hello Thomas,

Thanks for the follow up!

Some open questions below are marked with the string ###.
A couple of comments from me below, although I suspect you have this much
covered already.
On 08/19/2015 04:17 PM, Thomas Gleixner wrote:
quoted
On Sat, 8 Aug 2015, Michael Kerrisk (man-pages) wrote:
quoted
quoted
quoted
       FUTEX_CMP_REQUEUE (since Linux 2.6.7)
              This  operation  first  checks  whether the location uaddr
              still contains the value  val3.   If  not,  the  operation
              fails  with  the  error  EAGAIN.  Otherwise, the operation
              wakes up a maximum of val waiters that are waiting on  the
              futex  at uaddr.  If there are more than val waiters, then
              the remaining waiters are removed from the wait  queue  of
              the  source  futex at uaddr and added to the wait queue of
              the target futex at uaddr2.  The val2  argument  specifies
              an  upper limit on the number of waiters that are requeued
              to the futex at uaddr2.

.\" FIXME(Torvald) Is the following correct?  Or is just the decision
.\" which threads to wake or requeue part of the atomic operation?

              The load from uaddr is  an  atomic  memory  access  (i.e.,
              using atomic machine instructions of the respective archi‐
              tecture).  This load, the comparison with  val3,  and  the
              requeueing  of  any  waiters  are performed atomically and
              totally ordered with respect to other  operations  on  the
              same futex word.
It's atomic as the other atomic operations on the futex word. It's
always performed with the proper lock(s) held in the kernel. That
means any concurrent operation will serialize on that lock(s). User
space has to make sure, that depending on the observed value no
concurrent operations happen, but that's something the kernel cannot
control.
???
Sorry, I'm not clear here. Is the current text correct then? Or is some
change needed.
I think we need some change here because the meaning of atomic is
unclear. The basic rules of futexes are:

 - All modifying operations on the futex value have to be done with
   atomic instructions, usually cmpxchg. That applies to both kernel
   and user space.

   That's the atomicity at the futex value level.

 - In the kernel we have to create/modify/destroy state in order to
   provide the blocking/requeueing etc.

   This state needs protection as well. So all operations related to
   the kernel internal state are serialized on the hash bucket
   locks. The hash buckets are a scalability mechanism to avoid
   contention on a single lock protecting all kernel internal
   state. For simplicity reasons you can just think of a global lock
   protecting all kernel internal state.

   If the kernel creates/modifies state then it can be necessary to
   either reread the futex value or modify it. That happens under the
   locks as well.

   So in the case of requeue, we take the proper locks and perform the
   comparison with val3 and the requeueing with the locks held.
   
   So that lock protection makes these operations 'atomic'. The
   correct expression is 'serialized'.
###
So, here, i think I need some specific pointers on the precise text
changes that are required. Let's talk about this f2f. For convenience,
here's the relevant text once again quoted:
Not speaking for tglx, but I think the point here is to distinguish between
atomic (as in cmpxchg comparison tests performed on the futex word) and
serialized (as in the management of futex hashbuckets and task states).
       FUTEX_CMP_REQUEUE (since Linux 2.6.7)
              This  operation  first  checks  whether the location uaddr
              still contains the value  val3.   If  not,  the  operation
              fails  with  the  error  EAGAIN.  Otherwise, the operation
Here you might explain the _CMP_ qualifier and note atomicity of the operation:

The _CMP_ refers to the verification of the userspace state as specified by
through the arguments. This operation first atomically compares the value at
uaddr with the value val3 ...

              wakes up a maximum of val waiters that are waiting on  the
              futex  at uaddr.  If there are more than val waiters, then
              the remaining waiters are removed from the wait  queue  of
              the  source  futex at uaddr and added to the wait queue of
              the target futex at uaddr2.  The val2  argument  specifies
              an  upper limit on the number of waiters that are requeued
              to the futex at uaddr2.
And here, perhaps:
These operations are serialized on locks protecting the internal datastructures
storing the kernel futex state.

I believe this paragraph is sufficient, but I'll comment on the second paragraph
below:
              The load from uaddr is  an  atomic  memory  access  (i.e.,
              using atomic machine instructions of the respective archi‐
              tecture).  This load, the comparison with  val3,
We are "atomic" up to here. ^

And what follows below is serialized by the hash bucket locks.:
               and  the
              requeueing  of  any  waiters  are performed atomically and
              totally ordered with respect to other  operations  on  the
              same futex word.
I personally think this is too much implementation detail for the man page,
especially for FUTEX_CMP_REQUEUE which does not manipulate the futex word value
from kernel space (We're not discussing PI here). The first paragraph is
sufficient to fully describe the guaranteed behavior of the opcode.



quoted
quoted
quoted
quoted
.\" FIXME We need some explanation in the following paragraph of *why*
.\"       it is important to note that "the kernel will update the
.\"       futex word's value prior
       It is important to note to returning to user space" . Can someone
       explain?   that  the  kernel  will  update the futex word's value
       prior to returning to user space.  Unlike the other futex  opera‐
       tions  described  above, the PI futex operations are designed for
       the implementation of very specific IPC mechanisms.
If there are multiple waiters on a pi futex then a wake pi operation
will wake the first waiter and hand over the lock to this waiter. This
includes handing over the rtmutex which represents the futex in the
kernel. The strict requirement is that the futex owner and the rtmutex
owner must be the same, except for the update period which is
serialized by the futex internal locking. That means the kernel must
update the user space value prior to returning to user space.
And as noted above: It must update while holding the proper locks.
(Okay.)

quoted
quoted
quoted
quoted
.\" FIXME XXX In discussing errors for FUTEX_CMP_REQUEUE_PI, Darren Hart
.\"       made the observation that "EINVAL is returned if the non-pi 
.\"       to pi or op pairing semantics are violated."
.\"       Probably there needs to be a general statement about this
.\"       requirement, probably located at about this point in the page.
.\"       Darren (or someone else), care to take a shot at this?
Well, that's hard to describe because the kernel only has a limited
way of detecting such mismatches. It only can detect it when there are
non PI waiters on a futex and a PI function is called or vice versa.
Hmmm. Okay, I filed your comments away for reference, but
hopefully someone can help with some actual text.
I let Darren come up with something sensible :)
Okay, let's see if Darren has something to say.
quoted
quoted
quoted
quoted
.\" FIXME Somewhere on this page (I guess under the discussion of PI
.\"       futexes) we need a discussion of the FUTEX_OWNER_DIED bit.
.\"       Can someone propose a text?
If a futex has a rtmutex associated in the kernel, i.e. when there are
blocked waiters, and the owner of the futex/rtmutex dies unexpectedly,
then the kernel cleans up the rtmutex (as it holds a reference to the
dying task) and hands it over to the next waiter. That requires that
the user space value is updated accordingly. The kernel sets the
FUTEX_OWNER_DIED in the user space value along with the TID of the new
owner. User space is responsible for cleaning this up, though there
are cases where the kernel does the cleanup.
 
The FUTEX_OWNER_DIED bit can also be set on uncontended futexes, where
the kernel has no state associated. This happens via the robust futex
mechanism. In that case the futex value will be set to
FUTEX_OWNER_DIED. The robust futex mechanism is also available for non
PI futexes.
???
So, I added part of that text to the page, as follows:

       If  a futex has an associated RT-mutex in the kernel (i.e., there
       are blocked waiters) and the owner  of  the  futex/RT-mutex  dies
       unexpectedly, then the kernel cleans up the RT-mutex and hands it
       over to the next waiter.  This in turn requires  that  the  user-
       space  value  is  updated  accordingly.  To indicate that this is
       required, the kernel sets the FUTEX_OWNER_DIED bit in  the  futex
       word  along  with  the thread ID of the new owner.  User space is
       then responsible for cleaning this up  (though  there  are  cases
       where the kernel does the cleanup).

Okay?

I think the last sentence still requires a little work though. What does
user space need to do in terms of clean up?
User space has usually state as well. So the FUTEX_OWNER_DIED bit
tells userspace that it needs to cleanup the stale state left over by
the dead owner.
###
So I changed the last sentence to 

    User space is then responsible for cleaning up the stale state left
    over by the dead owner.

Okay?
We can't say much beyond it's userspace's problem now because how it's handled
will vary widely. It might be worth mentioning that userspace can use
FUTEX_OWNER_DIED to detect this state and handle it appropriately.
quoted
quoted
quoted
quoted
.\" FIXME In the next line, what type of "priority" are we talking about?
.\"       Realtime priorities for SCHED_FIFO and SCHED_RR?
.\"       Or something else?

              The
              enqueueing  of  the waiter is in descending priority order
              if more than one waiter exists.  
That also covers sched deadline.
???
Thanks. If the realm is restricted purely to SCHED_OTHER (SCHED_NORMAL)
processes, does the nice value come into play also?
No. SCHED_OTHER/NORMAL tasks are handled in FIFO order.
Okay. Thanks.
quoted
quoted
So by now, I've reworked this text to be:

       FUTEX_TRYLOCK_PI (since Linux 2.6.18)
              This operation tries to acquire the futex at uaddr.  It is
              invoked when a user-space atomic acquire did  not  succeed
              because the futex word was not 0.

              The trylock in kernel might succeed because the futex word
              contains     stale     state     (FUTEX_WAITERS     and/or
              FUTEX_OWNER_DIED).   This can happen when the owner of the
              futex died.  User space cannot handle this condition in  a
              race-free manner

Okay?

I must admit that I find "the trylock in kernel might succeed"hard
to understand. Could you elaborate a little?
If the user space value has stale state, then the kernel can fix that
up and acquire the futex.

Okay, so I made the last sentence:

    User space cannot handle this condition in a race-free manner,
    but the kernel can fix this up and acquire the futex.

Okay?

However, I realize I should have said more clearly why I find the text
hard to understand. As I read this text:

    The trylock in kernel might succeed because the futex word
    contains stale state FUTEX_WAITERS and/or FUTEX_OWNER_DIED).

###

The words "The trylock in kernel might succeed" makes it feel
like a contrast is being drawn with some other scenario, but
it's not clear what that alternative/contrast is. Do you see
what I mean? Let's talk about this f2f.
I presume you got this covered in Dublin:
The trylock in the kernel has more state, so it can independently verify the
flags that userspace must trust implicitly.
quoted
quoted
???
So now I've reworded the opening text describing FUTEX_WAIT_REQUEUE_PI
as follows:

       FUTEX_WAIT_REQUEUE_PI (since Linux 2.6.31)
              Wait on  a  non-PI  futex  at  uaddr  and  potentially  be
              requeued  (via a FUTEX_CMP_REQUEUE_PI operation in another
              task) onto a PI futex at uaddr2.  The  wait  operation  on
              uaddr is the same as for FUTEX_WAIT.

              The  waiter  can be removed from the wait on uaddr without
              requeueing on uaddr2 via a FUTEX_WAIT operation in another
s/FUTEX_WAIT/FUTEX_WAKE/
Thanks. Fixed.
quoted
quoted
              task.   In  this case, the FUTEX_WAIT_REQUEUE_PI operation
              returns with the error EWOULDBLOCK.

Okay?
Yes.
Thanks.

Cheers,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
-- 
Darren Hart
Intel Open Source Technology Center
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help