Re: For review: documentation of clone3() system call

From: Jann Horn <jannh@google.com>
Date: 2019-10-28 15:12:39
Also in: linux-man, lkml

On Fri, Oct 25, 2019 at 6:59 PM Michael Kerrisk (man-pages)
[off-list ref] wrote:

I've made a first shot at adding documentation for clone3(). You can
see the diff here:
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=faa0e55ae9e490d71c826546bbdef954a1800969

[...]

   clone3()
       The  clone3() system call provides a superset of the functionality
       of the older clone() interface.  It also provides a number of  API
       improvements,  including: space for additional flags bits; cleaner
       separation in the use of various arguments;  and  the  ability  to
       specify the size of the child's stack area.

You might want to note somewhere that its flags can't be
seccomp-filtered because they're stored in memory, making it
inappropriate to use in heavily sandboxed processes.
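
To illustrate the point: a seccomp filter only sees the raw argument values in struct seccomp_data, so for clone3() the first argument is just a user-space pointer and the flags behind it are out of reach. About all a sandbox can do is refuse the call wholesale. A minimal sketch of that, not a production filter: the names no_clone3_filter and install_no_clone3_filter, the fallback syscall number 435, and the choice of ENOSYS as the error are assumptions made for this example, and the seccomp_data.arch check that a real filter needs is deliberately omitted.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

#ifndef SYS_clone3
#define SYS_clone3 435  /* assumption: value on architectures with unified syscall numbering */
#endif

/* Classic BPF program: fail clone3() with ENOSYS, allow everything else.
 * The filter cannot look at clone_args.flags, because only the pointer
 * to the structure is visible in seccomp_data.args[0].
 * NOTE: no seccomp_data.arch check here; a real filter needs one. */
struct sock_filter no_clone3_filter[] = {
    /* Load the syscall number. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* clone3: fall through to the ERRNO return; anything else: skip it. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_clone3, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (ENOSYS & SECCOMP_RET_DATA)),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Installs the filter for the calling thread; returns 0 on success,
 * -1 with errno set on failure. */
int install_no_clone3_filter(void)
{
    struct sock_fprog prog = {
        .len = sizeof(no_clone3_filter) / sizeof(no_clone3_filter[0]),
        .filter = no_clone3_filter,
    };

    /* Required for unprivileged processes before installing a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

With the older clone(), by contrast, the flags arrive in a register and a filter can compare them against an allowed mask.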

           struct clone_args {
               u64 flags;        /* Flags bit mask */
               u64 pidfd;        /* Where to store PID file descriptor
                                    (int *) */
               u64 child_tid;    /* Where to store child TID,
                                    in child's memory (int *) */
               u64 parent_tid;   /* Where to store child TID,
                                    in parent's memory (int *) */
               u64 exit_signal;  /* Signal to deliver to parent on
                                    child termination */
               u64 stack;        /* Pointer to lowest byte of stack */
               u64 stack_size;   /* Size of stack */
               u64 tls;          /* Location of new TLS */
           };

       The size argument that is supplied to clone3() should be  initial‐
       ized  to  the  size of this structure.  (The existence of the size
       argument permits future extensions to the clone_args structure.)

       The stack for the child process is  specified  via  cl_args.stack,
       which   points   to  the  lowest  byte  of  the  stack  area,  and

Here and in the comment in the struct above, you say that .stack
"points to the lowest byte of the stack area", but isn't that
architecture-dependent? For most architectures, I think it should
instead be "is the initial stack pointer", with the exception of IA64
(and maybe others, I'm not sure). For example, on X86, when launching
a thread with an initially empty stack, it points directly *after* the
end of the stack area.
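
To make the distinction concrete, here is a small sketch of how an "initial stack pointer" for an empty, downward-growing stack is derived from an allocation. The helper name initial_sp_grows_down and the 16-byte alignment are assumptions for this example (the alignment matches the usual x86-64 convention); it says nothing about what a particular kernel version does with the stack field.

```c
#include <stddef.h>
#include <stdint.h>

/* For a stack that grows towards lower addresses, the initial stack
 * pointer of an empty stack is the address one past the highest byte
 * of the area, not the lowest byte. Rounded down to 16 bytes here
 * (assumption: x86-64-style alignment). */
uint64_t initial_sp_grows_down(void *lowest, size_t size)
{
    uintptr_t top = (uintptr_t)lowest + size;

    return (uint64_t)(top & ~(uintptr_t)15);
}
```

For an area starting at 0x10000 with size 0x4000, the lowest byte is at 0x10000 while the initial stack pointer is 0x14000.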

       cl_args.stack_size, which specifies  the  size  of  the  stack  in
       bytes.   In the case where the CLONE_VM flag (see below) is speci‐

stack_size is ignored on most architectures.

       fied, a stack must be explicitly allocated and specified.   Other‐
       wise,  these  two  fields  can  be  specified as NULL and 0, which
       causes the child to use the same stack area as the parent (in  the
       child's own virtual address space).

[...]

   Equivalence between clone() and clone3() arguments
       Unlike  the  older  clone()  interface, where arguments are passed
       individually, in the newer clone3() interface  the  arguments  are
       packaged  into  the clone_args structure shown above.  This struc‐
       ture allows for a superset  of  the  information  passed  via  the
       clone() arguments.

       The following table shows the equivalence between the arguments of
       clone() and the fields in  the  clone_args  argument  supplied  to
       clone3():

              clone()         clone3()        Notes
                              cl_args field
              flags & ~0xff   flags
              parent_tid      pidfd           See CLONE_PIDFD
              child_tid       child_tid       See CLONE_CHILD_SETTID
              parent_tid      parent_tid      See CLONE_PARENT_SETTID
              flags & 0xff    exit_signal
              stack           stack

              ---             stack_size

(Except that on ia64, stack_size also exists in clone2(). On other
architectures, or at least on x86, stack_size doesn't do anything, so
showing them side by side like this doesn't really make sense.)
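
The first and fifth rows of the table (flags & ~0xff and flags & 0xff) can be sketched as a small conversion helper. The names split_flags, split_clone_flags and LEGACY_CSIGNAL are made up for this example; only the 0xff mask comes from the table.

```c
#include <stdint.h>

/* The low byte of the legacy clone() flags word carries the exit signal. */
#define LEGACY_CSIGNAL 0xffULL

struct split_flags {
    uint64_t flags;        /* goes into clone_args.flags */
    uint64_t exit_signal;  /* goes into clone_args.exit_signal */
};

/* Split a legacy clone() flags word into the two clone3() fields. */
struct split_flags split_clone_flags(unsigned long legacy_flags)
{
    struct split_flags out;

    out.flags = (uint64_t)legacy_flags & ~LEGACY_CSIGNAL;
    out.exit_signal = (uint64_t)legacy_flags & LEGACY_CSIGNAL;
    return out;
}
```

For example, a legacy flags word of 0x100 | 17 yields flags 0x100 and exit_signal 17.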
