Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

[RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH for 4.15 02/24] Restartable sequences: ARM 32 architecture support · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH for 4.15 06/24] Restartable sequences: powerpc architecture support · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH for 4.15 03/24] Restartable sequences: wire up ARM 32 system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH for 4.15 23/24] membarrier: selftest: Test private expedited sync core cmd · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
Re: [RFC PATCH for 4.15 23/24] membarrier: selftest: Test private expedited sync core cmd · Shuah Khan <hidden> · 2017-11-17
Re: [RFC PATCH for 4.15 23/24] membarrier: selftest: Test private expedited sync core cmd · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-17
[RFC PATCH v2 for 4.15 21/24] x86: Introduce sync_core_before_usermode · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH for 4.15 18/24] membarrier: provide SHARED_EXPEDITED command · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
Re: [RFC PATCH for 4.15 18/24] membarrier: provide SHARED_EXPEDITED command · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-15
[RFC PATCH for 4.15 19/24] membarrier: selftest: Test shared expedited cmd · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
Re: [RFC PATCH for 4.15 19/24] membarrier: selftest: Test shared expedited cmd · Shuah Khan <hidden> · 2017-11-17
[RFC PATCH for 4.15 20/24] membarrier: Provide core serializing command · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH v7 for 4.15 16/24] membarrier: powerpc: Skip memory barrier in switch_mm() · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH v5 for 4.15 17/24] membarrier: Document scheduler barrier requirements · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH v2 for 4.15 22/24] membarrier: x86: Provide core serializing command · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH v2 for 4.15 13/24] Restartable sequences: Provide self-tests · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH v2 for 4.15 15/24] membarrier: selftest: Test private expedited cmd · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH v2 for 4.15 12/24] cpu_opv: Implement selftests · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH for 4.15 14/24] Restartable sequences selftests: arm: workaround gcc asm size guess · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH for 4.15 24/24] membarrier: arm64: Provide core serializing command · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH for 4.15 09/24] cpu_opv: Wire up x86 32/64 system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH for 4.15 11/24] cpu_opv: Wire up ARM32 system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-15
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Michael Kerrisk (man-pages) <hidden> · 2017-11-15
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-15
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Thomas Gleixner <hidden> · 2017-11-16
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Andi Kleen <hidden> · 2017-11-17
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Thomas Gleixner <hidden> · 2017-11-17
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-17
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Andi Kleen <hidden> · 2017-11-17
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Thomas Gleixner <hidden> · 2017-11-17
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Andi Kleen <hidden> · 2017-11-17
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Thomas Gleixner <hidden> · 2017-11-17
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Andy Lutomirski <luto@amacapital.net> · 2017-11-18
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Thomas Gleixner <hidden> · 2017-11-17
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-20
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-20
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Thomas Gleixner <hidden> · 2017-11-20
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Thomas Gleixner <hidden> · 2017-11-20
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-20
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-20
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Andi Kleen <hidden> · 2017-11-20
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-20
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Thomas Gleixner <hidden> · 2017-11-20
Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-21
[RFC PATCH for 4.15 10/24] cpu_opv: Wire up powerpc system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH for 4.15 07/24] Restartable sequences: Wire up powerpc system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH for 4.15 04/24] Restartable sequences: x86 32/64 architecture support · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
Re: [RFC PATCH for 4.15 04/24] Restartable sequences: x86 32/64 architecture support · Thomas Gleixner <hidden> · 2017-11-16
Re: [RFC PATCH for 4.15 04/24] Restartable sequences: x86 32/64 architecture support · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-19
Re: [RFC PATCH for 4.15 04/24] Restartable sequences: x86 32/64 architecture support · Thomas Gleixner <hidden> · 2017-11-20
[RFC PATCH for 4.15 05/24] Restartable sequences: wire up x86 32/64 system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
[RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Ben Maurer <hidden> · 2017-11-14
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Ben Maurer <hidden> · 2017-11-14
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Peter Zijlstra <peterz@infradead.org> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Peter Zijlstra <peterz@infradead.org> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Peter Zijlstra <peterz@infradead.org> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Thomas Gleixner <hidden> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Peter Zijlstra <peterz@infradead.org> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Peter Zijlstra <peterz@infradead.org> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Thomas Gleixner <hidden> · 2017-11-16
Re: [RFC PATCH v11 for 4.15 01/24] Restartable sequences system call · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-19
Re: [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 · Linus Torvalds <torvalds@linux-foundation.org> · 2017-11-14
Re: [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 · Andy Lutomirski <luto@amacapital.net> · 2017-11-14
Re: [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-14
Re: [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 · Andy Lutomirski <luto@amacapital.net> · 2017-11-15
Re: [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2017-11-15
Re: [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 · Paul Turner <hidden> · 2017-11-14
Re: [RFC PATCH for 4.15 00/24] Restartable sequences and CPU op vector v11 · Mathieu Desnoyers <mathieu.desnoyers@efficios.com> · 2018-03-27

From: Mathieu Desnoyers <hidden>
Date: 2017-11-20 16:12:25
Also in: lkml

----- On Nov 16, 2017, at 6:26 PM, Thomas Gleixner tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org wrote:

On Tue, 14 Nov 2017, Mathieu Desnoyers wrote:

quoted

+#ifdef __KERNEL__
+# include <linux/types.h>
+#else	/* #ifdef __KERNEL__ */

 		   Sigh.

fixed.

quoted

+# include <stdint.h>
+#endif	/* #else #ifdef __KERNEL__ */
+
+#include <asm/byteorder.h>
+
+#ifdef __LP64__
+# define CPU_OP_FIELD_u32_u64(field)			uint64_t field
+# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)	field = (intptr_t)v
+#elif defined(__BYTE_ORDER) ? \
+	__BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
+# define CPU_OP_FIELD_u32_u64(field)	uint32_t field ## _padding, field
+# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)	\
+	field ## _padding = 0, field = (intptr_t)v
+#else
+# define CPU_OP_FIELD_u32_u64(field)	uint32_t field, field ## _padding
+# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)	\
+	field = (intptr_t)v, field ## _padding = 0
+#endif

So in the rseq patch you have:

+#ifdef __LP64__
+# define RSEQ_FIELD_u32_u64(field)                     uint64_t field
+# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v)     field = (intptr_t)v
+#elif defined(__BYTE_ORDER) ? \
+       __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
+# define RSEQ_FIELD_u32_u64(field)     uint32_t field ## _padding, field
+# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v)     \
+       field ## _padding = 0, field = (intptr_t)v
+#else
+# define RSEQ_FIELD_u32_u64(field)     uint32_t field, field ## _padding
+# define RSEQ_FIELD_u32_u64_INIT_ONSTACK(field, v)     \
+       field = (intptr_t)v, field ## _padding = 0
+#endif

IOW the same macro maze. Please use a separate header file which provides
these macros once and share them between the two facilities.

ok. I'll introduce uapi/linux/types_32_64.h, and prefix defines with
"LINUX_":

LINUX_FIELD_u32_u64()

unless other names are preferred. It will be in a separate patch.

quoted

+#define CPU_OP_VEC_LEN_MAX		16
+#define CPU_OP_ARG_LEN_MAX		24
+/* Max. data len per operation. */
+#define CPU_OP_DATA_LEN_MAX		PAGE_SIZE

That's something between 4K and 256K depending on the architecture.

You really want to allow up to 256K data copy with preemption disabled?
Shudder.

This is the max per operation. Following peterz' comments at KS, I added a
limit of 4096 + 15 * 8 on the sum of len for all operations in a vector. This
is defined below as CPU_OP_VEC_DATA_LEN_MAX.

So each of the 16 op cannot have a len larger than PAGE_SIZE, so we
pin at most 4 pages per op (e.g. memcpy 2 pages for src, 2 pages for dst),
*and* the sum of all ops len needs to be <= 4216. So the max limit you are
interested in here is the 4216 bytes limit.

quoted

+/*
+ * Max. data len for overall vector. We to restrict the amount of

We to ?

fixed

quoted

+ * user-space data touched by the kernel in non-preemptible context so
+ * we do not introduce long scheduler latencies.
+ * This allows one copy of up to 4096 bytes, and 15 operations touching
+ * 8 bytes each.
+ * This limit is applied to the sum of length specified for all
+ * operations in a vector.
+ */
+#define CPU_OP_VEC_DATA_LEN_MAX		(4096 + 15*8)

Magic numbers. Please use proper defines for heavens sake.

ok, it becomes:

#define CPU_OP_MEMCPY_EXPECT_LEN        4096
#define CPU_OP_EXPECT_LEN               8
#define CPU_OP_VEC_DATA_LEN_MAX         \
        (CPU_OP_MEMCPY_EXPECT_LEN +     \
         (CPU_OP_VEC_LEN_MAX - 1) * CPU_OP_EXPECT_LEN)

quoted

+#define CPU_OP_MAX_PAGES		4	/* Max. pages per op. */

Actually, I'll move the CPU_OP_MAX_PAGES define to cpu_opv.c. It's not
needed in the uapi header.

quoted

+
+enum cpu_op_type {
+	CPU_COMPARE_EQ_OP,	/* compare */
+	CPU_COMPARE_NE_OP,	/* compare */
+	CPU_MEMCPY_OP,		/* memcpy */
+	CPU_ADD_OP,		/* arithmetic */
+	CPU_OR_OP,		/* bitwise */
+	CPU_AND_OP,		/* bitwise */
+	CPU_XOR_OP,		/* bitwise */
+	CPU_LSHIFT_OP,		/* shift */
+	CPU_RSHIFT_OP,		/* shift */
+	CPU_MB_OP,		/* memory barrier */
+};
+
+/* Vector of operations to perform. Limited to 16. */
+struct cpu_op {
+	int32_t op;	/* enum cpu_op_type. */
+	uint32_t len;	/* data length, in bytes. */

Please get rid of these tail comments

ok

quoted

+	union {
+#define TMP_BUFLEN			64
+#define NR_PINNED_PAGES_ON_STACK	8

8 pinned pages on stack? Which stack?

The common cases need to touch few pages, and we can keep the
pointers in an array on the kernel stack within the cpu_opv system
call.

Updating to:

/*
 * Typical invocation of cpu_opv need few pages. Keep struct page
 * pointers in an array on the stack of the cpu_opv system call up to
 * this limit, beyond which the array is dynamically allocated.
 */
#define NR_PIN_PAGES_ON_STACK        8

quoted

+/*
+ * The cpu_opv system call executes a vector of operations on behalf of
+ * user-space on a specific CPU with preemption disabled. It is inspired
+ * from readv() and writev() system calls which take a "struct iovec"

s/from/by/

ok

quoted

+ * array as argument.
+ *
+ * The operations available are: comparison, memcpy, add, or, and, xor,
+ * left shift, and right shift. The system call receives a CPU number
+ * from user-space as argument, which is the CPU on which those
+ * operations need to be performed. All preparation steps such as
+ * loading pointers, and applying offsets to arrays, need to be
+ * performed by user-space before invoking the system call. The

loading pointers and applying offsets? That makes no sense.

Updating to:

 * All preparation steps such as
 * loading base pointers, and adding offsets derived from the current
 * CPU number, need to be performed by user-space before invoking the
 * system call.

quoted

+ * "comparison" operation can be used to check that the data used in the
+ * preparation step did not change between preparation of system call
+ * inputs and operation execution within the preempt-off critical
+ * section.
+ *
+ * The reason why we require all pointer offsets to be calculated by
+ * user-space beforehand is because we need to use get_user_pages_fast()
+ * to first pin all pages touched by each operation. This takes care of

That doesnt explain it either.

What kind of explication are you looking for here ? Perhaps being too close
to the implementation prevents me from understanding what is unclear from
your perspective.

quoted

+ * faulting-in the pages.  Then, preemption is disabled, and the
+ * operations are performed atomically with respect to other thread
+ * execution on that CPU, without generating any page fault.
+ *
+ * A maximum limit of 16 operations per cpu_opv syscall invocation is
+ * enforced, and a overall maximum length sum, so user-space cannot
+ * generate a too long preempt-off critical section. Each operation is
+ * also limited a length of PAGE_SIZE bytes, meaning that an operation
+ * can touch a maximum of 4 pages (memcpy: 2 pages for source, 2 pages
+ * for destination if addresses are not aligned on page boundaries).

Sorry, that paragraph was unclear. Updated:

 * An overall maximum of 4216 bytes in enforced on the sum of operation
 * length within an operation vector, so user-space cannot generate a
 * too long preempt-off critical section (cache cold critical section
 * duration measured as 4.7µs on x86-64). Each operation is also limited
 * a length of PAGE_SIZE bytes, meaning that an operation can touch a
 * maximum of 4 pages (memcpy: 2 pages for source, 2 pages for
 * destination if addresses are not aligned on page boundaries).

What's the critical section duration for operations which go to the limits
of this on a average x86 64 machine?

When cache-cold, I measure 4.7 µs per critical section doing a
4k memcpy and 15 * 8 bytes memcpy on a E5-2630 v3 @2.4GHz. Is it an
acceptable preempt-off latency for RT ?

quoted

+ * If the thread is not running on the requested CPU, a new
+ * push_task_to_cpu() is invoked to migrate the task to the requested

new push_task_to_cpu()? Once that patch is merged push_task_to_cpu() is
hardly new.

Please refrain from putting function level details into comments which
describe the concept. The function name might change in 3 month from now
and the comment will stay stale, Its sufficient to say:

* If the thread is not running on the requested CPU it is migrated
* to it.

That explains the concept. It's completely irrelevant which mechanism is
used to achieve that.

quoted

+ * CPU.  If the requested CPU is not part of the cpus allowed mask of
+ * the thread, the system call fails with EINVAL. After the migration
+ * has been performed, preemption is disabled, and the current CPU
+ * number is checked again and compared to the requested CPU number. If
+ * it still differs, it means the scheduler migrated us away from that
+ * CPU. Return EAGAIN to user-space in that case, and let user-space
+ * retry (either requesting the same CPU number, or a different one,
+ * depending on the user-space algorithm constraints).

This mixture of imperative and impersonated mood is really hard to read.

This whole paragraph is replaced by:

 * If the thread is not running on the requested CPU, it is migrated to
 * it. If the scheduler then migrates the task away from the requested CPU
 * before the critical section executes, return EAGAIN to user-space,
 * and let user-space retry (either requesting the same CPU number, or a
 * different one, depending on the user-space algorithm constraints).

quoted

+/*
+ * Check operation types and length parameters.
+ */
+static int cpu_opv_check(struct cpu_op *cpuop, int cpuopcnt)
+{
+	int i;
+	uint32_t sum = 0;
+
+	for (i = 0; i < cpuopcnt; i++) {
+		struct cpu_op *op = &cpuop[i];
+
+		switch (op->op) {
+		case CPU_MB_OP:
+			break;
+		default:
+			sum += op->len;
+		}

Please separate the switch cases with an empty line.

ok

quoted

+static unsigned long cpu_op_range_nr_pages(unsigned long addr,
+		unsigned long len)

Please align the arguments

static unsigned long cpu_op_range_nr_pages(unsigned long addr,
				   unsigned long len)

is way simpler to parse. All over the place please.

ok

quoted

+{
+	return ((addr + len - 1) >> PAGE_SHIFT) - (addr >> PAGE_SHIFT) + 1;

I'm surprised that there is no existing magic for this.

populate_vma_page_range() does:

unsigned long nr_pages = (end - start) / PAGE_SIZE;

where "start" and "end" need to fall onto a page boundary. It does not
seem to be appropriate for cases where addr is not page aligned, and where
the length is smaller than a page.

I have not seen helpers for this neither.

quoted

+}
+
+static int cpu_op_check_page(struct page *page)
+{
+	struct address_space *mapping;
+
+	if (is_zone_device_page(page))
+		return -EFAULT;
+	page = compound_head(page);
+	mapping = READ_ONCE(page->mapping);
+	if (!mapping) {
+		int shmem_swizzled;
+
+		/*
+		 * Check again with page lock held to guard against
+		 * memory pressure making shmem_writepage move the page
+		 * from filecache to swapcache.
+		 */
+		lock_page(page);
+		shmem_swizzled = PageSwapCache(page) || page->mapping;
+		unlock_page(page);
+		if (shmem_swizzled)
+			return -EAGAIN;
+		return -EFAULT;
+	}
+	return 0;
+}
+
+/*
+ * Refusing device pages, the zero page, pages in the gate area, and
+ * special mappings. Inspired from futex.c checks.

That comment should be on the function above, because this loop does not
much checking. Aside of that a more elaborate explanation how those checks
are done and how that works would be appreciated.

OK. I also noticed through testing that I missed faulting in pages, similarly
to sys_futex(). I'll add it, and I'm also adding a test in the selftests
for this case.

I'll import comments from futex.c.

quoted

+ */
+static int cpu_op_check_pages(struct page **pages,
+		unsigned long nr_pages)
+{
+	unsigned long i;
+
+	for (i = 0; i < nr_pages; i++) {
+		int ret;
+
+		ret = cpu_op_check_page(pages[i]);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
+static int cpu_op_pin_pages(unsigned long addr, unsigned long len,
+		struct cpu_opv_pinned_pages *pin_pages, int write)
+{
+	struct page *pages[2];
+	int ret, nr_pages;
+
+	if (!len)
+		return 0;
+	nr_pages = cpu_op_range_nr_pages(addr, len);
+	BUG_ON(nr_pages > 2);

If that happens then you can emit a warning and return a proper error
code. BUG() is the last resort if there is no way to recover. This really
does not qualify.

ok. will do:

        nr_pages = cpu_op_range_nr_pages(addr, len);
        if (nr_pages > 2) {
                WARN_ON(1);
                return -EINVAL;
        }

quoted

+	if (!pin_pages->is_kmalloc && pin_pages->nr + nr_pages
+			> NR_PINNED_PAGES_ON_STACK) {

Now I see what this is used for. That's a complete misnomer.

And this check is of course completely self explaining..... NOT!

quoted

+		struct page **pinned_pages =
+			kzalloc(CPU_OP_VEC_LEN_MAX * CPU_OP_MAX_PAGES
+				* sizeof(struct page *), GFP_KERNEL);
+		if (!pinned_pages)
+			return -ENOMEM;
+		memcpy(pinned_pages, pin_pages->pages,
+			pin_pages->nr * sizeof(struct page *));
+		pin_pages->pages = pinned_pages;
+		pin_pages->is_kmalloc = true;

I have no idea why this needs to be done here and cannot be done in a
preparation step. That's horrible. You allocate conditionally at some
random place and then free at the end of the syscall.

What's wrong with:

      prepare_stuff()
      pin_pages()
      do_ops()
      cleanup_stuff()

Hmm?

Will do.

quoted

+	}
+again:
+	ret = get_user_pages_fast(addr, nr_pages, write, pages);
+	if (ret < nr_pages) {
+		if (ret > 0)
+			put_page(pages[0]);
+		return -EFAULT;
+	}
+	/*
+	 * Refuse device pages, the zero page, pages in the gate area,
+	 * and special mappings.

So the same comment again. Really helpful.

I'll remove this duplicated comment.

quoted

+	 */
+	ret = cpu_op_check_pages(pages, nr_pages);
+	if (ret == -EAGAIN) {
+		put_page(pages[0]);
+		if (nr_pages > 1)
+			put_page(pages[1]);
+		goto again;
+	}

So why can't you propagate EAGAIN to the caller and use the error cleanup
label?

Because it needs to retry immediately in case the page has been faulted in,
or is being swapped in.

Or put the sequence of get_user_pages_fast() and check_pages() into
one function and confine the mess there instead of having the same cleanup
sequence 3 times in this function.

I'll merge all this into a single error path.

quoted

+	if (ret)
+		goto error;
+	pin_pages->pages[pin_pages->nr++] = pages[0];
+	if (nr_pages > 1)
+		pin_pages->pages[pin_pages->nr++] = pages[1];
+	return 0;
+
+error:
+	put_page(pages[0]);
+	if (nr_pages > 1)
+		put_page(pages[1]);
+	return -EFAULT;
+}

Updated function:

static int cpu_op_pin_pages(unsigned long addr, unsigned long len,
                            struct cpu_opv_pin_pages *pin_pages,
                            int write)
{
        struct page *pages[2];
        int ret, nr_pages, nr_put_pages, n;

        nr_pages = cpu_op_count_pages(addr, len);
        if (!nr_pages)
                return 0;
again:
        ret = get_user_pages_fast(addr, nr_pages, write, pages);
        if (ret < nr_pages) {
                if (ret >= 0) {
                        nr_put_pages = ret;
                        ret = -EFAULT;
                } else {
                        nr_put_pages = 0;
                }
                goto error;
        }
        ret = cpu_op_check_pages(pages, nr_pages, addr);
        if (ret) {
                nr_put_pages = nr_pages;
                goto error;
        }
        for (n = 0; n < nr_pages; n++)
                pin_pages->pages[pin_pages->nr++] = pages[n];
        return 0;

error:
        for (n = 0; n < nr_put_pages; n++)
                put_page(pages[n]);
        /*
         * Retry if a page has been faulted in, or is being swapped in.
         */
        if (ret == -EAGAIN)
                goto again;
        return ret;
}

quoted

+
+static int cpu_opv_pin_pages(struct cpu_op *cpuop, int cpuopcnt,
+		struct cpu_opv_pinned_pages *pin_pages)
+{
+	int ret, i;
+	bool expect_fault = false;
+
+	/* Check access, pin pages. */
+	for (i = 0; i < cpuopcnt; i++) {
+		struct cpu_op *op = &cpuop[i];
+
+		switch (op->op) {
+		case CPU_COMPARE_EQ_OP:
+		case CPU_COMPARE_NE_OP:
+			ret = -EFAULT;
+			expect_fault = op->u.compare_op.expect_fault_a;
+			if (!access_ok(VERIFY_READ,
+					(void __user *)op->u.compare_op.a,
+					op->len))
+				goto error;
+			ret = cpu_op_pin_pages(
+					(unsigned long)op->u.compare_op.a,
+					op->len, pin_pages, 0);

Bah, this sucks. Moving the switch() into a separate function spares you
one indentation level and all these horrible to read line breaks.

done

And again I really have to ask why all of this stuff needs to be type
casted for every invocation. If you use the proper type for the argument
and then do the cast at the function entry then you can spare all that hard
to read crap.

fixed for cpu_op_pin_pages. I don't control the type expected by access_ok()
though.

quoted

+error:
+	for (i = 0; i < pin_pages->nr; i++)
+		put_page(pin_pages->pages[i]);
+	pin_pages->nr = 0;

Sigh. Why can't you do that at the call site where you have exactly the
same thing?

Good point. fixed.

quoted

+	/*
+	 * If faulting access is expected, return EAGAIN to user-space.
+	 * It allows user-space to distinguish between a fault caused by
+	 * an access which is expect to fault (e.g. due to concurrent
+	 * unmapping of underlying memory) from an unexpected fault from
+	 * which a retry would not recover.
+	 */
+	if (ret == -EFAULT && expect_fault)
+		return -EAGAIN;
+	return ret;
+}
+
+/* Return 0 if same, > 0 if different, < 0 on error. */
+static int do_cpu_op_compare_iter(void __user *a, void __user *b, uint32_t len)
+{
+	char bufa[TMP_BUFLEN], bufb[TMP_BUFLEN];
+	uint32_t compared = 0;
+
+	while (compared != len) {
+		unsigned long to_compare;
+
+		to_compare = min_t(uint32_t, TMP_BUFLEN, len - compared);
+		if (__copy_from_user_inatomic(bufa, a + compared, to_compare))
+			return -EFAULT;
+		if (__copy_from_user_inatomic(bufb, b + compared, to_compare))
+			return -EFAULT;
+		if (memcmp(bufa, bufb, to_compare))
+			return 1;	/* different */

These tail comments are really crap. It's entirely obvious that if memcmp
!= 0 the result is different. So what is the exact value aside of making it
hard to read?

Removed.

quoted

+		compared += to_compare;
+	}
+	return 0;	/* same */

Ditto.

quoted

+}
+
+/* Return 0 if same, > 0 if different, < 0 on error. */
+static int do_cpu_op_compare(void __user *a, void __user *b, uint32_t len)
+{
+	int ret = -EFAULT;
+	union {
+		uint8_t _u8;
+		uint16_t _u16;
+		uint32_t _u32;
+		uint64_t _u64;
+#if (BITS_PER_LONG < 64)
+		uint32_t _u64_split[2];
+#endif
+	} tmp[2];

I've seen the same union before

quoted

+union op_fn_data {

......

Ah, yes. It's already declared! I should indeed use it :)

quoted

+
+	pagefault_disable();
+	switch (len) {
+	case 1:
+		if (__get_user(tmp[0]._u8, (uint8_t __user *)a))
+			goto end;
+		if (__get_user(tmp[1]._u8, (uint8_t __user *)b))
+			goto end;
+		ret = !!(tmp[0]._u8 != tmp[1]._u8);
+		break;
+	case 2:
+		if (__get_user(tmp[0]._u16, (uint16_t __user *)a))
+			goto end;
+		if (__get_user(tmp[1]._u16, (uint16_t __user *)b))
+			goto end;
+		ret = !!(tmp[0]._u16 != tmp[1]._u16);
+		break;
+	case 4:
+		if (__get_user(tmp[0]._u32, (uint32_t __user *)a))
+			goto end;
+		if (__get_user(tmp[1]._u32, (uint32_t __user *)b))
+			goto end;
+		ret = !!(tmp[0]._u32 != tmp[1]._u32);
+		break;
+	case 8:
+#if (BITS_PER_LONG >= 64)

We alredy prepare for 128 bit?

== it is then ;)

quoted

+		if (__get_user(tmp[0]._u64, (uint64_t __user *)a))
+			goto end;
+		if (__get_user(tmp[1]._u64, (uint64_t __user *)b))
+			goto end;
+#else
+		if (__get_user(tmp[0]._u64_split[0], (uint32_t __user *)a))
+			goto end;
+		if (__get_user(tmp[0]._u64_split[1], (uint32_t __user *)a + 1))
+			goto end;
+		if (__get_user(tmp[1]._u64_split[0], (uint32_t __user *)b))
+			goto end;
+		if (__get_user(tmp[1]._u64_split[1], (uint32_t __user *)b + 1))
+			goto end;
+#endif
+		ret = !!(tmp[0]._u64 != tmp[1]._u64);

This really sucks.

       union foo va, vb;

pagefault_disable();
switch (len) {
case 1:
case 2:
case 4:
case 8:
	va._u64 = _vb._u64 = 0;
	if (op_get_user(&va, a, len))
		goto out;
	if (op_get_user(&vb, b, len))
		goto out;
	ret = !!(va._u64 != vb._u64);
	break;
default:
	...

and have

int op_get_user(union foo *val, void *p, int len)
{
switch (len) {
case 1:
     ......

And do the magic once in that function then you spare that copy and pasted
code above. It can be reused in the other ops as well and reduces the amount
of copy and paste code significantly.

Good point! done.

quoted

+		break;
+	default:
+		pagefault_enable();
+		return do_cpu_op_compare_iter(a, b, len);
+	}
+end:
+	pagefault_enable();
+	return ret;
+}

quoted

+static int __do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt)
+{
+	int i, ret;
+
+	for (i = 0; i < cpuopcnt; i++) {
+		struct cpu_op *op = &cpuop[i];
+
+		/* Guarantee a compiler barrier between each operation. */
+		barrier();
+
+		switch (op->op) {
+		case CPU_COMPARE_EQ_OP:
+			ret = do_cpu_op_compare(
+					(void __user *)op->u.compare_op.a,
+					(void __user *)op->u.compare_op.b,
+					op->len);

I think you know by now how to spare an indentation level and type casts.

done

quoted

+static int do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt, int cpu)
+{
+	int ret;
+
+	if (cpu != raw_smp_processor_id()) {
+		ret = push_task_to_cpu(current, cpu);
+		if (ret)
+			goto check_online;
+	}
+	preempt_disable();
+	if (cpu != smp_processor_id()) {
+		ret = -EAGAIN;

This is weird. When the above raw_smp_processor_id() check fails you push,
but here you return. Not really consistent behaviour.

Good point. We could re-try internally rather than let user-space
deal with an EAGAIN. It will make the error checking easier in
user-space.

quoted

+		goto end;
+	}
+	ret = __do_cpu_opv(cpuop, cpuopcnt);
+end:
+	preempt_enable();
+	return ret;
+
+check_online:
+	if (!cpu_possible(cpu))
+		return -EINVAL;
+	get_online_cpus();
+	if (cpu_online(cpu)) {
+		ret = -EAGAIN;
+		goto put_online_cpus;
+	}
+	/*
+	 * CPU is offline. Perform operation from the current CPU with
+	 * cpu_online read lock held, preventing that CPU from coming online,
+	 * and with mutex held, providing mutual exclusion against other
+	 * CPUs also finding out about an offline CPU.
+	 */

That's not mentioned in the comment at the top IIRC.

Updated.

quoted

+	mutex_lock(&cpu_opv_offline_lock);
+	ret = __do_cpu_opv(cpuop, cpuopcnt);
+	mutex_unlock(&cpu_opv_offline_lock);
+put_online_cpus:
+	put_online_cpus();
+	return ret;
+}
+
+/*
+ * cpu_opv - execute operation vector on a given CPU with preempt off.
+ *
+ * Userspace should pass current CPU number as parameter. May fail with
+ * -EAGAIN if currently executing on the wrong CPU.
+ */
+SYSCALL_DEFINE4(cpu_opv, struct cpu_op __user *, ucpuopv, int, cpuopcnt,
+		int, cpu, int, flags)
+{
+	struct cpu_op cpuopv[CPU_OP_VEC_LEN_MAX];
+	struct page *pinned_pages_on_stack[NR_PINNED_PAGES_ON_STACK];

Oh well.... Naming sucks.

quoted

+	struct cpu_opv_pinned_pages pin_pages = {
+		.pages = pinned_pages_on_stack,
+		.nr = 0,
+		.is_kmalloc = false,
+	};
+	int ret, i;
+
+	if (unlikely(flags))
+		return -EINVAL;
+	if (unlikely(cpu < 0))
+		return -EINVAL;
+	if (cpuopcnt < 0 || cpuopcnt > CPU_OP_VEC_LEN_MAX)
+		return -EINVAL;
+	if (copy_from_user(cpuopv, ucpuopv, cpuopcnt * sizeof(struct cpu_op)))
+		return -EFAULT;
+	ret = cpu_opv_check(cpuopv, cpuopcnt);

AFAICT you can calculate the number of pages already in the check and then
do that allocation before pinning the pages.

will do.

quoted

+	if (ret)
+		return ret;
+	ret = cpu_opv_pin_pages(cpuopv, cpuopcnt, &pin_pages);
+	if (ret)
+		goto end;
+	ret = do_cpu_opv(cpuopv, cpuopcnt, cpu);
+	for (i = 0; i < pin_pages.nr; i++)
+		put_page(pin_pages.pages[i]);
+end:
+	if (pin_pages.is_kmalloc)
+		kfree(pin_pages.pages);
+	return ret;
+}

quoted

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6bba05f47e51..e547f93a46c2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c

@@ -1052,6 +1052,43 @@ void do_set_cpus_allowed(struct task_struct *p, const

struct cpumask *new_mask)
 		set_curr_task(rq, p);
 }

This is NOT part of this functionality. It's a prerequisite and wants to be
in a separate patch. And I'm dead tired by now so I leave that thing for
tomorrow or for Peter.

I'll split that part into a separate patch.

Thanks!

Mathieu

Thanks,

	tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help