[PATCH V2 1/6] mm: Introduce a general RCU get_user_pages_fast.

From: Steve Capper <hidden>
Date: 2014-08-27 12:50:38
Also in: linux-arch, linux-mm

On Wed, Aug 27, 2014 at 09:54:42AM +0100, Will Deacon wrote:

Hi Steve,

Hey Will,

A few minor comments (took me a while to understand how this works, so I
thought I'd make some noise :)

A big thank you for reading through it :-).

On Thu, Aug 21, 2014 at 04:43:27PM +0100, Steve Capper wrote:

get_user_pages_fast attempts to pin user pages by walking the page
tables directly and avoids taking locks. Thus the walker needs to be
protected from page table pages being freed from under it, and needs
to block any THP splits.

One way to achieve this is to have the walker disable interrupts, and
rely on IPIs from the TLB flushing code blocking before the page table
pages are freed.

On some platforms we have hardware broadcast of TLB invalidations, thus
the TLB flushing code doesn't necessarily need to broadcast IPIs; and
spuriously broadcasting IPIs can hurt system performance if done too
often.

This problem has been solved on PowerPC and Sparc by batching up page
table pages belonging to more than one mm_user, then scheduling an
rcu_sched callback to free the pages. This RCU page table free logic
has been promoted to core code and is activated when one enables
HAVE_RCU_TABLE_FREE. Unfortunately, these architectures implement
their own get_user_pages_fast routines.

The RCU page table free logic, coupled with an IPI broadcast on THP
split (which is a rare event), allows one to protect a page table
walker by merely disabling the interrupts during the walk.

Disabling interrupts isn't completely free (it's a self-synchronising
operation on ARM). It would be interesting to see if your futex workload
performance is improved by my simple irq_save optimisation for ARM:

  https://git.kernel.org/cgit/linux/kernel/git/will/linux.git/commit/?h=misc-patches&id=312a70adfa6f22e9d62803dd21400f481253e58b

(I've been struggling to show anything other than tiny improvements from
that patch).

This looks like a useful optimisation; I'll have a think about workloads that
fire many futexes on THP tails. (The test I used only fired off one futex).

This patch provides a general RCU implementation of get_user_pages_fast
that can be used by architectures that perform hardware broadcast of
TLB invalidations.

It is based heavily on the PowerPC implementation by Nick Piggin.

[...]

diff --git a/mm/gup.c b/mm/gup.c
index 91d044b..2f684fa 100644
--- a/mm/gup.c
+++ b/mm/gup.c

@@ -10,6 +10,10 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 
+#include <linux/sched.h>
+#include <linux/rwsem.h>
+#include <asm/pgtable.h>
+
 #include "internal.h"
 
 static struct page *no_page_table(struct vm_area_struct *vma,

@@ -672,3 +676,277 @@ struct page *get_dump_page(unsigned long addr)
 	return page;
 }
 #endif /* CONFIG_ELF_CORE */
+
+#ifdef CONFIG_HAVE_RCU_GUP
+
+#ifdef __HAVE_ARCH_PTE_SPECIAL

Do we actually require this (pte special) if hugepages are disabled or
not supported?

We need this logic if we want to use fast_gup on normal pages safely. The special
bit indicates that we should not attempt to take a reference to the underlying
page.

Huge pages are guaranteed not to be special.
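
To make the role of the special bit concrete, here is a minimal userspace
sketch; it is not the code from this patch. The names mock_page, mock_pte,
MOCK_PTE_PRESENT, MOCK_PTE_SPECIAL and mock_gup_pte_range are hypothetical
stand-ins for the kernel's types and helpers. The sketch assumes the walker
simply stops at the first entry it cannot pin and leaves the remainder to a
slower path.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel definitions; illustration only. */
#define MOCK_PTE_PRESENT (1u << 0)
#define MOCK_PTE_SPECIAL (1u << 1)

struct mock_page {
	int refcount;
};

struct mock_pte {
	unsigned int flags;
	struct mock_page *page;
};

static bool mock_pte_present(const struct mock_pte *pte)
{
	return (pte->flags & MOCK_PTE_PRESENT) != 0;
}

static bool mock_pte_special(const struct mock_pte *pte)
{
	return (pte->flags & MOCK_PTE_SPECIAL) != 0;
}

/*
 * Walk n entries and take a reference on the page behind each present,
 * non-special entry. Stop at the first entry that must not be pinned
 * and return how many pages were pinned before that point.
 */
static int mock_gup_pte_range(struct mock_pte *ptes, size_t n,
			      struct mock_page **pages)
{
	int nr = 0;

	for (size_t i = 0; i < n; i++) {
		/* A special entry must never have a reference taken on it. */
		if (!mock_pte_present(&ptes[i]) || mock_pte_special(&ptes[i]))
			break;

		ptes[i].page->refcount++;
		pages[nr++] = ptes[i].page;
	}

	return nr;
}
```

With three present entries where only the second is special, this sketch
pins the first page and then stops, so no reference count behind the special
entry is ever touched.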

Cheers,
-- 
Steve
