[PATCH V2 1/6] mm: Introduce a general RCU get_user_pages_fast.
From: Steve Capper <hidden>
Date: 2014-08-27 12:50:38
Also in:
linux-arch, linux-mm
On Wed, Aug 27, 2014 at 09:54:42AM +0100, Will Deacon wrote:
Hi Steve,
Hey Will,
A few minor comments (took me a while to understand how this works, so I thought I'd make some noise :)
A big thank you for reading through it :-).
On Thu, Aug 21, 2014 at 04:43:27PM +0100, Steve Capper wrote:
get_user_pages_fast attempts to pin user pages by walking the page tables directly and avoids taking locks. Thus the walker needs to be protected from page table pages being freed from under it, and needs to block any THP splits.

One way to achieve this is to have the walker disable interrupts, and rely on IPIs from the TLB flushing code blocking before the page table pages are freed.

On some platforms we have hardware broadcast of TLB invalidations, thus the TLB flushing code doesn't necessarily need to broadcast IPIs; and spuriously broadcasting IPIs can hurt system performance if done too often.

This problem has been solved on PowerPC and Sparc by batching up page table pages belonging to more than one mm_user, then scheduling an rcu_sched callback to free the pages. This RCU page table free logic has been promoted to core code and is activated when one enables HAVE_RCU_TABLE_FREE. Unfortunately, these architectures implement their own get_user_pages_fast routines.

The RCU page table free logic, coupled with an IPI broadcast on THP split (which is a rare event), allows one to protect a page table walker by merely disabling the interrupts during the walk.

Disabling interrupts isn't completely free (it's a self-synchronising operation on ARM). It would be interesting to see if your futex workload performance is improved by my simple irq_save optimisation for ARM:

https://git.kernel.org/cgit/linux/kernel/git/will/linux.git/commit/?h=misc-patches&id=312a70adfa6f22e9d62803dd21400f481253e58b

(I've been struggling to show anything other than tiny improvements from that patch).
This looks like a useful optimisation; I'll have a think about workloads that fire many futexes on THP tails. (The test I used only fired off one futex).
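For readers following the scheme described above, here is a minimal single-threaded userspace sketch of the idea that a page table page is only freed once no walker is inside its interrupts-disabled section. It is a toy model under stated assumptions, not the kernel implementation: struct toy_table, walk_begin(), walk_end(), table_free_deferred() and try_grace_period() are all hypothetical names, and a plain counter stands in for "interrupts are disabled on some CPU".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: every name below is hypothetical, not a kernel API. */
#define MAX_PENDING 8

struct toy_table {
	bool freed;
};

/* Stands in for "some CPU is walking with interrupts disabled". */
static int walkers_active;

/* Tables queued for deferred freeing (models the batched RCU free). */
static struct toy_table *pending[MAX_PENDING];
static size_t npending;

/* Models local_irq_save() at the start of the lockless walk. */
static void walk_begin(void)
{
	walkers_active++;
}

/* Models local_irq_restore() at the end of the walk. */
static void walk_end(void)
{
	assert(walkers_active > 0);
	walkers_active--;
}

/* Queue a table instead of freeing it at once; false if the batch is full. */
static bool table_free_deferred(struct toy_table *t)
{
	if (npending == MAX_PENDING)
		return false;
	pending[npending++] = t;
	return true;
}

/*
 * Free the queued tables only if no walker is active, i.e. every
 * "CPU" has left its interrupts-disabled section. Returns the number
 * of tables freed.
 */
static size_t try_grace_period(void)
{
	size_t n = npending;

	if (walkers_active > 0)
		return 0;
	for (size_t i = 0; i < n; i++)
		pending[i]->freed = true;
	npending = 0;
	return n;
}
```

In this model, a table queued while a walk is in progress survives any call to try_grace_period() until walk_end() has run, which is the property the walker relies on.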
This patch provides a general RCU implementation of get_user_pages_fast that can be used by architectures that perform hardware broadcast of TLB invalidations. It is based heavily on the PowerPC implementation by Nick Piggin.

[...]
diff --git a/mm/gup.c b/mm/gup.c
index 91d044b..2f684fa 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -10,6 +10,10 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 
+#include <linux/sched.h>
+#include <linux/rwsem.h>
+#include <asm/pgtable.h>
+
 #include "internal.h"
 
 static struct page *no_page_table(struct vm_area_struct *vma,
@@ -672,3 +676,277 @@ struct page *get_dump_page(unsigned long addr)
 	return page;
 }
 #endif /* CONFIG_ELF_CORE */
+
+#ifdef CONFIG_HAVE_RCU_GUP
+
+#ifdef __HAVE_ARCH_PTE_SPECIAL

Do we actually require this (pte special) if hugepages are disabled or not supported?
We need this logic if we want to use fast_gup on normal pages safely. The special bit indicates that we should not attempt to take a reference to the underlying page. Huge pages are guaranteed not to be special.

Cheers,
-- 
Steve
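
The special-bit rule above can be sketched in userspace as follows. This is a toy model under stated assumptions, not the code from the patch: the TOY_PTE_* bit layout, struct toy_page and toy_fast_pin() are hypothetical names invented for illustration. The point is only that a special entry makes the fast path give up without touching a reference count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit layout, for illustration only. */
#define TOY_PTE_PRESENT (1u << 0)
#define TOY_PTE_WRITE   (1u << 1)
#define TOY_PTE_SPECIAL (1u << 2)

struct toy_page {
	int refcount;
};

/*
 * Try to pin the page behind a toy PTE on the fast path.
 * Returns true and takes a reference on success. Returns false,
 * leaving the refcount untouched, when the caller would have to
 * fall back to the slow path.
 */
static bool toy_fast_pin(uint32_t pte, bool write, struct toy_page *page)
{
	assert(page != NULL);

	if (!(pte & TOY_PTE_PRESENT))
		return false;
	/* Special entries must not have a reference taken on them. */
	if (pte & TOY_PTE_SPECIAL)
		return false;
	/* A write pin needs a writable mapping. */
	if (write && !(pte & TOY_PTE_WRITE))
		return false;

	page->refcount++;
	return true;
}
```

A present, non-special entry is pinned and its count goes up by one; the same entry with the special bit set is rejected and the count is left alone.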