Thread (56 messages) 56 messages, 7 authors, 2011-10-17

Re: [PATCH] thp: tail page refcounting fix #3

From: Michel Lespinasse <hidden>
Date: 2011-08-27 09:42:06
Also in: lkml

On Fri, Aug 26, 2011 at 08:54:36PM +0200, Andrea Arcangeli wrote:
Subject: thp: tail page refcounting fix

From: Andrea Arcangeli <redacted>

Michel while working on the working set estimation code, noticed that calling
get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't safe, if the
pfn ended up being a tail page of a transparent hugepage under splitting by
__split_huge_page_refcount(). He then found the problem could also
theoretically materialize with page_cache_get_speculative() during the
speculative radix tree lookups that uses get_page_unless_zero() in SMP if the
radix tree page is freed and reallocated and get_user_pages is called on it
before page_cache_get_speculative has a chance to call get_page_unless_zero().

So the best way to fix the problem is to keep page_tail->_count zero at all
times. This will guarantee that get_page_unless_zero() can never succeed on any
tail page. page_tail->_mapcount is guaranteed zero and is unused for all tail
pages of a compound page, so we can simply account the tail page references
there and transfer them to tail_page->_count in __split_huge_page_refcount() (in
addition to the head_page->_mapcount).

While debugging this s/_count/_mapcount/ change I also noticed get_page is
called by direct-io.c on pages returned by get_user_pages. That wasn't entirely
safe because the two atomic_inc in get_page weren't atomic. As opposed other
get_user_page users like secondary-MMU page fault to establish the shadow
pagetables would never call any superflous get_page after get_user_page
returns. It's safer to make get_page universally safe for tail pages and to use
get_page_foll() within follow_page (inside get_user_pages()). get_page_foll()
is safe to do the refcounting for tail pages without taking any locks because
it is run within PT lock protected critical sections (PT lock for pte and
page_table_lock for pmd_trans_huge). The standard get_page() as invoked by
direct-io instead will now take the compound_lock but still only for tail
pages. The direct-io paths are usually I/O bound and the compound_lock is per
THP so very finegrined, so there's no risk of scalability issues with it. A
simple direct-io benchmarks with all lockdep prove locking and spinlock
debugging infrastructure enabled shows identical performance and no overhead.
So it's worth it. Ideally direct-io should stop calling get_page() on pages
returned by get_user_pages(). The spinlock in get_page() is already optimized
away for no-THP builds but doing get_page() on tail pages returned by GUP is
generally a rare operation and usually only run in I/O paths.

This new refcounting on page_tail->_mapcount in addition to avoiding new RCU
critical sections will also allow the working set estimation code to work
without any further complexity associated to the tail page refcounting
with THP.

Signed-off-by: Andrea Arcangeli <redacted>
Reported-by: Michel Lespinasse <redacted>
Looks great !

I understand you may have to remove the VM_BUG_ON(page_mapcount(page) <= 0)
that I had suggested in __get_page_tail() (sorry about that).

My only additional suggestion is about the put_page_testzero in
__get_page_tail(), maybe if you could just increment the tail page count
instead of calling __get_page_tail_foll(), then you wouldn't have to
release the extra head page count there. And it would even look kinda
natural, head page count gets acquired before compound_lock_irqsave(),
so we only have to acquire an extra tail page count after confirming
this is still a tail page.

Either way, the code looks OK by now.

Reviewed-by: Michel Lespinasse <redacted>

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help