Thread: 27 messages, 7 authors, 2023-12-04

Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

From: Ryan Roberts <ryan.roberts@arm.com>
Date: 2023-12-04 11:11:37
Also in: linux-mm, lkml

On 03/12/2023 13:33, Christophe Leroy wrote:

Le 30/11/2023 à 22:30, Peter Xu a écrit :
On Fri, Nov 24, 2023 at 11:07:51AM -0500, Peter Xu wrote:
On Fri, Nov 24, 2023 at 09:06:01AM +0000, Ryan Roberts wrote:
I don't have any micro-benchmarks for GUP though, if that's your question. Is
there an easy-to-use test I can run to get some numbers? I'd be happy to try it out.
Thanks Ryan.  Then nothing needs to be tested if gup is not yet touched
from your side, afaict.  I'll see whether I can provide some rough numbers
instead in the next post (I'll probably only be able to test it in a VM,
though, but hopefully that should still reflect mostly the truth).
An update: I finished a round of 64K cont_pte tests; in the slow gup micro
benchmark I see a ~15% perf degradation with this patchset applied on a VM on
top of Apple M1.

Frankly that's even less than I expected, considering not only how slow gup
THP used to be, but also the fact that this is a tight loop over slow
gup, which in normal cases shouldn't happen: "present" ptes normally go
to fast-gup, while !present goes into a fault following it.  I assume
that's why nobody cared about slow gup for THP before.  I think adding cont_pte
support shouldn't be very hard, but that will include making the cont_pte idea
global just for arm64 and riscv Svnapot.
Is there any documentation on what cont_pte is? I have always wondered
if it could also fit powerpc 8xx needs?
pte_cont() (and pte_mkcont() and pte_mknoncont()) test and manipulate the
"contiguous bit" in the arm64 PTE entries. Those helpers are arm64-specific
(AFAIK). The contiguous bit is a hint to the HW to tell it that a block of PTEs
are mapping a physically contiguous and naturally aligned piece of memory. The
HW can use this to coalesce entries in the TLB. When using 4K base pages, the
contpte size is 64K (16 PTEs). For 16K base pages, it's 2M (128 PTEs) and for 64K
base pages, it's 2M (32 PTEs).
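
As a rough illustration of the sizes above (a minimal sketch, not kernel code;
contpte_block_bytes() and contpte_naturally_aligned() are hypothetical helper
names made up for this example), the block size is just the base page size
times the number of PTEs, and "naturally aligned" means the block starts on a
multiple of its own size:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes covered by one contiguous block of PTEs. */
static inline uint64_t contpte_block_bytes(uint64_t base_page_bytes,
					   unsigned int nr_ptes)
{
	return base_page_bytes * nr_ptes;
}

/*
 * Hypothetical helper: a block is naturally aligned when its physical
 * start address is a multiple of the block size (a power of two).
 */
static inline int contpte_naturally_aligned(uint64_t phys_addr,
					    uint64_t block_bytes)
{
	assert(block_bytes != 0 && (block_bytes & (block_bytes - 1)) == 0);
	return (phys_addr & (block_bytes - 1)) == 0;
}
```

With these, contpte_block_bytes(4096, 16) gives 65536 (64K), and both
contpte_block_bytes(16384, 128) and contpte_block_bytes(65536, 32) give
2097152 (2M).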
On powerpc, for 16k pages, we have to define 4 consecutive PTEs. All 4
PTEs are flagged with the SPS bit telling it's a 16k page, but for TLB
misses the HW needs one entry for each 4k fragment.
From that description, it sounds like the SPS bit might be similar to the arm64
contiguous bit? Although it sounds like you are currently using it in a slightly
different way - telling the kernel that the base page is 16K but mapping each 16K
page with 4x 4K entries (plus the SPS bit set)?
There is also a similar approach for 512k pages, we have 128 contiguous 
identical PTEs for them.

And whatever PAGE_SIZE is (either 4k or 16k), the HW needs one 'unsigned
long' pte for each 4k fragment. So for the time being, when we define
PAGE_SIZE as 16k, we need a special pte_t which is a table of 4x
unsigned long.
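
A minimal sketch of that layout, for illustration only: the type name
pte16k_t, the helper make_pte16k() and the SPS_BIT value are placeholders
rather than the real powerpc definitions, and it assumes all four words carry
the same value:

```c
#include <assert.h>
#include <stddef.h>

#define FRAGS_PER_PAGE 4        /* 16k page / 4k fragment */
#define SPS_BIT        0x8UL    /* placeholder bit position, an assumption */

/* Hypothetical pte type: one unsigned long per 4k fragment. */
typedef struct {
	unsigned long frag[FRAGS_PER_PAGE];
} pte16k_t;

static_assert(sizeof(pte16k_t) == FRAGS_PER_PAGE * sizeof(unsigned long),
	      "one word per 4k fragment");

/* Fill every fragment entry with the same value, flagged with SPS. */
static inline pte16k_t make_pte16k(unsigned long val)
{
	pte16k_t pte;
	size_t i;

	for (i = 0; i < FRAGS_PER_PAGE; i++)
		pte.frag[i] = val | SPS_BIT;
	return pte;
}
```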

Wondering if the cont_pte concept is similar and whether it could help.
To be honest, while I understand pte_cont() and friends, I don't understand
their relevance (or at least potential future relevance) to GUP?
Thanks
Christophe