Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

[PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP · David Hildenbrand <hidden> · 2024-01-29
[PATCH v3 01/15] arm64/mm: Make set_ptes() robust when OAs cross 48-bit boundary · David Hildenbrand <hidden> · 2024-01-29
Re: [PATCH v3 01/15] arm64/mm: Make set_ptes() robust when OAs cross 48-bit boundary · Mike Rapoport <rppt@kernel.org> · 2024-02-08
Re: [PATCH v3 01/15] arm64/mm: Make set_ptes() robust when OAs cross 48-bit boundary · David Hildenbrand <hidden> · 2024-02-09
[PATCH v3 02/15] arm/pgtable: define PFN_PTE_SHIFT · David Hildenbrand <hidden> · 2024-01-29
Re: [PATCH v3 02/15] arm/pgtable: define PFN_PTE_SHIFT · Mike Rapoport <rppt@kernel.org> · 2024-02-08
[PATCH v3 03/15] nios2/pgtable: define PFN_PTE_SHIFT · David Hildenbrand <hidden> · 2024-01-29
Re: [PATCH v3 03/15] nios2/pgtable: define PFN_PTE_SHIFT · Mike Rapoport <rppt@kernel.org> · 2024-02-08
[PATCH v3 04/15] powerpc/pgtable: define PFN_PTE_SHIFT · David Hildenbrand <hidden> · 2024-01-29
Re: [PATCH v3 04/15] powerpc/pgtable: define PFN_PTE_SHIFT · Mike Rapoport <rppt@kernel.org> · 2024-02-08
[PATCH v3 05/15] riscv/pgtable: define PFN_PTE_SHIFT · David Hildenbrand <hidden> · 2024-01-29
Re: [PATCH v3 05/15] riscv/pgtable: define PFN_PTE_SHIFT · Mike Rapoport <rppt@kernel.org> · 2024-02-08
[PATCH v3 06/15] s390/pgtable: define PFN_PTE_SHIFT · David Hildenbrand <hidden> · 2024-01-29
Re: [PATCH v3 06/15] s390/pgtable: define PFN_PTE_SHIFT · Mike Rapoport <rppt@kernel.org> · 2024-02-08
[PATCH v3 07/15] sparc/pgtable: define PFN_PTE_SHIFT · David Hildenbrand <hidden> · 2024-01-29
Re: [PATCH v3 07/15] sparc/pgtable: define PFN_PTE_SHIFT · Mike Rapoport <rppt@kernel.org> · 2024-02-08
[PATCH v3 08/15] mm/pgtable: make pte_next_pfn() independent of set_ptes() · David Hildenbrand <hidden> · 2024-01-29
Re: [PATCH v3 08/15] mm/pgtable: make pte_next_pfn() independent of set_ptes() · Mike Rapoport <rppt@kernel.org> · 2024-02-08
[PATCH v3 09/15] arm/mm: use pte_next_pfn() in set_ptes() · David Hildenbrand <hidden> · 2024-01-29
Re: [PATCH v3 09/15] arm/mm: use pte_next_pfn() in set_ptes() · Mike Rapoport <rppt@kernel.org> · 2024-02-08
[PATCH v3 10/15] powerpc/mm: use pte_next_pfn() in set_ptes() · David Hildenbrand <hidden> · 2024-01-29
Re: [PATCH v3 10/15] powerpc/mm: use pte_next_pfn() in set_ptes() · Mike Rapoport <rppt@kernel.org> · 2024-02-08
[PATCH v3 11/15] mm/memory: factor out copying the actual PTE in copy_present_pte() · David Hildenbrand <hidden> · 2024-01-29
Re: [PATCH v3 11/15] mm/memory: factor out copying the actual PTE in copy_present_pte() · Mike Rapoport <rppt@kernel.org> · 2024-02-08
[PATCH v3 12/15] mm/memory: pass PTE to copy_present_pte() · David Hildenbrand <hidden> · 2024-01-29
Re: [PATCH v3 12/15] mm/memory: pass PTE to copy_present_pte() · Mike Rapoport <rppt@kernel.org> · 2024-02-08
Re: [PATCH v3 12/15] mm/memory: pass PTE to copy_present_pte() · David Hildenbrand <hidden> · 2024-02-14
[PATCH v3 13/15] mm/memory: optimize fork() with PTE-mapped THP · David Hildenbrand <hidden> · 2024-01-29
Re: [PATCH v3 13/15] mm/memory: optimize fork() with PTE-mapped THP · Mike Rapoport <rppt@kernel.org> · 2024-02-08
[PATCH v3 14/15] mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch() · David Hildenbrand <hidden> · 2024-01-29
[PATCH v3 15/15] mm/memory: ignore writable bit in folio_pte_batch() · David Hildenbrand <hidden> · 2024-01-29
Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP · Ryan Roberts <ryan.roberts@arm.com> · 2024-01-31
Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP · David Hildenbrand <hidden> · 2024-01-31
Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP · Ryan Roberts <ryan.roberts@arm.com> · 2024-01-31
Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP · David Hildenbrand <hidden> · 2024-01-31
Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP · Ryan Roberts <ryan.roberts@arm.com> · 2024-01-31
Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP · Ryan Roberts <ryan.roberts@arm.com> · 2024-01-31
Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP · David Hildenbrand <hidden> · 2024-01-31
Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP · Ryan Roberts <ryan.roberts@arm.com> · 2024-01-31
Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP · David Hildenbrand <hidden> · 2024-01-31
Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP · Ryan Roberts <ryan.roberts@arm.com> · 2024-01-31
Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP · David Hildenbrand <hidden> · 2024-01-31
Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP · Ryan Roberts <ryan.roberts@arm.com> · 2024-01-31
Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP · David Hildenbrand <hidden> · 2024-01-31
Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP · Ryan Roberts <ryan.roberts@arm.com> · 2024-01-31
Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP · David Hildenbrand <hidden> · 2024-01-31
Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP · David Hildenbrand <hidden> · 2024-01-31
Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP · patchwork-bot+linux-riscv@kernel.org · 2024-03-25

From: Ryan Roberts <ryan.roberts@arm.com>
Date: 2024-01-31 15:03:02
Also in: linux-arm-kernel, linux-riscv, linux-s390, linuxppc-dev, lkml, sparclinux

On 31/01/2024 14:29, David Hildenbrand wrote:

quoted

Note that regarding NUMA effects, I mean when some memory access within the same
socket is faster/slower even with only a single node. On AMD EPYC that's
possible, depending on which core you are running and on which memory controller
the memory you want to access is located. If both are in different quadrants
IIUC, the access latency will be different.

I've configured the NUMA to only bring the RAM and CPUs for a single socket
online, so I shouldn't be seeing any of these effects. Anyway, I've been using
the Altra as a secondary because it's so much slower than the M2. Let me move
over to it and see if everything looks more straightforward there.

Better to use a system on which people will actually run Linux production
workloads, even if it is slower :)

[...]

quoted

I'll continue to mess around with it until the end of the day. But if I'm not
making any headway, then I'll change tack; I'll just measure the performance of
my contpte changes using your fork/zap stuff as the baseline and post based on
that.

You should likely not focus on M2 results. Just pick a representative bare metal
machine where you get consistent, explainable results.

Nothing in the code is fine-tuned for a particular architecture so far, only
order-0 handling is kept separate.

BTW: I see the exact same speedups for dontneed that I see for munmap. For
example, for order-9, it goes from 0.023412s -> 0.009785s, so -58%. So I'm
curious why you see a speedup for munmap but not for dontneed.

Ugh... ok, coming up.

Hopefully you were just staring at the wrong numbers (e.g., only with fork
patches). Because both (munmap/pte-dontneed) are using the exact same code path.

Ahh... I'm doing pte-dontneed, which is the only option in your original
benchmark - it does MADV_DONTNEED one page at a time. It looks like your new
benchmark has an additional "dontneed" option that does it in one shot. Which
option are you running? Assuming the latter, I think that explains it.
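
For readers following along, the difference between the two options can be
sketched roughly as below. This is a minimal illustration only, not the
benchmark discussed in the thread: the helper names (alloc_populated,
zap_per_page, zap_one_shot, range_is_zero) are hypothetical, and timing is left
out entirely. The per-page variant issues one madvise() call per base page, so
each call covers only a single PTE and presumably leaves nothing to batch within
a call; the one-shot variant hands the whole range to a single call.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map private anonymous memory and touch every byte. */
static char *alloc_populated(size_t size)
{
	char *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (mem == MAP_FAILED)
		return NULL;
	memset(mem, 1, size);
	return mem;
}

/* "pte-dontneed"-style (assumed): one MADV_DONTNEED call per base page. */
static int zap_per_page(char *mem, size_t size, size_t pagesize)
{
	assert(pagesize && size % pagesize == 0);
	for (size_t off = 0; off < size; off += pagesize)
		if (madvise(mem + off, pagesize, MADV_DONTNEED))
			return -1;
	return 0;
}

/* "dontneed"-style (assumed): one MADV_DONTNEED call for the whole range. */
static int zap_one_shot(char *mem, size_t size)
{
	return madvise(mem, size, MADV_DONTNEED);
}

/* Returns 1 if every byte in the range reads back as zero, else 0. */
static int range_is_zero(const char *mem, size_t size)
{
	for (size_t i = 0; i < size; i++)
		if (mem[i])
			return 0;
	return 1;
}
```

Both variants leave the range reading back as zero-filled pages on Linux; they
differ only in how many pages each madvise() call gets to process at once.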
