Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP
From: Ryan Roberts <ryan.roberts@arm.com>
Date: 2024-01-31 15:03:02
Also in:
linux-arm-kernel, linux-riscv, linux-s390, linuxppc-dev, lkml, sparclinux
On 31/01/2024 14:29, David Hildenbrand wrote:
>>> Note that regarding NUMA effects, I mean when some memory access within the same socket is faster/slower even with only a single node. On AMD EPYC that's possible, depending on which core you are running and on which memory controller the memory you want to access is located. If both are in different quadrants IIUC, the access latency will be different.
>>
>> I've configured the NUMA to only bring the RAM and CPUs for a single socket online, so I shouldn't be seeing any of these effects. Anyway, I've been using the Altra as a secondary because it's so much slower than the M2. Let me move over to it and see if everything looks more straightforward there.
>
> Better use a system where people will actually run Linux production workloads, even if it is slower :)
>
> [...]
>
>> I'll continue to mess around with it until the end of the day. But if I'm not making any headway, then I'll change tack; I'll just measure the performance of my contpte changes using your fork/zap stuff as the baseline and post based on that.
>
> You should likely not focus on M2 results. Just pick a representative bare metal machine where you get consistent, explainable results. Nothing in the code is fine-tuned for a particular architecture so far, only order-0 handling is kept separate.
>
>>> BTW: I see the exact same speedups for dontneed that I see for munmap. For example, for order-9, it goes from 0.023412s -> 0.009785, so -58%. So I'm curious why you see a speedup for munmap but not for dontneed.
>>
>> Ugh... ok, coming up.
>
> Hopefully you were just staring at the wrong numbers (e.g., only with fork patches). Because both (munmap/pte-dontneed) are using the exact same code path.

Ahh... I'm doing pte-dontneed, which is the only option in your original benchmark - it does MADV_DONTNEED one page at a time. It looks like your new benchmark has an additional "dontneed" option that does it in one shot. Which option are you running? Assuming the latter, I think that explains it.
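
To make the distinction concrete, here is a minimal userspace sketch of the two access patterns as I understand them. This is not the benchmark's actual code; the function names (dontneed_per_page, dontneed_one_shot, map_and_touch) are made up for illustration, and I'm assuming the length is a multiple of the base page size.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical "pte-dontneed" pattern: one madvise() call per base page,
 * so the kernel only ever sees a single-page range per call.
 * Assumes buf is page-aligned and len is a multiple of the page size.
 */
static int dontneed_per_page(char *buf, size_t len)
{
	size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);

	assert(len % pagesize == 0);
	for (size_t off = 0; off < len; off += pagesize) {
		if (madvise(buf + off, pagesize, MADV_DONTNEED) != 0)
			return -1;
	}
	return 0;
}

/*
 * Hypothetical "dontneed" pattern: a single madvise() call covering the
 * whole range, so the kernel sees all the pages in one go.
 */
static int dontneed_one_shot(char *buf, size_t len)
{
	return madvise(buf, len, MADV_DONTNEED);
}

/* Map private anonymous memory and write to every byte to populate it. */
static char *map_and_touch(size_t len)
{
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return NULL;
	memset(buf, 1, len);
	return buf;
}
```

Both variants leave the range reading back as zeroes afterwards. The only difference is how many pages each call hands to the kernel, which would explain why only the one-shot variant can benefit from any per-range batching.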