On 2022/10/27 22:19, Punit Agrawal wrote:
[ Apologies for chiming in late in the conversation ]
Anshuman Khandual [off-list ref] writes:
quoted
On 9/28/22 05:53, Barry Song wrote:
quoted
On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang [off-list ref] wrote:
quoted
On 2022/9/27 14:16, Anshuman Khandual wrote:
quoted
[...]
On 9/21/22 14:13, Yicong Yang wrote:
quoted
+static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+ /* for small systems with small number of CPUs, TLB shootdown is cheap */
+ if (num_online_cpus() <= 4)
It would be great to have some more inputs from others, whether 4 (which should
to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something similar)
is optimal for an wide range of arm64 platforms.
I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
with 5,6,7
cores.
I saw improvement on 8-cpus machines and I found 4-cpus machines don't need
this patch.
so it seems safe to have
if (num_online_cpus() < 8)
quoted
Do you prefer this macro to be static or make it configurable through kconfig then
different platforms can make choice based on their own situations? It maybe hard to
test on all the arm64 platforms.
Maybe we can have this default enabled on machines with 8 and more cpus and
provide a tlbflush_batched = on or off to allow users enable or
disable it according
to their hardware and products. Similar example: rodata=on or off.
No, sounds bit excessive. Kernel command line options should not be added
for every possible run time switch options.
quoted
Hi Anshuman, Will, Catalin, Andrew,
what do you think about this approach?
BTW, haoxin mentioned another important user scenarios for tlb bach on arm64:
https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf0fc@linux.alibaba.com/ (local)
I do believe we need it based on the expensive cost of tlb shootdown in arm64
even by hardware broadcast.
Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively
with CONFIG_EXPERT and for num_online_cpus() > 8 ?
When running the test program in the commit in a VM, I saw benefits from
the patches at all sizes from 2, 4, 8, 32 vcpus. On the test machine,
ptep_clear_flush() went from ~1% in the unpatched version to not showing
up.
Maybe you're booting VM on a server with more than 32 cores and Barry tested
on his 4 CPUs embedded platform. I guess a 4 CPU VM is not fully equivalent to
a 4 CPU real machine as the tbli and dsb in the VM may influence the host
as well.
Yicong mentioned that he didn't see any benefit for <= 4 CPUs but is
there any overhead? I am wondering what are the downsides of enabling
the config by default.
Thanks,
Punit
.