Re: Optimizing kernel compilation / alignments for network performance
From: Rafał Miłecki <zajec5@gmail.com>
Date: 2022-05-05 15:44:19
Also in: linux-arm-kernel
On 29.04.2022 16:49, Arnd Bergmann wrote:
> On Wed, Apr 27, 2022 at 7:31 PM Rafał Miłecki [off-list ref] wrote:
> > On 27.04.2022 14:56, Alexander Lobakin wrote:
> >
> > Thank you Alexander, this appears to be helpful! I decided to ignore CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for now and just adjust CFLAGS manually.
> >
> > 1. Without ce5013ff3bec and with -falign-functions=32: 387 Mb/s
> > 2. Without ce5013ff3bec and with -falign-functions=64: 377 Mb/s
> > 3. With ce5013ff3bec and with -falign-functions=32: 384 Mb/s
> > 4. With ce5013ff3bec and with -falign-functions=64: 377 Mb/s
> >
> > So it seems that:
> > 1. -falign-functions=32 = pretty stable high speed
> > 2. -falign-functions=64 = very stable slightly lower speed
> >
> > I'm going to perform tests on more commits, but if it stays as reliable as above, that will be a huge success for me.
>
> Note that the problem may not just be the alignment of a particular function, but also how different functions map into your cache. The Cortex-A9 has a 4-way set-associative L1 cache of 16KB, 32KB or 64KB, with a line size of 32 bytes. If you are unlucky and five frequently called functions sit at exactly the wrong spacing, so that they map to the same cache set and need more than four ways, calling them in sequence would always evict the other ones. The same could of course happen if the problem is the D-cache or the L2.
>
> Can you try to get a profile using 'perf record' to see where most time is spent, in both the slowest and the fastest versions? If the instruction cache is the issue, you should see how the hottest addresses line up.
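To make the set-mapping argument above concrete, here is a minimal Python sketch. It assumes the 32KB variant of the cache, plain modulo indexing on the address, and looks only at the first cache line of each function. The addresses are made up for illustration, and the helper names (cache_set, oversubscribed_sets) are hypothetical.

```python
# Assumed Cortex-A9 L1 geometry: the 32 KB option of the sizes quoted above.
CACHE_SIZE = 32 * 1024   # bytes (16 KB and 64 KB are the other options)
WAYS = 4                 # 4-way set-associative
LINE_SIZE = 32           # bytes per cache line

NUM_SETS = CACHE_SIZE // (WAYS * LINE_SIZE)   # number of sets in the cache
WAY_SIZE = NUM_SETS * LINE_SIZE               # bytes covered by one way


def cache_set(addr):
    """Return the set index an address maps to (simple modulo indexing)."""
    return (addr // LINE_SIZE) % NUM_SETS


def oversubscribed_sets(addrs):
    """Group addresses by set and keep only sets holding more than WAYS of them."""
    by_set = {}
    for addr in addrs:
        by_set.setdefault(cache_set(addr), []).append(addr)
    return {s: a for s, a in by_set.items() if len(a) > WAYS}


# Five invented function entry points spaced exactly one way size apart.
hot = [0xC0100000 + i * WAY_SIZE for i in range(5)]
conflicts = oversubscribed_sets(hot)

for set_index, addrs in conflicts.items():
    print("set", set_index, "is wanted by", len(addrs), "hot addresses")
```

With this geometry, addresses that differ by a multiple of the way size compete for the same four ways, so a fifth hot function at that spacing forces an eviction regardless of how each function is aligned.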
Your explanation sounds sane of course. If you take a look at my old e-mail

  ARM router NAT performance affected by random/unrelated commits
  https://lkml.org/lkml/2019/5/21/349
  https://www.spinics.net/lists/linux-block/msg40624.html

you'll see that the most used functions are:

  v7_dma_inv_range
  __irqentry_text_end
  l2c210_inv_range
  v7_dma_clean_range
  bcma_host_soc_read32
  __netif_receive_skb_core
  arch_cpu_idle
  l2c210_clean_range
  fib_table_lookup

Is there a way to optimize the kernel for better cache usage of the selected functions above?

Meanwhile I was testing -fno-reorder-blocks, which some OpenWrt folks reported as worth trying. It is another source of randomness: it stabilizes NAT performance across some commits and breaks stability across others.
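One way to see where the hot functions land between builds is to read their addresses from System.map and check their offset against the 32-byte and 64-byte boundaries. The sketch below assumes the usual "address type symbol" layout of System.map. The addresses and symbol types in the sample are invented, and parse_symbols and alignment_report are hypothetical helper names.

```python
# Invented System.map excerpt: "<hex address> <type> <symbol>" per line.
# These addresses are NOT taken from a real build.
SAMPLE_MAP = """\
c0112340 T v7_dma_inv_range
c01123a0 T v7_dma_clean_range
c0118c64 t l2c210_inv_range
c0565e10 t __netif_receive_skb_core
"""

HOT_FUNCTIONS = [
    "v7_dma_inv_range",
    "v7_dma_clean_range",
    "l2c210_inv_range",
    "__netif_receive_skb_core",
]


def parse_symbols(text):
    """Return {symbol: address} for every well-formed three-column line."""
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            addr, _sym_type, name = parts
            symbols[name] = int(addr, 16)
    return symbols


def alignment_report(symbols, names, align):
    """Return {symbol: offset from the previous align-byte boundary}."""
    return {n: symbols[n] % align for n in names if n in symbols}


symbols = parse_symbols(SAMPLE_MAP)
for align in (32, 64):
    report = alignment_report(symbols, HOT_FUNCTIONS, align)
    for name, offset in report.items():
        print("align", align, name, "offset", offset)
```

An offset of 0 means the function starts on a boundary of that size. Running the same check against the System.map of a fast and a slow build would show whether the hot functions moved relative to cache-line boundaries.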