Re: ARM router NAT performance affected by random/unrelated commits
From: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Date: 2019-05-22 12:17:39
Also in: linux-block, lkml, netdev
On Wed, May 22, 2019 at 01:51:01PM +0200, Rafał Miłecki wrote:
> On 21.05.2019 12:45, Russell King - ARM Linux admin wrote:
> > On Tue, May 21, 2019 at 12:28:48PM +0200, Rafał Miłecki wrote:
> > > I work on home routers based on Broadcom's Northstar SoCs. Those
> > > devices have ARM Cortex-A9 and most of them are dual-core.
> > >
> > > As for home routers, my main concern is network performance. That
> > > CPU isn't powerful enough to handle gigabit traffic so all kinds of
> > > optimizations do matter. I noticed some unexpected changes in NAT
> > > performance when switching between kernels.
> > >
> > > My hardware is BCM47094 SoC (dual core ARM) with integrated network
> > > controller and external BCM53012 switch.
> >
> > Guessing, I'd say it's to do with the placement of code wrt cachelines.
> > You could try aligning some of the cache flushing code to a cache line
> > and see what effect that has.
>
> Is System.map a good place to check function code alignment?
>
> With Linux 4.19 + OpenWrt mtd patches I have:
> (...)
> c010ea94 t v7_dma_inv_range
> c010eae0 t v7_dma_clean_range
> (...)
> c02ca3d0 T blk_mq_update_nr_hw_queues
> c02ca69c T blk_mq_alloc_tag_set
> c02ca94c T blk_mq_release
> c02ca9b4 T blk_mq_free_queue
> c02caa88 T blk_mq_update_nr_requests
> c02cab50 T blk_mq_unique_tag
> (...)
>
> After cherry-picking 9316a9ed6895 ("blk-mq: provide helper for setting
> up an SQ queue and tag set"):
> (...)
> c010ea94 t v7_dma_inv_range
> c010eae0 t v7_dma_clean_range
> (...)
> c02ca3d0 T blk_mq_update_nr_hw_queues
> c02ca69c T blk_mq_alloc_tag_set
> c02ca94c T blk_mq_init_sq_queue      <-- NEW
> c02ca9c0 T blk_mq_release            <-- Different address of this & all below
> c02caa28 T blk_mq_free_queue
> c02caafc T blk_mq_update_nr_requests
> c02cabc4 T blk_mq_unique_tag
> (...)
>
> As you can see blk_mq_init_sq_queue has appeared in the System.map and
> it affected addresses of ~30000 symbols. I can believe some frequently
> used symbols got luckily aligned and that improved overall performance.
>
> Interestingly v7_dma_inv_range() and v7_dma_clean_range() were not
> relocated.
>
> *****
>
> I followed Russell's suggestion and added .align 5 to cache-v7.S (see
> two attached diffs).
>
> 1) v4.19 + OpenWrt mtd patches
> > egrep -B 1 -A 1 "v7_dma_(inv|clean)_range" System.map
> c010ea58 T v7_flush_kern_dcache_area
> c010ea94 t v7_dma_inv_range
> c010eae0 t v7_dma_clean_range
> c010eb18 T b15_dma_flush_range
>
> 2) v4.19 + OpenWrt mtd patches + two .align 5 in cache-v7.S
> c010ea6c T v7_flush_kern_dcache_area
> c010eac0 t v7_dma_inv_range
> c010eb20 t v7_dma_clean_range
> c010eb58 T b15_dma_flush_range
> (actually 15 symbols above v7_dma_inv_range were replaced)
>
> This method seems to be somehow working (at least affects addresses in
> System.map).
>
> *****
>
> I ran 2 tests for each combination of changes. Each test consisted of
> 10 sequences of: 30 seconds iperf session + reboot.
>
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> Test #1: 738 Mb/s
> Test #2: 737 Mb/s
>
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> > patch -p1 < v7_dma_clean_range-align.diff
> Test #1: 746 Mb/s
> Test #2: 747 Mb/s
>
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> > patch -p1 < v7_dma_inv_range-align.diff
> Test #1: 745 Mb/s
> Test #2: 746 Mb/s
>
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> > patch -p1 < v7_dma_clean_range-align.diff
> > patch -p1 < v7_dma_inv_range-align.diff
> Test #1: 762 Mb/s
> Test #2: 761 Mb/s
>
> As you can see I got quite a nice performance improvement after aligning
> both: v7_dma_clean_range() and v7_dma_inv_range().

This is an improvement of about 3.3%.

> It still wasn't as good as with 9316a9ed6895 cherry-picked but pretty
> close.
>
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> > git cherry-pick -x 9316a9ed6895
> Test #1: 770 Mb/s
> Test #2: 766 Mb/s
>
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> > git cherry-pick -x 9316a9ed6895
> > patch -p1 < v7_dma_clean_range-align.diff
> Test #1: 756 Mb/s
> Test #2: 759 Mb/s
>
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> > git cherry-pick -x 9316a9ed6895
> > patch -p1 < v7_dma_inv_range-align.diff
> Test #1: 758 Mb/s
> Test #2: 759 Mb/s
>
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> > git cherry-pick -x 9316a9ed6895
> > patch -p1 < v7_dma_clean_range-align.diff
> > patch -p1 < v7_dma_inv_range-align.diff
> Test #1: 767 Mb/s
> Test #2: 763 Mb/s
>
> Now you can see how unpredictable it is. If I cherry-pick 9316a9ed6895
> and do an extra alignment of v7_dma_clean_range() and v7_dma_inv_range()
> that extra alignment can actually *hurt* NAT performance.

You have a maximum variance of 4Mb/s in your tests which is around 0.5%,
and this shows a reduction of 3Mb/s, or 0.4%.

If we look at it a different way:

- Without the alignment patches, there is a difference of 4% in
  performance depending on whether 9316a9ed6895 is applied.

- With the alignment patches, there is a difference of 0.4% in
  performance depending on whether 9316a9ed6895 is applied.

How can this not be beneficial?
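
For reference, the percentages in this reply can be reproduced from the
quoted iperf figures. The following is only a minimal Python sketch: the
dictionary keys and variable names are illustrative labels, not anything
from the thread, and each configuration is summarised by the mean of its
two runs, which is an assumption about how the figures were rounded.

```python
from statistics import mean

# iperf results in Mb/s, copied from the quoted test runs.
RESULTS = {
    "base": [738, 737],
    "base + both aligns": [762, 761],
    "cherry-pick": [770, 766],
    "cherry-pick + both aligns": [767, 763],
}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Summarise each configuration by the mean of its two runs (assumption).
avg = {name: mean(runs) for name, runs in RESULTS.items()}

# Gain from aligning both DMA functions on the plain v4.19 tree.
align_gain = pct_change(avg["base"], avg["base + both aligns"])
# Effect of the cherry-pick without and with the alignment patches.
gap_unaligned = pct_change(avg["base"], avg["cherry-pick"])
gap_aligned = pct_change(avg["base + both aligns"],
                         avg["cherry-pick + both aligns"])
# Largest difference between two runs of the same configuration.
max_spread = max(max(runs) - min(runs) for runs in RESULTS.values())

print(f"alignment gain on base:       {align_gain:.2f}%")
print(f"cherry-pick gap, no aligns:   {gap_unaligned:.2f}%")
print(f"cherry-pick gap, with aligns: {gap_aligned:.2f}%")
print(f"max run-to-run spread:        {max_spread} Mb/s")
```

With only two runs per configuration, a 4Mb/s spread is of the same
order as the smaller differences, so those are best read as noise-level.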

> My guess is that:
> 1) 9316a9ed6895 provides alignment of some very important function(s)
> 2) DMA alignments on top ^^ provide some gain but also break some
>    alignment
>
> *****
>
> SUMMARY
>
> It seems that for Linux 4.19 + my .config I can get a very lucky &
> optimal alignment of functions by cherry-picking 9316a9ed6895.
>
> I thought of checking functions reported by the "perf" tool with CPU
> usage of 2%+. All following functions keep their original address with
> 9316a9ed6895:
> __irqentry_text_end
> arch_cpu_idle
> l2c210_clean_range
> l2c210_inv_range
> v7_dma_clean_range
> v7_dma_inv_range
>
> Remaining 3 functions got relocated:
> -c03e5038 t __netif_receive_skb_core
> +c03e50b0 t __netif_receive_skb_core
> -c03c8b1c t bcma_host_soc_read32
> +c03c8b94 t bcma_host_soc_read32
> -c0475620 T fib_table_lookup
> +c0475698 T fib_table_lookup
>
> I tried aligning all 3 above functions using:
> __attribute__((aligned(32)))
> and got 756 Mb/s. It's better but still not ~770 Mb/s.
>
> Is there any easy way of identifying which of the function alignments
> had such a big impact on NAT performance? I'd like to get those
> functions explicitly aligned using assembler/__attribute__/something.
>
> What I'm also afraid of are false positives. I may end up aligning some
> unrelated function that just happens to align other ones. Just like
> cherry-picking 9316a9ed6895 having side-effects and not really fixing
> anything explicitly.
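
One way to narrow the candidates down is to compare, symbol by symbol,
the offset within a cache line in two System.map files. The sketch
below is a hedged illustration, not a tool from this thread:
parse_system_map() and offset_changes() are hypothetical helper names,
and the 32-byte line size is an assumption taken from the .align 5 and
aligned(32) values used above. System.map only records start addresses,
so this lists what moved relative to a line boundary; it cannot say
which of those functions is hot.

```python
CACHE_LINE = 32  # bytes; assumed from ".align 5" (2**5) and aligned(32)

# Excerpts copied from the quoted System.map listings.
BEFORE = """\
c010ea94 t v7_dma_inv_range
c03e5038 t __netif_receive_skb_core
c03c8b1c t bcma_host_soc_read32
c0475620 T fib_table_lookup
"""
AFTER = """\
c010ea94 t v7_dma_inv_range
c03e50b0 t __netif_receive_skb_core
c03c8b94 t bcma_host_soc_read32
c0475698 T fib_table_lookup
"""


def parse_system_map(text):
    """Parse 'address type name' lines into {name: address}.

    Duplicate names (static functions sharing a name) overwrite each
    other here; a real tool would need to handle that.
    """
    symbols = {}
    for raw in text.splitlines():
        parts = raw.split()
        if len(parts) != 3:
            continue
        addr, _sym_type, name = parts
        try:
            symbols[name] = int(addr, 16)
        except ValueError:
            continue
    return symbols


def offset_changes(before, after, line=CACHE_LINE):
    """Yield (name, old_offset, new_offset) for symbols present in both
    maps whose offset within a cache line differs."""
    for name in sorted(before.keys() & after.keys()):
        old, new = before[name] % line, after[name] % line
        if old != new:
            yield name, old, new


for name, old, new in offset_changes(parse_system_map(BEFORE),
                                     parse_system_map(AFTER)):
    print(f"{name}: offset {old} -> {new}")
```

For the three relocated functions quoted above the offsets go from 24,
28 and 0 to 16, 20 and 24, so none of them starts on a 32-byte boundary
after the cherry-pick. That hints that function start alignment alone
may not explain the gain.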
> diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S
> index 215df435bfb9..c60046cd34aa 100644
> --- a/arch/arm/mm/cache-v7.S
> +++ b/arch/arm/mm/cache-v7.S
> @@ -373,6 +373,8 @@ v7_dma_inv_range:
>  	ret	lr
>  ENDPROC(v7_dma_inv_range)
>  
> +	.align	5
> +
>  /*
>   *	v7_dma_clean_range(start,end)
>   *	- start   - virtual start address of region
> diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S
> index 215df435bfb9..0c3999f219ab 100644
> --- a/arch/arm/mm/cache-v7.S
> +++ b/arch/arm/mm/cache-v7.S
> @@ -340,6 +340,8 @@ ENTRY(v7_flush_kern_dcache_area)
>  	ret	lr
>  ENDPROC(v7_flush_kern_dcache_area)
>  
> +	.align	5
> +
>  /*
>   *	v7_dma_inv_range(start,end)
>   *
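
On ARM, ".align 5" in the GNU assembler pads to a 2**5 = 32 byte
boundary. The short Python sketch below shows that rounding and checks
it against the System.map addresses quoted earlier. align_up() is a
hypothetical helper, and the function sizes are assumptions inferred
from the gaps between consecutive symbols in the unpatched build.

```python
def align_up(addr, power):
    """Round addr up to the next multiple of 2**power."""
    size = 1 << power
    return (addr + size - 1) & ~(size - 1)


# Sizes inferred from the unpatched System.map (assumption: the code
# itself is unchanged, only its placement differs).
FLUSH_KERN_DCACHE_SIZE = 0xc010ea94 - 0xc010ea58  # v7_flush_kern_dcache_area
DMA_INV_RANGE_SIZE = 0xc010eae0 - 0xc010ea94      # v7_dma_inv_range

# Start of v7_flush_kern_dcache_area in the build with both .align 5.
flush_start = 0xc010ea6c

# Each aligned function starts at the next 32-byte boundary after the
# end of the previous one.
inv_start = align_up(flush_start + FLUSH_KERN_DCACHE_SIZE, 5)
clean_start = align_up(inv_start + DMA_INV_RANGE_SIZE, 5)

print(f"v7_dma_inv_range   expected at {inv_start:08x}")
print(f"v7_dma_clean_range expected at {clean_start:08x}")
```

Under those assumptions the computed addresses agree with the
c010eac0 and c010eb20 entries in the patched System.map.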

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel