Re: ARM router NAT performance affected by random/unrelated commits | linux-arm-kernel

On Wed, May 22, 2019 at 01:51:01PM +0200, Rafał Miłecki wrote:
On 21.05.2019 12:45, Russell King - ARM Linux admin wrote:
On Tue, May 21, 2019 at 12:28:48PM +0200, Rafał Miłecki wrote:
I work on home routers based on Broadcom's Northstar SoCs. Those devices
have ARM Cortex-A9 and most of them are dual-core.

As for home routers, my main concern is network performance. That CPU
isn't powerful enough to handle gigabit traffic, so all kinds of
optimizations matter. I noticed some unexpected changes in NAT
performance when switching between kernels.

My hardware is BCM47094 SoC (dual core ARM) with integrated network
controller and external BCM53012 switch.
Guessing, I'd say it's to do with the placement of code wrt cachelines.
You could try aligning some of the cache flushing code to a cache line
and see what effect that has.
Is System.map a good place to check the alignment of function code?

With Linux 4.19 + OpenWrt mtd patches I have:
(...)
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
(...)
c02ca3d0 T blk_mq_update_nr_hw_queues
c02ca69c T blk_mq_alloc_tag_set
c02ca94c T blk_mq_release
c02ca9b4 T blk_mq_free_queue
c02caa88 T blk_mq_update_nr_requests
c02cab50 T blk_mq_unique_tag
(...)

After cherry-picking 9316a9ed6895 ("blk-mq: provide helper for setting
up an SQ queue and tag set"):
(...)
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
(...)
c02ca3d0 T blk_mq_update_nr_hw_queues
c02ca69c T blk_mq_alloc_tag_set
c02ca94c T blk_mq_init_sq_queue <-- NEW
c02ca9c0 T blk_mq_release <-- Different address of this & all below
c02caa28 T blk_mq_free_queue
c02caafc T blk_mq_update_nr_requests
c02cabc4 T blk_mq_unique_tag
(...)

As you can see blk_mq_init_sq_queue has appeared in the System.map and
it affected addresses of ~30000 symbols. I can believe some frequently
used symbols got luckily aligned and that improved overall performance.

Interestingly v7_dma_inv_range() and v7_dma_clean_range() were not
relocated.
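
The comparison of the two System.map excerpts above can be scripted. What follows is only a sketch: the helper names parse_map and compare_maps are made up for illustration, and the inline OLD/NEW strings are a subset of the lines quoted above, not the full maps.

```python
def parse_map(text):
    """Parse System.map-style lines ("addr type name") into {name: addr}."""
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skips "(...)" placeholders and blank lines
        try:
            symbols[parts[2]] = int(parts[0], 16)
        except ValueError:
            continue  # first field was not a hex address
    return symbols

def compare_maps(old_text, new_text):
    """Return (new symbol names, {moved symbol: shift in bytes})."""
    old, new = parse_map(old_text), parse_map(new_text)
    added = sorted(set(new) - set(old))
    moved = {n: new[n] - old[n] for n in old if n in new and new[n] != old[n]}
    return added, moved

# Excerpts copied from the two listings above.
OLD = """\
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
c02ca69c T blk_mq_alloc_tag_set
c02ca94c T blk_mq_release
c02ca9b4 T blk_mq_free_queue
c02caa88 T blk_mq_update_nr_requests
c02cab50 T blk_mq_unique_tag
"""
NEW = """\
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
c02ca69c T blk_mq_alloc_tag_set
c02ca94c T blk_mq_init_sq_queue
c02ca9c0 T blk_mq_release
c02caa28 T blk_mq_free_queue
c02caafc T blk_mq_update_nr_requests
c02cabc4 T blk_mq_unique_tag
"""

added, moved = compare_maps(OLD, NEW)
print("new symbols:", added)
for name, shift in moved.items():
    print(f"{name}: moved by {shift:#x} bytes")
```

On these excerpts the four blk_mq_* symbols after the new function each move by 0x74 bytes, while the two v7_dma_* symbols stay put.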

*****

I followed Russell's suggestion and added .align 5 to cache-v7.S (see
two attached diffs).

1) v4.19 + OpenWrt mtd patches
egrep -B 1 -A 1 "v7_dma_(inv|clean)_range" System.map
c010ea58 T v7_flush_kern_dcache_area
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
c010eb18 T b15_dma_flush_range

2) v4.19 + OpenWrt mtd patches + two .align 5 in cache-v7.S
c010ea6c T v7_flush_kern_dcache_area
c010eac0 t v7_dma_inv_range
c010eb20 t v7_dma_clean_range
c010eb58 T b15_dma_flush_range
(actually 15 symbols above v7_dma_inv_range were relocated)

This method seems to work, at least in the sense that it affects the
addresses in System.map.
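
Whether a symbol really starts on a cache line boundary can be read off its address modulo the line size. The sketch below assumes a 32-byte line, which is what ".align 5" (2**5 bytes) targets; the function name line_offsets is hypothetical and the two strings are the System.map excerpts from cases 1) and 2) above.

```python
# Assumption: 32-byte cache lines, i.e. the boundary requested by ".align 5".
CACHE_LINE = 32

def line_offsets(system_map_text, names):
    """Return {symbol: address % CACHE_LINE} for the requested symbols."""
    offsets = {}
    for line in system_map_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # not an "addr type name" line
        addr, _sym_type, name = parts
        if name in names:
            offsets[name] = int(addr, 16) % CACHE_LINE
    return offsets

BEFORE = """\
c010ea58 T v7_flush_kern_dcache_area
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
c010eb18 T b15_dma_flush_range
"""
AFTER = """\
c010ea6c T v7_flush_kern_dcache_area
c010eac0 t v7_dma_inv_range
c010eb20 t v7_dma_clean_range
c010eb58 T b15_dma_flush_range
"""

DMA = {"v7_dma_inv_range", "v7_dma_clean_range"}
before = line_offsets(BEFORE, DMA)
after = line_offsets(AFTER, DMA)
for name in sorted(DMA):
    print(f"{name}: offset {before[name]} -> {after[name]}")
```

With both patches applied, both functions sit at offset 0. In the unpatched listing v7_dma_inv_range is at offset 20, while v7_dma_clean_range (c010eae0) already falls on a 32-byte boundary.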

*****

I ran 2 tests for each combination of changes. Each test consisted of
10 sequences of: 30 seconds iperf session + reboot.


git reset --hard v4.19
git am OpenWrt-mtd-chages.patch
Test #1: 738 Mb/s
Test #2: 737 Mb/s

git reset --hard v4.19
git am OpenWrt-mtd-chages.patch
patch -p1 < v7_dma_clean_range-align.diff
Test #1: 746 Mb/s
Test #2: 747 Mb/s

git reset --hard v4.19
git am OpenWrt-mtd-chages.patch
patch -p1 < v7_dma_inv_range-align.diff
Test #1: 745 Mb/s
Test #2: 746 Mb/s

git reset --hard v4.19
git am OpenWrt-mtd-chages.patch
patch -p1 < v7_dma_clean_range-align.diff
patch -p1 < v7_dma_inv_range-align.diff
Test #1: 762 Mb/s
Test #2: 761 Mb/s

As you can see I got a quite nice performance improvement after aligning
both: v7_dma_clean_range() and v7_dma_inv_range().
This is an improvement of about 3.3%.

It still wasn't as good as with 9316a9ed6895 cherry-picked but pretty
close.


git reset --hard v4.19
git am OpenWrt-mtd-chages.patch
git cherry-pick -x 9316a9ed6895
Test #1: 770 Mb/s
Test #2: 766 Mb/s

git reset --hard v4.19
git am OpenWrt-mtd-chages.patch
git cherry-pick -x 9316a9ed6895
patch -p1 < v7_dma_clean_range-align.diff
Test #1: 756 Mb/s
Test #2: 759 Mb/s

git reset --hard v4.19
git am OpenWrt-mtd-chages.patch
git cherry-pick -x 9316a9ed6895
patch -p1 < v7_dma_inv_range-align.diff
Test #1: 758 Mb/s
Test #2: 759 Mb/s

git reset --hard v4.19
git am OpenWrt-mtd-chages.patch
git cherry-pick -x 9316a9ed6895
patch -p1 < v7_dma_clean_range-align.diff
patch -p1 < v7_dma_inv_range-align.diff
Test #1: 767 Mb/s
Test #2: 763 Mb/s

Now you can see how unpredictable it is. If I cherry-pick 9316a9ed6895
and do an extra alignment of v7_dma_clean_range() and v7_dma_inv_range()
that extra alignment can actually *hurt* NAT performance.
You have a maximum variance of 4Mb/s in your tests which is around
0.5%, and this shows a reduction of 3Mb/s, or 0.4%.

If we look at it a different way:
- Without the alignment patches, there is a difference of 4% in
  performance depending on whether 9316a9ed6895 is applied.
- With the alignment patches, there is a difference of 0.4% in
  performance depending on whether 9316a9ed6895 is applied.

How can this not be beneficial?
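
The percentages in this comparison can be rechecked from the posted numbers. This is a sketch only: the dictionary keys are labels invented here, and averaging the two test runs per configuration is an assumption about how to summarize them.

```python
from statistics import mean

# Throughput in Mb/s (Test #1, Test #2), copied from the results above.
RESULTS = {
    "base": (738, 737),
    "base+both_aligns": (762, 761),
    "pick": (770, 766),
    "pick+both_aligns": (767, 763),
}

def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

avg = {name: mean(runs) for name, runs in RESULTS.items()}

# Gain from the two alignment patches alone.
align_gain = pct_change(avg["base"], avg["base+both_aligns"])
# Effect of cherry-picking 9316a9ed6895 without and with the alignment patches.
pick_effect_unaligned = pct_change(avg["base"], avg["pick"])
pick_effect_aligned = pct_change(avg["base+both_aligns"], avg["pick+both_aligns"])

print(f"alignment patches alone:       {align_gain:+.2f}%")
print(f"9316a9ed6895 without aligning: {pick_effect_unaligned:+.2f}%")
print(f"9316a9ed6895 with aligning:    {pick_effect_aligned:+.2f}%")
```

On the averaged runs, the cherry-pick changes throughput by a little over 4% without the alignment patches and by roughly half a percent with them, so the alignment patches make the result much less sensitive to that unrelated commit.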

My guess is that:
1) 9316a9ed6895 provides alignment of some very important function(s)
2) DMA alignments on top ^^ provide some gain but also break some alignment

*****

SUMMARY

It seems that for Linux 4.19 + my .config I can get a very lucky &
optimal alignment of functions by cherry-picking 9316a9ed6895.

I thought of checking functions reported by the "perf" tool with CPU
usage of 2%+.

All following functions keep their original address with 9316a9ed6895:
__irqentry_text_end
arch_cpu_idle
l2c210_clean_range
l2c210_inv_range
v7_dma_clean_range
v7_dma_inv_range

The remaining 3 functions got relocated:
-c03e5038 t __netif_receive_skb_core
+c03e50b0 t __netif_receive_skb_core
-c03c8b1c t bcma_host_soc_read32
+c03c8b94 t bcma_host_soc_read32
-c0475620 T fib_table_lookup
+c0475698 T fib_table_lookup

I tried aligning all 3 above functions using:
__attribute__((aligned(32)))
and got 756 Mb/s. It's better but still not ~770 Mb/s.
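
For the three relocated functions, the shift and the offset within a cache line before and after the cherry-pick can be computed directly from the addresses listed above. A sketch, again assuming 32-byte lines; the describe helper is hypothetical.

```python
CACHE_LINE = 32  # assumed line size, matching aligned(32) and ".align 5"

# (address before, address after cherry-picking 9316a9ed6895), from the list above.
MOVED = {
    "__netif_receive_skb_core": (0xC03E5038, 0xC03E50B0),
    "bcma_host_soc_read32": (0xC03C8B1C, 0xC03C8B94),
    "fib_table_lookup": (0xC0475620, 0xC0475698),
}

def describe(old, new):
    """Shift in bytes plus the offset inside a cache line before and after."""
    return {
        "shift": new - old,
        "old_off": old % CACHE_LINE,
        "new_off": new % CACHE_LINE,
    }

report = {name: describe(*addrs) for name, addrs in MOVED.items()}
for name, r in report.items():
    print(f"{name:26} shift={r['shift']:#x} "
          f"offset {r['old_off']:2} -> {r['new_off']:2}")
```

All three functions move by the same 0x78 bytes, and none of them starts on a 32-byte boundary after the cherry-pick (fib_table_lookup actually did before it). That may suggest the gain is not simply a matter of these entry points being line-aligned, though the numbers alone do not prove it.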

Is there any easy way of identifying which function alignments have
such a big impact on NAT performance? I'd like to get those functions
explicitly aligned using assembler/__attribute__/something.

What I'm also afraid of are false positives. I may end up aligning some
unrelated function that just happens to align other ones, just like
cherry-picking 9316a9ed6895 has side effects without really fixing
anything explicitly.

diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S
index 215df435bfb9..c60046cd34aa 100644
--- a/arch/arm/mm/cache-v7.S
+++ b/arch/arm/mm/cache-v7.S
@@ -373,6 +373,8 @@ v7_dma_inv_range:
 	ret	lr
 ENDPROC(v7_dma_inv_range)
 
+	.align	5
+
 /*
  *	v7_dma_clean_range(start,end)
  *	- start   - virtual start address of region

diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S
index 215df435bfb9..0c3999f219ab 100644
--- a/arch/arm/mm/cache-v7.S
+++ b/arch/arm/mm/cache-v7.S
@@ -340,6 +340,8 @@ ENTRY(v7_flush_kern_dcache_area)
 	ret	lr
 ENDPROC(v7_flush_kern_dcache_area)
 
+	.align	5
+
 /*
  *	v7_dma_inv_range(start,end)
  *

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up
