Re: Optimizing kernel compilation / alignments for network performance

From: Rafał Miłecki <zajec5@gmail.com>
Date: 2022-05-10 11:25:00
Also in: linux-arm-kernel
Subsystem: broadcom iproc gbit ethernet driver, networking drivers, the rest · Maintainers: Rafał Miłecki, Andrew Lunn, "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni, Linus Torvalds

On 6.05.2022 10:45, Arnd Bergmann wrote:

On Fri, May 6, 2022 at 9:44 AM Rafał Miłecki [off-list ref] wrote:

quoted

On 5.05.2022 18:04, Andrew Lunn wrote:

quoted

you'll see that most used functions are:
v7_dma_inv_range
__irqentry_text_end
l2c210_inv_range
v7_dma_clean_range
bcma_host_soc_read32
__netif_receive_skb_core
arch_cpu_idle
l2c210_clean_range
fib_table_lookup

There is a lot of cache management functions here.

Indeed, so optimizing the coherency management (see Felix' reply)
is likely to help most in making the driver faster, but that does not
explain why the alignment of the object code has such a big impact
on performance.

To investigate the alignment further, what I was actually looking for
is a comparison of the profile of the slow and fast case. Here I would
expect that the slow case spends more time in one of the functions
that don't deal with cache management (maybe fib_table_lookup or
__netif_receive_skb_core).

A few other thoughts:

- bcma_host_soc_read32() is a fundamentally slow operation, maybe
   some of the calls can turned into a relaxed read, like the readback
   in bgmac_chip_intrs_off() or the 'poll again' at the end bgmac_poll(),
   though obviously not the one in bgmac_dma_rx_read().
   It may be possible to even avoid some of the reads entirely, checking
   for more data in bgmac_poll() may actually be counterproductive
   depending on the workload.

I'll experiment with that, hopefully I can optimize it a bit.

- The higher-end networking SoCs are usually cache-coherent and
   can avoid the cache management entirely. There is a slim chance
   that this chip is designed that way and it just needs to be enabled
   properly. Most low-end chips don't implement the coherent
   interconnect though, and I suppose you have checked this already.

To my best knowledge Northstar platform doesn't support hw coherency.

I just took an extra look at Broadcom's SDK and them seem to have some
driver for selected chipsets but BCM708 isn't there.

config BCM_GLB_COHERENCY
	bool "Global Hardware Cache Coherency"
	default n
	depends on BCM963158 || BCM96846 || BCM96858 || BCM96856 || BCM963178 || BCM947622 || BCM963146  || BCM94912 || BCM96813 || BCM96756 || BCM96855

- bgmac_dma_rx_update_index() and bgmac_dma_tx_add() appear
   to have an extraneous dma_wmb(), which should be implied by the
   non-relaxed writel() in bgmac_write().

I tried dropping wmb() calls.
With wmb(): 421 Mb/s
Without: 418 Mb/s


I also tried dropping bgmac_read() from bgmac_chip_intrs_off() which
seems to be a flushing readback.

With bgmac_read(): 421 Mb/s
Without: 413 Mb/s

- accesses to the DMA descriptor don't show up in the profile here,
   but look like they can get misoptimized by the compiler. I would
   generally use READ_ONCE() and WRITE_ONCE() for these to
   ensure that you don't end up with extra or out-of-order accesses.
   This also makes it clearer to the reader that something special
   happens here.

Should I use something as below?

FWIW it doesn't seem to change NAT performance.
Without WRITE_ONCE: 421 Mb/s
With: 419 Mb/s

diff --git a/drivers/net/ethernet/broadcom/bgmac.c b/drivers/net/ethernet/broadcom/bgmac.c
index 87700072..ce98f2a9 100644
--- a/drivers/net/ethernet/broadcom/bgmac.c
+++ b/drivers/net/ethernet/broadcom/bgmac.c

@@ -119,10 +119,10 @@ bgmac_dma_tx_add_buf(struct bgmac *bgmac, struct bgmac_dma_ring *ring,

  	slot = &ring->slots[i];
  	dma_desc = &ring->cpu_base[i];
-	dma_desc->addr_low = cpu_to_le32(lower_32_bits(slot->dma_addr));
-	dma_desc->addr_high = cpu_to_le32(upper_32_bits(slot->dma_addr));
-	dma_desc->ctl0 = cpu_to_le32(ctl0);
-	dma_desc->ctl1 = cpu_to_le32(ctl1);
+	WRITE_ONCE(dma_desc->addr_low, cpu_to_le32(lower_32_bits(slot->dma_addr)));
+	WRITE_ONCE(dma_desc->addr_high, cpu_to_le32(upper_32_bits(slot->dma_addr)));
+	WRITE_ONCE(dma_desc->ctl0, cpu_to_le32(ctl0));
+	WRITE_ONCE(dma_desc->ctl1, cpu_to_le32(ctl1));
  }

  static netdev_tx_t bgmac_dma_tx_add(struct bgmac *bgmac,

@@ -387,10 +387,10 @@ static void bgmac_dma_rx_setup_desc(struct bgmac *bgmac,
  	 * B43_DMA64_DCTL1_ADDREXT_MASK;
  	 */

-	dma_desc->addr_low = cpu_to_le32(lower_32_bits(ring->slots[desc_idx].dma_addr));
-	dma_desc->addr_high = cpu_to_le32(upper_32_bits(ring->slots[desc_idx].dma_addr));
-	dma_desc->ctl0 = cpu_to_le32(ctl0);
-	dma_desc->ctl1 = cpu_to_le32(ctl1);
+	WRITE_ONCE(dma_desc->addr_low, cpu_to_le32(lower_32_bits(ring->slots[desc_idx].dma_addr)));
+	WRITE_ONCE(dma_desc->addr_high, cpu_to_le32(upper_32_bits(ring->slots[desc_idx].dma_addr)));
+	WRITE_ONCE(dma_desc->ctl0, cpu_to_le32(ctl0));
+	WRITE_ONCE(dma_desc->ctl1, cpu_to_le32(ctl1));

  	ring->end = desc_idx;
  }

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help