Re: Optimizing kernel compilation / alignments for network performance

From: Arnd Bergmann <arnd@arndb.de>
Date: 2022-05-10 13:20:03
Also in: linux-arm-kernel

On Tue, May 10, 2022 at 1:23 PM Rafał Miłecki [off-list ref] wrote:

On 6.05.2022 10:45, Arnd Bergmann wrote:

- The higher-end networking SoCs are usually cache-coherent and
   can avoid the cache management entirely. There is a slim chance
   that this chip is designed that way and it just needs to be enabled
   properly. Most low-end chips don't implement the coherent
   interconnect though, and I suppose you have checked this already.

To the best of my knowledge, the Northstar platform doesn't support hw coherency.

I just took an extra look at Broadcom's SDK and they seem to have some
driver for selected chipsets, but BCM708 isn't there.

config BCM_GLB_COHERENCY
        bool "Global Hardware Cache Coherency"
        default n
        depends on BCM963158 || BCM96846 || BCM96858 || BCM96856 || BCM963178 || BCM947622 || BCM963146  || BCM94912 || BCM96813 || BCM96756 || BCM96855

Ok

- bgmac_dma_rx_update_index() and bgmac_dma_tx_add() appear
   to have an extraneous dma_wmb(), which should be implied by the
   non-relaxed writel() in bgmac_write().

I tried dropping wmb() calls.
With wmb(): 421 Mb/s
Without: 418 Mb/s

That's probably within the noise here. I suppose doing two wmb()
calls in a row is not that expensive because there is nothing left to
wait for. If the extra wmb() is measurably faster than no wmb(), there
is something else going wrong ;-)
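
To make the redundancy concrete, here is a minimal userspace model of it. It
is only a sketch: C11 fences stand in for the kernel barriers, and every name
in it (mock_desc, mock_writel(), publish()) is hypothetical, not taken from
bgmac. The assumption being modelled is that a non-relaxed writel() issues its
own write barrier before the register store, so one more barrier right before
it has nothing left to order.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins for a descriptor ring and a ring index register. */
struct mock_desc { uint32_t ctl0, ctl1, addr_low, addr_high; };

#define MOCK_SLOTS 4
static struct mock_desc ring[MOCK_SLOTS];
static _Atomic uint32_t doorbell;	/* models the index register */
static unsigned int barriers;		/* counts barriers issued */

/* Models writel_relaxed(): a register store with no ordering guarantee. */
static void mock_writel_relaxed(uint32_t val)
{
	atomic_store_explicit(&doorbell, val, memory_order_relaxed);
}

/* Models non-relaxed writel(): barrier first, then the register store. */
static void mock_writel(uint32_t val)
{
	atomic_thread_fence(memory_order_release);
	barriers++;
	mock_writel_relaxed(val);
}

/* Fill one descriptor, then ring the doorbell using a single barrier. */
static void publish(unsigned int slot, uint32_t addr, uint32_t len)
{
	assert(slot < MOCK_SLOTS);
	ring[slot].addr_low = addr;
	ring[slot].addr_high = 0;
	ring[slot].ctl0 = 0;
	ring[slot].ctl1 = len;
	/* No explicit fence here: mock_writel() already orders the stores above. */
	mock_writel(slot + 1);
}
```

In this model, adding a fence in publish() just before mock_writel() would
raise the barrier count to two without changing what the device can observe.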

I also tried dropping bgmac_read() from bgmac_chip_intrs_off() which
seems to be a flushing readback.

With bgmac_read(): 421 Mb/s
Without: 413 Mb/s

Interesting, so this is statistically significant, right? It could be that
this changes the interrupt timing just enough that it ends up doing
more work at once some of the time.

- accesses to the DMA descriptor don't show up in the profile here,
   but look like they can get misoptimized by the compiler. I would
   generally use READ_ONCE() and WRITE_ONCE() for these to
   ensure that you don't end up with extra or out-of-order accesses.
   This also makes it clearer to the reader that something special
   happens here.

Should I use something as below?

FWIW it doesn't seem to change NAT performance.
Without WRITE_ONCE: 421 Mb/s
With: 419 Mb/s

This one depends on the compiler. What I would expect here is that
it often makes no difference, but if the compiler does something
odd, then the WRITE_ONCE() would prevent this and make it behave
as before. I would suggest adding this part regardless.
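
For illustration, a rough userspace sketch of what such descriptor accesses
could look like. This is not the bgmac code: the descriptor layout, the ring
and the helper names are hypothetical, the fields are host-endian for
simplicity, and the two macros are simplified versions of the kernel ones
(they rely on the GCC/Clang __typeof__ extension). The point is only that each
field is stored exactly once, in program order, through a volatile access.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified userspace versions; the kernel macros add extra type checks. */
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define READ_ONCE(x)       (*(volatile __typeof__(x) *)&(x))

/* Hypothetical descriptor layout, host-endian to keep the sketch short. */
struct demo_desc { uint32_t ctl0, ctl1, addr_low, addr_high; };

#define DEMO_RING_SLOTS 8
static struct demo_desc demo_ring[DEMO_RING_SLOTS];

/* Write all four descriptor words, one volatile store per field. */
static void demo_fill_desc(unsigned int slot, uint64_t dma_addr,
			   uint32_t ctl0, uint32_t ctl1)
{
	struct demo_desc *d;

	assert(slot < DEMO_RING_SLOTS);
	d = &demo_ring[slot];
	WRITE_ONCE(d->addr_low, (uint32_t)dma_addr);
	WRITE_ONCE(d->addr_high, (uint32_t)(dma_addr >> 32));
	WRITE_ONCE(d->ctl0, ctl0);
	WRITE_ONCE(d->ctl1, ctl1);
}

/* Read back one control word with a single volatile load. */
static uint32_t demo_read_ctl1(unsigned int slot)
{
	assert(slot < DEMO_RING_SLOTS);
	return READ_ONCE(demo_ring[slot].ctl1);
}
```

The volatile accesses only constrain the compiler. Ordering against the
device still needs the usual barriers.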

The other suggestion I had was this; I think you did not test it:

--- a/drivers/net/ethernet/broadcom/bgmac.c
+++ b/drivers/net/ethernet/broadcom/bgmac.c

@@ -1156,11 +1156,12 @@ static int bgmac_poll(struct napi_struct *napi, int weight)
        bgmac_dma_tx_free(bgmac, &bgmac->tx_ring[0]);
        handled += bgmac_dma_rx_read(bgmac, &bgmac->rx_ring[0], weight);

-       /* Poll again if more events arrived in the meantime */
-       if (bgmac_read(bgmac, BGMAC_INT_STATUS) & (BGMAC_IS_TX0 | BGMAC_IS_RX))
-               return weight;
-
        if (handled < weight) {
+               /* Poll again if more events arrived in the meantime */
+               if (bgmac_read(bgmac, BGMAC_INT_STATUS) &
+                               (BGMAC_IS_TX0 | BGMAC_IS_RX))
+                       return weight;
+
                napi_complete_done(napi, handled);
                bgmac_chip_intrs_on(bgmac);
        }

Or possibly, remove that extra check entirely and just rely on the irq to do
this after it gets turned on again.
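
The control flow of the patch above can be modelled in userspace to see when
the status register actually gets read. Again just a sketch under stated
assumptions: mock_dev and the mock_*() helpers are hypothetical stand-ins, not
driver functions, and the "status register" here is a plain counter rather
than an MMIO read.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device state standing in for the real hardware. */
struct mock_dev {
	int pending;		/* packets waiting in the mock RX ring */
	int status_reads;	/* how often the status register was read */
	bool irq_enabled;
};

/* Consume up to 'weight' packets and return how many were handled. */
static int mock_rx_read(struct mock_dev *dev, int weight)
{
	int n = dev->pending < weight ? dev->pending : weight;

	dev->pending -= n;
	return n;
}

/* Models the status read; on real hardware this is the costly part. */
static bool mock_int_status(struct mock_dev *dev)
{
	dev->status_reads++;
	return dev->pending > 0;
}

/* Poll loop with the status check moved inside the "done" branch. */
static int mock_poll(struct mock_dev *dev, int weight)
{
	int handled;

	assert(weight > 0);
	handled = mock_rx_read(dev, weight);

	if (handled < weight) {
		/* Only look at the status when we are about to stop polling. */
		if (mock_int_status(dev))
			return weight;
		dev->irq_enabled = true;
	}
	return handled;
}
```

With a full budget consumed, the model returns without touching the status
register at all, since the caller will poll again anyway.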

         Arnd
