Thread: 22 messages, 6 authors, 2022-05-10

Re: Optimizing kernel compilation / alignments for network performance

From: Arnd Bergmann <arnd@arndb.de>
Date: 2022-05-10 13:20:03
Also in: linux-arm-kernel

On Tue, May 10, 2022 at 1:23 PM Rafał Miłecki [off-list ref] wrote:
On 6.05.2022 10:45, Arnd Bergmann wrote:
quoted
- The higher-end networking SoCs are usually cache-coherent and
   can avoid the cache management entirely. There is a slim chance
   that this chip is designed that way and it just needs to be enabled
   properly. Most low-end chips don't implement the coherent
   interconnect though, and I suppose you have checked this already.
To the best of my knowledge, the Northstar platform doesn't support hw coherency.

I just took an extra look at Broadcom's SDK and they seem to have a
driver for selected chipsets, but BCM708 isn't there.

config BCM_GLB_COHERENCY
        bool "Global Hardware Cache Coherency"
        default n
        depends on BCM963158 || BCM96846 || BCM96858 || BCM96856 || BCM963178 || BCM947622 || BCM963146  || BCM94912 || BCM96813 || BCM96756 || BCM96855
Ok
quoted
- bgmac_dma_rx_update_index() and bgmac_dma_tx_add() appear
   to have an extraneous dma_wmb(), which should be implied by the
   non-relaxed writel() in bgmac_write().
I tried dropping wmb() calls.
With wmb(): 421 Mb/s
Without: 418 Mb/s
That's probably within the noise here. I suppose doing two wmb()
calls in a row is not that expensive because there is nothing left to
wait for. If the extra wmb() is measurably faster than no wmb(), there
is something else going wrong ;-)
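To make the "two barriers in a row" point concrete, here is a minimal userspace sketch. It assumes, as on 32-bit ARM, that a non-relaxed writel() is a write barrier followed by writel_relaxed(). All demo_* names are hypothetical stand-ins, not kernel or bgmac code, and the "barrier" only bumps a counter so the two variants can be compared.

```c
#include <stdint.h>

/* Counts how many write barriers a code path issues (model only). */
static int demo_barriers;

/* Stand-in for wmb()/dma_wmb(): no real ordering, just bookkeeping. */
static void demo_wmb(void)
{
	demo_barriers++;
}

/* Stand-in for writel_relaxed(): a plain store to the "register". */
static void demo_writel_relaxed(uint32_t val, volatile uint32_t *reg)
{
	*reg = val;
}

/* Stand-in for writel(): assumed to be barrier + relaxed store. */
static void demo_writel(uint32_t val, volatile uint32_t *reg)
{
	demo_wmb();
	demo_writel_relaxed(val, reg);
}

/* Index update with an explicit barrier in front of writel(). */
static int demo_kick_with_extra_barrier(volatile uint32_t *reg, uint32_t index)
{
	demo_barriers = 0;
	demo_wmb();               /* the barrier suspected to be redundant */
	demo_writel(index, reg);  /* already contains one barrier */
	return demo_barriers;
}

/* Index update relying only on the barrier inside writel(). */
static int demo_kick(volatile uint32_t *reg, uint32_t index)
{
	demo_barriers = 0;
	demo_writel(index, reg);
	return demo_barriers;
}
```

In this model the first variant issues two barriers and the second one, with the same value ending up in the register. Whether the second barrier costs anything measurable on real hardware is exactly what the numbers above are probing.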
I also tried dropping bgmac_read() from bgmac_chip_intrs_off() which
seems to be a flushing readback.

With bgmac_read(): 421 Mb/s
Without: 413 Mb/s
Interesting, so this is statistically significant, right? It could be that
this changes the interrupt timing just enough that it ends up doing
more work at once some of the time.
quoted
- accesses to the DMA descriptor don't show up in the profile here,
   but look like they can get misoptimized by the compiler. I would
   generally use READ_ONCE() and WRITE_ONCE() for these to
   ensure that you don't end up with extra or out-of-order accesses.
   This also makes it clearer to the reader that something special
   happens here.
Should I use something as below?

FWIW it doesn't seem to change NAT performance.
Without WRITE_ONCE: 421 Mb/s
With: 419 Mb/s
This one depends on the compiler. What I would expect here is that
it often makes no difference, but if the compiler does something
odd, then the WRITE_ONCE() would prevent this and make it behave
as before. I would suggest adding this part regardless.
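As a rough illustration of what this looks like for a descriptor, here is a userspace sketch. The WRITE_ONCE()/READ_ONCE() definitions are simplified stand-ins for the kernel macros (they rely on the GCC/Clang __typeof__ extension), and struct demo_dma_desc and demo_fill_desc() are hypothetical names for illustration, not the bgmac definitions. Endianness conversion such as cpu_to_le32() is left out.

```c
#include <stdint.h>

/*
 * Simplified stand-ins for the kernel macros: the volatile access makes
 * the compiler emit the store or load once, without merging, splitting,
 * or dropping it.
 */
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define READ_ONCE(x)       (*(volatile __typeof__(x) *)&(x))

/* Hypothetical descriptor layout, for illustration only. */
struct demo_dma_desc {
	uint32_t ctl0;
	uint32_t ctl1;
	uint32_t addr_low;
	uint32_t addr_high;
};

/* Fill one descriptor, one explicit store per field. */
static void demo_fill_desc(struct demo_dma_desc *desc, uint64_t dma_addr,
			   uint32_t ctl0, uint32_t ctl1)
{
	WRITE_ONCE(desc->addr_low, (uint32_t)dma_addr);
	WRITE_ONCE(desc->addr_high, (uint32_t)(dma_addr >> 32));
	WRITE_ONCE(desc->ctl0, ctl0);
	WRITE_ONCE(desc->ctl1, ctl1);
}
```

The macros do not order the stores against the device by themselves. They only keep the compiler from rearranging or duplicating the individual accesses, so any barrier before the index register write is still needed.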

The other suggestion I had was this, I think you did not test this:
--- a/drivers/net/ethernet/broadcom/bgmac.c
+++ b/drivers/net/ethernet/broadcom/bgmac.c
@@ -1156,11 +1156,12 @@ static int bgmac_poll(struct napi_struct *napi, int weight)
        bgmac_dma_tx_free(bgmac, &bgmac->tx_ring[0]);
        handled += bgmac_dma_rx_read(bgmac, &bgmac->rx_ring[0], weight);

-       /* Poll again if more events arrived in the meantime */
-       if (bgmac_read(bgmac, BGMAC_INT_STATUS) & (BGMAC_IS_TX0 | BGMAC_IS_RX))
-               return weight;
-
        if (handled < weight) {
+               /* Poll again if more events arrived in the meantime */
+               if (bgmac_read(bgmac, BGMAC_INT_STATUS) &
+                               (BGMAC_IS_TX0 | BGMAC_IS_RX))
+                       return weight;
+
                napi_complete_done(napi, handled);
                bgmac_chip_intrs_on(bgmac);
        }

Or possibly, remove that extra check entirely and just rely on the irq to do
this after it gets turned on again.
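
The effect of the reordering in the patch above can be sketched in userspace. This is a model only: struct demo_dev, the DEMO_* bits, and the demo_* functions are hypothetical stand-ins, and the "register read" just increments a counter so the number of simulated MMIO reads per poll can be observed.

```c
#include <stdint.h>

#define DEMO_IS_RX  0x1u
#define DEMO_IS_TX0 0x2u

/* Hypothetical device state for the model. */
struct demo_dev {
	uint32_t int_status;  /* pretend interrupt status register */
	int pending_rx;       /* packets waiting in the ring */
	int status_reads;     /* number of simulated MMIO reads */
	int irqs_enabled;     /* set when polling completes */
};

/* Stand-in for bgmac_read(bgmac, BGMAC_INT_STATUS). */
static uint32_t demo_read_status(struct demo_dev *dev)
{
	dev->status_reads++;
	return dev->int_status;
}

/* Stand-in for the RX path: consume up to 'weight' packets. */
static int demo_rx(struct demo_dev *dev, int weight)
{
	int n = dev->pending_rx < weight ? dev->pending_rx : weight;

	dev->pending_rx -= n;
	return n;
}

/* Poll with the status check moved inside the 'handled < weight' branch. */
static int demo_poll(struct demo_dev *dev, int weight)
{
	int handled = demo_rx(dev, weight);

	if (handled < weight) {
		/* Poll again if more events arrived in the meantime */
		if (demo_read_status(dev) & (DEMO_IS_TX0 | DEMO_IS_RX))
			return weight;

		/* Stands in for napi_complete_done() + bgmac_chip_intrs_on(). */
		dev->irqs_enabled = 1;
	}
	return handled;
}
```

When the budget is exhausted (handled == weight), the model never touches the status register, since the caller polls again anyway. The read only happens on the path that is about to re-enable interrupts.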

         Arnd