Thread: 37 messages, 8 authors, 2021-08-20

RE: [PATCH net-next] stmmac: align RX buffers

From: Joakim Zhang <hidden>
Date: 2021-08-11 10:56:46
Also in: lkml, netdev

-----Original Message-----
From: Thierry Reding <redacted>
Sent: August 11, 2021 18:42
To: Marc Zyngier <maz@kernel.org>
Cc: Matteo Croce <redacted>; netdev@vger.kernel.org;
linux-kernel@vger.kernel.org; linux-riscv@lists.infradead.org; Giuseppe
Cavallaro [off-list ref]; Alexandre Torgue
[off-list ref]; David S. Miller [off-list ref];
Jakub Kicinski [off-list ref]; Palmer Dabbelt [off-list ref];
Paul Walmsley [off-list ref]; Drew Fustini
[off-list ref]; Emil Renner Berthing [off-list ref]; Jon
Hunter [off-list ref]; Will Deacon [off-list ref]
Subject: Re: [PATCH net-next] stmmac: align RX buffers

On Tue, Aug 10, 2021 at 08:07:47PM +0100, Marc Zyngier wrote:
Hi all,

[adding Thierry, Jon and Will to the fun]

On Mon, 14 Jun 2021 03:25:04 +0100,
Matteo Croce [off-list ref] wrote:
From: Matteo Croce <redacted>

On RX an SKB is allocated and the received buffer is copied into it.
But on some architectures, the memcpy() needs the source and
destination buffers to have the same alignment to be efficient.

This is not our case, because the SKB data pointer is misaligned by two
bytes to compensate for the Ethernet header.

Align the RX buffer the same way as the SKB one, so the copy is faster.
An iperf3 RX test gives a decent improvement on a RISC-V machine:

before:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   733 MBytes   615 Mbits/sec   88             sender
[  5]   0.00-10.01  sec   730 MBytes   612 Mbits/sec                  receiver

after:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.10 GBytes   942 Mbits/sec    0             sender
[  5]   0.00-10.00  sec  1.09 GBytes   940 Mbits/sec                  receiver

And the memcpy() overhead during the RX drops dramatically.

before:
Overhead  Shared O  Symbol
  43.35%  [kernel]  [k] memcpy
  33.77%  [kernel]  [k] __asm_copy_to_user
   3.64%  [kernel]  [k] sifive_l2_flush64_range

after:
Overhead  Shared O  Symbol
  45.40%  [kernel]  [k] __asm_copy_to_user
  28.09%  [kernel]  [k] memcpy
   4.27%  [kernel]  [k] sifive_l2_flush64_range

Signed-off-by: Matteo Croce <redacted>

This patch completely breaks my Jetson TX2 system, composed of 2
Nvidia Denver and 4 Cortex-A57, in a very "funny" way.

Any significant amount of traffic results in all sorts of corruption
(ssh connections get dropped, Debian packages downloaded have the
wrong checksums) if any Denver core is involved in any significant way
(packet processing, interrupt handling). And it is all triggered by
this very change.

The only way I have to make it work on a Denver core is to route the
interrupt to that particular core and taskset the workload to it. Any
other configuration involving a Denver CPU results in some sort of
corruption. On their own, the A57s are fine.

This smells of memory ordering going really wrong, which this change
would expose. I haven't had a chance to dig into the driver yet (it
took me long enough to bisect it), but if someone points me at what is
supposed to synchronise the DMA when receiving an interrupt, I'll have
a look.

One other thing that kind of rings a bell when reading DMA and interrupts is a
recent report (and attempt to fix this) where upon resume from system
suspend, the DMA descriptors would get corrupted.

I don't think we ever figured out what exactly the problem was, but
interestingly the fix for the issue immediately caused things to go haywire on...
Jetson TX2.

I recall looking at this a bit and couldn't find where exactly the DMA was being
synchronized on suspend/resume, or what the mechanism was to ensure that
(in transit) packets were not received after the suspension of the Ethernet
device. Some information about this can be found here:

	https://lore.kernel.org/netdev/708edb92-a5df-ecc4-3126-5ab36707e275@nvidia.com/

It's interesting that this happens only on Jetson TX2. Apparently on the newer
Jetson AGX Xavier this problem does not occur. I think Jon also narrowed this
down to being related to the IOMMU being enabled on Jetson TX2, whereas
Jetson AGX Xavier didn't have it enabled. I wasn't able to find any notes on
whether disabling the IOMMU on Jetson TX2 did anything to improve on this,
so perhaps that's something worth trying.

We have since enabled the IOMMU on Jetson AGX Xavier, and I haven't seen
any test reports indicating that this is causing issues. So I don't think this has
anything directly to do with the IOMMU support.

That said, if these problems are all exclusive to Jetson TX2, or rather Tegra186,
that could indicate that we're missing something at a more fundamental level
(maybe some cache maintenance quirk?).

Hey Thierry,

Please also let me know if you find the root cause; that would be appreciated!
I have not upstreamed the fix you mentioned yet because of your continued NACK.

Thanks in advance 😊

Best Regards,
Joakim Zhang
Thierry
---
 drivers/net/ethernet/stmicro/stmmac/stmmac.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index b6cd43eda7ac..04bdb3950d63 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -338,9 +338,9 @@ static inline bool stmmac_xdp_is_enabled(struct stmmac_priv *priv)
 static inline unsigned int stmmac_rx_offset(struct stmmac_priv *priv)
 {
 	if (stmmac_xdp_is_enabled(priv))
-		return XDP_PACKET_HEADROOM;
+		return XDP_PACKET_HEADROOM + NET_IP_ALIGN;

-	return 0;
+	return NET_SKB_PAD + NET_IP_ALIGN;
 }

 void stmmac_disable_rx_queue(struct stmmac_priv *priv, u32 queue);
--
2.31.1
--
Without deviation from the norm, progress is not possible.
_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv