Thread: 37 messages, 8 authors, 2021-08-20

Re: [PATCH net-next] stmmac: align RX buffers

From: Thierry Reding <hidden>
Date: 2021-08-12 14:29:16
Also in: linux-riscv, lkml

On Wed, Aug 11, 2021 at 02:23:10PM +0100, Marc Zyngier wrote:
On Wed, 11 Aug 2021 11:41:59 +0100,
Thierry Reding [off-list ref] wrote:
quoted
On Tue, Aug 10, 2021 at 08:07:47PM +0100, Marc Zyngier wrote:
quoted
Hi all,

[adding Thierry, Jon and Will to the fun]

On Mon, 14 Jun 2021 03:25:04 +0100,
Matteo Croce [off-list ref] wrote:
quoted
From: Matteo Croce <redacted>

On RX an SKB is allocated and the received buffer is copied into it.
But on some architectures, the memcpy() needs the source and destination
buffers to have the same alignment to be efficient.

This is not our case, because the SKB data pointer is misaligned by two
bytes to compensate for the Ethernet header.

Align the RX buffer the same way as the SKB one, so the copy is faster.
An iperf3 RX test gives a decent improvement on a RISC-V machine:

before:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   733 MBytes   615 Mbits/sec   88             sender
[  5]   0.00-10.01  sec   730 MBytes   612 Mbits/sec                  receiver

after:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.10 GBytes   942 Mbits/sec    0             sender
[  5]   0.00-10.00  sec  1.09 GBytes   940 Mbits/sec                  receiver

And the memcpy() overhead during the RX drops dramatically.

before:
Overhead  Shared O  Symbol
  43.35%  [kernel]  [k] memcpy
  33.77%  [kernel]  [k] __asm_copy_to_user
   3.64%  [kernel]  [k] sifive_l2_flush64_range

after:
Overhead  Shared O  Symbol
  45.40%  [kernel]  [k] __asm_copy_to_user
  28.09%  [kernel]  [k] memcpy
   4.27%  [kernel]  [k] sifive_l2_flush64_range
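
A minimal userspace sketch of the alignment argument in the description
above, not the actual stmmac change. It only models addresses: the SKB data
pointer sits two bytes past a word boundary, so the copy can proceed word by
word only if the RX buffer payload starts at the same offset. The names
SKETCH_NET_IP_ALIGN, SKETCH_WORD, word_offset(), copy_is_coaligned() and
rx_payload() are hypothetical; the value 2 mirrors the "misaligned by two
bytes" wording, and the 8-byte word size is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the SKB data pointer is offset by two bytes, as described. */
#define SKETCH_NET_IP_ALIGN 2
/* Assumption: the copy routine moves 8-byte words when it can. */
#define SKETCH_WORD 8

/* Offset of an address within a copy word. */
static size_t word_offset(uintptr_t addr)
{
    return (size_t)(addr % SKETCH_WORD);
}

/* 1 when both pointers share the same offset within a word, so that a
 * short head copy brings both of them to a word boundary together. */
static int copy_is_coaligned(uintptr_t src, uintptr_t dst)
{
    return word_offset(src) == word_offset(dst);
}

/* Start of the received frame inside an RX buffer, with or without the
 * two-byte pad in front of it. */
static uintptr_t rx_payload(uintptr_t buf_base, int pad)
{
    return buf_base + (pad ? SKETCH_NET_IP_ALIGN : 0);
}

/* Worked example using made-up, word-aligned base addresses. */
static void alignment_example(void)
{
    uintptr_t skb_data = 0x2000 + SKETCH_NET_IP_ALIGN;

    /* Unpadded RX buffer: offsets 0 and 2 differ. */
    assert(!copy_is_coaligned(rx_payload(0x1000, 0), skb_data));
    /* Padded RX buffer: both offsets are 2. */
    assert(copy_is_coaligned(rx_payload(0x1000, 1), skb_data));
}
```

Calling alignment_example() runs both checks. Whether a given memcpy()
implementation actually benefits from matching offsets depends on the
architecture, which is why the commit message limits the claim to "some
architectures".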

Signed-off-by: Matteo Croce <redacted>
This patch completely breaks my Jetson TX2 system, composed of 2
Nvidia Denver and 4 Cortex-A57, in a very "funny" way.

Any significant amount of traffic results in all sorts of corruption
(ssh connections get dropped, Debian packages downloaded have the
wrong checksums) if any Denver core is involved in any significant way
(packet processing, interrupt handling). And it is all triggered by
this very change.

The only way I have to make it work on a Denver core is to route the
interrupt to that particular core and taskset the workload to it. Any
other configuration involving a Denver CPU results in some sort of
corruption. On their own, the A57s are fine.

This smells of memory ordering going really wrong, which this change
would expose. I haven't had a chance to dig into the driver yet (it
took me long enough to bisect it), but if someone points me at what is
supposed to synchronise the DMA when receiving an interrupt, I'll have
a look.
One other thing that kind of rings a bell when reading DMA and
interrupts is a recent report (and attempt to fix this) where upon
resume from system suspend, the DMA descriptors would get corrupted.

I don't think we ever figured out what exactly the problem was, but
interestingly the fix for the issue immediately caused things to go
haywire on... Jetson TX2.
I love this machine... Did this issue occur with the Denver CPUs
disabled?
Interestingly I've been doing some work on a newer device called Jetson
TX2 NX (which is kind of a trimmed-down version of Jetson TX2, in the
spirit of the Jetson Nano) and I can't seem to reproduce these failures
there (tested on next-20210812).

I'll go dig out my Jetson TX2 to run the same tests there, because I've
also been using a development version of the bootloader stack and
flashing tools and all that, so it's possible that something was fixed
at that level. I don't think I've ever tried disabling the Denver CPUs,
but then I've also never seen these issues myself.

Just out of curiosity, what version of the BSP have you been using to
flash?

One other thing that I ran into: there's a known issue with the PHY
configuration. We mark the PHY as "rgmii-id" on most devices, and then
the Marvell PHY driver needs to be enabled. Jetson TX2
has phy-mode = "rgmii", so it /should/ work okay.

Typically what we're seeing with that misconfiguration is that the
device fails to get an IP address, but it might still be worth trying to
switch Jetson TX2 to rgmii-id and using the Marvell PHY, to see if that
improves anything.
quoted
I recall looking at this a bit and couldn't find where exactly the DMA
was being synchronized on suspend/resume, or what the mechanism was to
ensure that (in transit) packets were not received after the suspension
of the Ethernet device. Some information about this can be found here:

	https://lore.kernel.org/netdev/708edb92-a5df-ecc4-3126-5ab36707e275@nvidia.com/

It's interesting that this happens only on Jetson TX2. Apparently on the
newer Jetson AGX Xavier this problem does not occur. I think Jon also
narrowed this down to being related to the IOMMU being enabled on Jetson
TX2, whereas Jetson AGX Xavier didn't have it enabled. I wasn't able to
find any notes on whether disabling the IOMMU on Jetson TX2 did anything
to improve on this, so perhaps that's something worth trying.
Actually, I was running with the SMMU disabled, as I use the upstream
u-boot provided DT. Switching to the kernel one didn't change a thing
(with passthrough or not).
quoted
We have since enabled the IOMMU on Jetson AGX Xavier, and I haven't seen
any test reports indicating that this is causing issues. So I don't
think this has anything directly to do with the IOMMU support.
No, it looks more like either ordering or cache management. The fact
that this patch messes with the buffer alignment makes me favour the
latter...
quoted
That said, if these problems are all exclusive to Jetson TX2, or rather
Tegra186, that could indicate that we're missing something at a more
fundamental level (maybe some cache maintenance quirk?).
That'd be pretty annoying. Do you know if the Ethernet is a coherent
device on this machine, or does it need active cache maintenance?
I don't think Ethernet is a coherent device on Tegra186. I think
Tegra194 had various improvements with regard to coherency, but most
devices on Tegra186 do need active cache maintenance.

Let me dig through some old patches and mailing list threads. I vaguely
recall prototyping a patch that did something special for outer cache
flushing, but that may have been Tegra132, not Tegra186. I also don't
think we ended up merging that because it turned out to not be needed.

Thierry
