Re: [PATCH net-next 1/2] net: dsa: mxl862xx: add CRC for MDIO communication

From: Daniel Golle <daniel@makrotopia.org>
Date: 2026-03-21 19:51:13
Also in: lkml

On Sat, Mar 21, 2026 at 08:29:20PM +0100, Andrew Lunn wrote:

quoted

This is the question, what to do when you see a checksum failure? The
basic assumption is MDIO is reliable. PHYs don't do any sort of
checksums, nor any other switches using MDIO. Why would you want
checksums?

To detect the hardware is broken? If so, is returning EIO sufficient?
Would it not be better to admin down all the interfaces?

To allow the MDIO clock to run at a higher frequency, at the limit of
the bus, so you get occasionally failures? If so, should you not
retry? But are the commands idempotent? Can you safely retry?

Your guesses are all correct, and your concerns are justified.

Let me explain the whole picture:
The switch driver transfers many rather large data structures over
MDIO, and lacking support for interrupts (firmware doesn't support),
this is often even interleaved with polling 8 PHYs and at least one
PCS. To not have things get very slow (and then even sometimes see
-ETIMEDOUT breaking the PHY state machine if timing is unlucky), it
is common to overclock the MDIO bus to 20 MHz instead of the 2.5 MHz
default (the switch claims to support up to 25 MHz, but 20 Mhz is
sufficient and conservative enough).

That, combined with higher temperature of the switch IC (but still
within the spec'ed range), can lead to bit-errors -- which, in case
they remain unnoticed can introduce subtle (security relevant) issues
such as bridging ports which should not be bridged or flooding to a
port on which flooding should be disabled.

O.K. "Interesting" design.

🤐

You could solve the PHY timeout issue by claiming to support PHY
interrupts, doing the polling in the DSA driver, and raise an
interrupt if the conditions are met. The mv88e6xxx driver does
something like this. It has an interrupt controller which the PHYs are
connect to. Some designs have the switch interrupt output connected to
a GPIO and so can do real interrupts. Some don't. Rather than have all
the internal PHYs polled one per second by phylib, the mv88e6xxx polls
the interrupt status register every 1/5 of a second and raises the
interrupts instead. Bot faster, and less MDIO transfers.

Sadly, the firmware doesn't provide any API for that. The *hardware*
would likely be capable, even has an external pin to signal interrupts
to the host. But no firmware API to setup/mask interrupts, nor to read
or clear the interrupt status.

Raw access to the relevant hardware registers representing the
interrupt controller also isn't an option, as the only documented raw
register access interface only covers the switch engine itself, and
another (undocumented) more broad debugging register access API only
allows write operations on a very limited range of registers (and the
firmware applies bitmasks for each of them individually) basically
covering the GPIO and LED controller parts...

Plus, most likely interrupts of the built-in PHYs are as broken as for
all other GPY2xx PHYs (see PHY driver comments), because it's more or
less the same IP.

tl;dr: Due to firmware and hardware limitation that's not an option.

I assume the vendor driver does not attempt a retry?

The vendor driver doesn't retry, it merely reports CRC errors.
I thought about it, and retrying would be an option, as on pure WRITE
operations the host could retry sending (as the original request is
dropped by the firmware in case of a CRC error), and READ operations
could only repeat the data retreaval part, but not the actual command
itself. However, I don't think any of that is worth the added complexity.
If the we end up with CRC errors, something is very wrong, most likely
overheat, so trying to forcefully continue normal operation imho isn't
even a good idea even if somehow possible.

quoted

Setting all interfaces to admin-down is probably the best compromise
in a case like this, as it would also reduce the energy consumption
and hence heat emission of the IC (as all built-in PHYs are then down;
that's where most of the heat comes from) and prevent damage -- I've
only observed CRC errors with the heatsink removed and artifically
overheating the IC...

So in the normal use cases you don't expect CRC errors. That seems
like it should driver the design. Consider any CRCs as fatal and
shutdown.

Ack, I also think that's the most reasonable thing to do.

As the shutdown itself can trigger more CRC errors (PHYs being put
into PDOWN, for example, will cause dozends of MDIO operations to
indirectly access the internal MDIO bus via the firmware interface...)
I will guard this using an atomic flag to prevent the ongoing shutdown
from triggering another shutdown...

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help