Re: [net-next PATCH v6] net: macb: Fix several edge cases in validate
From: Sean Anderson <hidden>
Date: 2021-11-18 20:20:26
Hi Russell, On 11/18/21 10:34 AM, Russell King (Oracle) wrote:
On Wed, Nov 17, 2021 at 12:22:26AM +0000, Russell King (Oracle) wrote:quoted
On Tue, Nov 16, 2021 at 05:56:43PM -0500, Sean Anderson wrote:quoted
Hi Russell, I have a few questions/comments about your tree (and pl in general). This is not particularly relevant to the above patch, but this is as good a place as any to ask. What is the intent for the supported link modes in validate()? The docs sayThe _link_ modes describe what gets reported to userspace via the ethtool APIs, and therefore what appears in ethtool as the supported and advertised capabilities for the media, whatever the "media" is defined to be. Generally, the "media" is what the user gets to play with to connect two network interfaces together - so twisted pair, fibre, direct-attach cable, etc.quoted
quoted
Note that the PHY may be able to transform from one connection technology to another, so, eg, don't clear 1000BaseX just because the MAC is unable to BaseX mode. This is more about clearing unsupported speeds and duplex settings. The port modes should not be cleared; phylink_set_port_modes() will help with this.But this is not how validate() has been/is currently implemented in many drivers. In 34ae2c09d46a ("net: phylink: add generic validate implementation"), it appears you are hewing closer to the documented purpose (e.g. MAC_1000FD selects all the full-duplex 1G link modes). Should new code try to stick to the above documentation?I try to encourage new code to stick to this - and this is one of the motivations behind moving to this new model, so people don't make these kinds of mistakes. In the case of nothing between the MAC and the media performing any kind of speed conversion, the MAC itself doesn't have much to do with which ethtool link modes are supported - and here's why. Take a gigabit capable MAC that is connected via SGMII to a PHY that supports both conventional twisted-pair media and fiber. If the twisted-pair port is in use at 1G speeds, then we're using 1000base-T. If the fiber port is being used, then we're using 1000base-X. The protocol between the PHY and MAC makes no difference to what link modes are supported. A more extreme case could be: a 10G MAC connected to a backplane PHY via 10G BASE-KR which is then connected to a PHY that connects to conventional twisted-pair media. Or a multi-speed PHY where it switches between SGMII, 2500BASE-X, 5GBASE-R, 10GBASE-R depending on the results of negotiation on the twisted-pair media. The MAC supports operating at 10M, 100M, 1G, 2.5G, 5G, and 10G speeds, and can select between PCS that support SGMII, 2500BASE-X, 5GBASE-R and 10GBASE-R. However, ultimately for userspace, what matters is the media capabilities - the base-T* ethtool link modes. 2500base-X in this situation doesn't come up unless the PHY offers 2500base-X on the media. The same PHY might be embedded within a SFP module, and that SFP module might be plugged into a cage where the MAC is unable to support the faster speeds - maybe it is only capable of up to 2.5G speeds. In which case, the system supports up to 2500BASE-T. So you can see, the MAC side has little relevance for link modes except for defining the speeds and duplexes that can be supported. The type of media (-T, -X, -*R) is of no concern at this stage. It is of little concern at the PCS except when the PCS is the endpoint for connecting to the media (like it is in _some_ 802.3z connections.) I say "some" because its entirely possible to use 1000base-X to talk to a PHY that connects to 1000base-T media (and there are SFPs around that will do this by default.)
Of course, since 1000BASE-X is not an electrical specification, this is really more like using 1000BASE-CX to 1000BASE-T :)
quoted
quoted
Of course, the above leaves me quite confused about where the best place is to let the PCS have a say about what things are supported, and (as discussed below) whether it can support such a thing. The general perspective taken in existing drivers seems to be that PCSs are integrated with the MAC. This is in contrast to the IEEE specs, which take the pespective that the PCS is a part of the PHY. It's unclear to me what stance the above documentation takes.Things can get quite complex, and I have to say the IEEE specs give a simplified view. When you have a SGMII link to a PHY that then connects to twisted pair media, you actually have multiple PCS: PHY /----------------------\ MAC -- PCS ---<SGMII>--- PCS --- PCS --- PMA ---- media (sgmii) (sgmii) (1000baseT) This can be seen in PHYs such as Marvell 88E151x, where the fiber interface is re-used for SGMII, and if you read the datasheet and/or read the fiber page registers, you find that this is being used for the fiber side. So the PHY can be thought of two separate PHYs back-to-back. Remember that the PCS for 1000BASE-X (and SGMII) is different from the PCS for 1000BASE-T in IEEE802.3.
Right and this is a bit of the source of the confusion. There are different levels/layers of PHYs all with their own PCS/PMA/PMD stack. Depending on what perspective you take at the time, some of these can be subsumed into each other.
quoted
The point I'm making here is that the capability of the link between the MAC and the PHY doesn't define what the media link modes are. It may define the speeds and duplexes that can be supported, and that restricts the available link modes, but it doesn't define which media "technologies" can be supported.
Right. IMO there is a lot of conflation of this concept in the current net subsystem. Realistically, the MAC should only be concerned with the phy interface mode, and perhaps duplex and speed. But! This should be the interface mode needed to talk to *the next stage in the signal path*. That is, if the MAC has GMII output and needs a separate PCS to talk 1000BASE-X or SGMII, it should only report GMII. And then the PCS can say what kind of interface it supports. However, the current model assumes that the PCS is tightly integrated, so these sorts of things are not modeled well. I don't know whether the above change would be feasable at all. Ideally, validate() would talk about interfaces modes and not link modes.
quoted
Hence, for example, the validate() implementation masking out 1000base-X but leaving 1000base-T on a *GMII link is pretty silly, because whether one or the other is supported depends ultimately what the *GMII link ends up being connected to.quoted
Consider the Xilinx 1G PCS. This PCS supports 1000BASE-X and SGMII, but only at full duplex. This caveat already rules out a completely bitmap-based solution (as phylink_get_linkmodes thinks that both of those interfaces are always full-duplex).I don't see why you say this rules out a bitmap solution. You say that it only supports full-duplex, and that is catered for in the current solution: MAC_10 for example is actually MAC_10HD | MAC_10FD - which allows one to specify that only MAC_10FD is supported and not MAC_10HD in such a scenario.
Say that you are a MAC with an integrated PCS (so you handle everything in the MAC driver). You support GMII full and half duplex, but your PCS only supports 1000BASE-X with full duplex. The naïve bitmap is supported_interfaces = PHY_INTERFACE_GMII | PHY_INTERFACE_1000BASEX; mac_capabilities = MAC_10 | MAC_100 | MAC_1000; but this will report 1000BASE-X as supporting both full and half duplex. So you still need a custom validate() in order to report the correct link modes. The tricky part comes in a scenario where the exact MAC is determined at runtime, such as the MACB+Xilinx PCS configuration.
quoted
Hmm. Also note that the validate() callback is not going away - phylink_generic_validate() is a generic implementation of this that gets rid of a lot of duplication and variability of implementation that really shouldn't be there. There are cases where the generic implementation will not be suitable, and for this phylink_get_linkmodes() can be called directly, or I'd even consider making phylink_caps_to_linkmodes() available if it is useful. Or one can do it the "old way" that we currently have.quoted
There are also config options which (either as a feature or a consequence) disable SPEED_10 SGMII or autonegotiation (although I don't plan on supporting either of those). The "easiest" solution is simply to provide two callbacks like void pcs_set_modes(struct phylink_pcs *pcs, ulong *supported, phy_interface_t interface); bool pcs_mode_supported(struct phylink_pcs *pcs, phy_interface_t interface, int speed, int duplex); perhaps with some generic substitutes. The former would need to be called from mac_validate, and the latter from mac_select_pcs/ mac_prepare. This design is rather averse to genericization, so perhaps you have some suggestion?I don't have a good answer for you at the moment - the PCS support is something that has been recently added and is still quite young, so these are the kinds of issues I'd expect to crop up.quoted
On the subject of PCS selection, mac_select_pcs should supply the whole state.That may seem like a good thing to ask for, but not even phylink knows what the full state is when calling the validation function, nor when calling mac_select_pcs. Let's take an example of the Marvell 88X3310 multi-speed PHY, which supports 10G, 5G, 2.5G, 1G, 100M and 10M on copper, and 1G and 100M on fiber, and can do all of that while connected to a single serdes connection back to the MAC. As I've said above, it does this by switching its MAC connection under its internal firmware between 10000Base-R, 5000Base-R, 2500Base-X, and SGMII. This PHY has been found to be used in platforms, and discovered to also be in SFP modules. Beyond programming the overall "host interface" mode, we don't get a choice in which mode the PHY picks - that is determined by the results of which interface comes up and autonegotiation on that interface. So, if the PHY decides to link on copper at 2500BASE-T, then we end up with the MAC link operating at 2500BASE-X, and there's nothing we can do about that. The only way to restrict this is to know ahead of time what the capabilities of the MAC and PCSes are, and to restrict the link modes that phylib gives us in both the "supported" and "advertising" fields, so the PHY will be programmed to e.g. not support 2500BASE-T on copper if 2500BASE-X is not supported by the PCS, or 2.5G speeds are not supported by the MAC. This isn't something one can do when trying to bring the link up, it's something that needs to be done when we are "putting the system together" - in other words, when we are binding the PHY into the link setup. Now, this is quite horrible right now, because for PHYs like this, phylink just asks the MAC's validate function "give me everything you can support" when working this out - which won't be sufficient going forward. With some of the changes you've prompted - making more use of the supported_interfaces bitmap, and with further adaption of phylib to also provide that information, we can start to work out which interface modes the PHY _could_ select, and we can then query the validate() function for what is possible for each of those interface modes, and use that to bound the PHY capabilities. However, at the moment, we just don't have that information available from phylib.quoted
This is because the interface alone is insufficient to determine which PCS to select. For example, a PCS which supports full duplex but not half duplex should not be selected if the config specifies half duplex. Additionally, it should also support a selection of "no pcs".Right now, "no pcs" is really not an option I'm afraid. The presence of a PCS changes the phylink behaviour slightly . This is one of my bug-bears. The evolution of phylink has meant that we need to keep compatibility with how phylink used to work before we split the PCS support - and we detect that by whether there is a PCS to determine whether we need to operate with that compatibility. It probably was a mistake to do that in hind sight.
Of course it's an option :)
Consider the MACB driver. It has two PCSs in some configurations, and
thus is a natural target for implementing the select_pcs callback.
However, in some other configurations, it has no PCSs at all. The only
way to implement select_pcs in the current design is to have two sets of
phylink_mac_ops: one with select_pcs populated, and one with it set to
NULL.
The correct check is probably something like
if (pl->mac_ops->mac_select_pcs) {
pcs = pl->mac_ops->mac_select_pcs(pl->config, state->interface);
if (!pcs && pl->pcs)
phylink_err(pl, "mac_select_pcs unexpectedly failed\n");
else if (pcs)
phylink_set_pcs(pl, pcs);
}
quoted
If we can find a way to identify the old vs new drivers that doesn't rely on the presence of a PCS, then we should be able to fix this to allow the PCS to "vanish" in certain modes, but I do question whether there would be any realistic implementations using it. If we have a PHY connected to a serdes lane back to a set of PCS to support different protocols on the serdes, then under what scenario would we select "no pcs" - doesn't "no pcs" in that situation mean "we don't know what protocol to drive the serdes link" ?quoted
Otherwise MACs which (optionally!) have PCSs will fail to configure. We should not fail when no PCS has yet been selected or when there is no PCS at all in some hardware configuration. Further, why do we have this callback in the first place? Why don't we have drivers just do this in prepare()?I added mac_select_pcs() because finding out that something isn't supported in mac_prepare() is way too late - as explained above where I talked about binding the PHY into the link setup. E.g. if the "system" as a whole can't operate at 2.5G speeds, then we should not allow the PHY to advertise 2500BASE-T. It is no good advertising 2500BASE-T, then having the PHY negotiate 2500BASE-T, select 2500BASE-X, and then have mac_prepare() decide that can't be supported. The link won't come up, and there's nothing that can be sensibly done. The user sees the PHY indicating link, the link partner indicates link, but the link is non-functional. That isn't a good user experience. Whereas, if we know ahead of time that 2.5G can't be supported, we can remove 2500BASE-T from the advertisement, and the PHY will instead negotiate a slower speed - resulting in a working link, albiet slower.
AIUI it's a bug in the driver to advertise something in validate() which it can't support. So we don't necessarily need a separate callback.
quoted
I hope that explains why it is so important not to error out in mac_prepare() because something wasn't properly handled in the validate() step.
Specifically, "I don't need a PCS for this mode" should be a valid response. I agree that "I can't select a PCS" doesn't make sense.
What I haven't described in the above (it was rather late when I was
writing that email!) is where we need to head with a PHY that does
rate adaption - and yet again an example of such a PHY is the 88X3310.
This is actually a good example for many of the issues right now.
If the 88X3310 is configured to have a host interface that only
supports 10GBASE-R, then rate adaption within the PHY is activated,
meaning the PHY is able to operate at, e.g. 10BASE-T while the host
MAC operates at 10GBASE-R. There are some provisos to this though:
1) If the 88X3310 supports MACSEC, then it has internal MACs that
are able to generate pause frames, and pause frames will be sent
over the 10GBASE-R link as necessary to control the rate at which
the MAC sends packets.
2) If the 88X3310 does not support MACSEC, then it is expected that
the MAC is paced according to the link speed to avoid overflowing
the 88X3310 internal FIFOs (what happens when the internal FIFOs
overflow is not known.) There are no pause frames generated.
(This is the case on Macchiatobin boards if we configured the PHY
for 10GBASE-R rate-adaption mode.)Well if it just drops/corrupts packets then it just looks like a lossy/congested link. And the upper layers of the network stack already expect that sort of thing (though perhaps not optimally).
We have no "real" support for rate adaption at either phylib or phylink level - phylib has no way to tell us whether rate adaption is enabled on the PHY, nor does it have a way to tell us if we either need to pace the MAC or whether to expect pause frames from the PHY. If we have a PHY in rate adaption mode, the current behaviour will be that mac_link_up() and pcs_link_up() will be passed the negotiated media parameters as "speed", "duplex" and any flow control information, which will confuse PCS and MAC drivers at the moment, because it isn't something they expect to happen. What I mean is, if we are using PHY_INTERFACE_MODE_10GBASER, then most people will expect "speed" to be SPEED_10000, but with a rate adapting PHY it may not be. In order to properly support this, we need to update the documentation at the very least to say that what gets passed to mac_link_up() for "speed" and "duplex" are the media negotiated parameters. Then we need to have a think about how to handle flow control, and this is where extending phylib to tell us whether the PHY supports rate adaption becomes important. Flow control on the MAC needs to be enabled if (the PHY has rate adaption disabled but the media negotiated flow control) or (the PHY has rate adaption enabled _and_ the PHY is capable of issuing flow control frames - presumably the PHY will respond itself to flow control) or (the PHY has rate adaption enabled and the media negotiated flow control but the PHY is not capable of issuing flow control frames). Then there's the issue of implementing transmission pacing in any MAC driver that wants to be usable with a rate adapting PHY. Lastly, there's the issue of the "speed" and "duplex" parameters passed to pcs_link_up(), which I'm currently thinking should be the interface parameters and not the media parameters. In other words, if it's a 10GBASE-R connection between the PHY and PCS, we really should not be passing the media negotiated speed there.
Right.
So, to sum up, rate adaption isn't something that is well supported in the kernel - it's possible to bodge around phylib and phylink to make it work, but this really needs to be handled properly. Rate adaption is fairly low priority at the moment as it is in a minority, although it seems we are seeing more systems that have PHYs with this feature. So, I hope these two emails have provides some useful insights.
They have, thanks. --Sean