Re: [PATCH] mtd: spi-nor: only apply reset hacks to broken hardware

From: NeilBrown <hidden>
Date: 2018-08-01 00:38:25

On Tue, Jul 31 2018, Boris Brezillon wrote:

On Tue, 31 Jul 2018 11:05:11 +1000
NeilBrown [off-list ref] wrote:

On Fri, Jul 27 2018, Boris Brezillon wrote:

On Fri, 27 Jul 2018 11:33:13 -0700
Brian Norris [off-list ref] wrote:

Commit 59b356ffd0b0 ("mtd: m25p80: restore the status of SPI flash when
exiting") is the latest from a long history of attempts to add reboot
handling to handle stateful addressing modes on SPI flash. Some prior
mostly-related discussions:

http://lists.infradead.org/pipermail/linux-mtd/2013-March/046343.html
[PATCH 1/3] mtd: m25p80: utilize dedicated 4-byte addressing commands

http://lists.infradead.org/pipermail/barebox/2014-September/020682.html
[RFC] MTD m25p80 3-byte addressing and boot problem

http://lists.infradead.org/pipermail/linux-mtd/2015-February/057683.html
[PATCH 2/2] m25p80: if supported put chip to deep power down if not used

Previously, attempts to add reboot-time software reset handling were
rejected, but the latest attempt was not.

Quick summary of the problem:
Some systems (e.g., boot ROM or bootloader) assume that they can read
initial boot code from their SPI flash using 3-byte addressing. If the
flash is left in 4-byte mode after reset, these systems won't boot. The
above patch provided a shutdown/remove hook to attempt to reset the
addressing mode before we reboot. Notably, this patch misses out on
huge classes of unexpected reboots (e.g., crashes, watchdog resets).

Unfortunately, it is essentially impossible to solve this problem 100%:
if your system doesn't know how to reset the SPI flash to power-on
defaults at initialization time, no amount of software can really rescue
you -- there will always be a chance of some unexpected reset that
leaves your flash in an addressing mode that your boot sequence didn't
expect.
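
To make the failure mode above concrete, here is a minimal user-space sketch. It is not kernel code; build_read_frame() and decode_as_4byte() are hypothetical helpers invented for illustration. It assumes the standard READ opcode 0x03 and models a chip left in 4-byte mode as simply consuming one more byte from the bus as the low address byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SPI_NOR_OP_READ 0x03 /* standard READ opcode */

/*
 * Build the bytes a host clocks out for a READ: the opcode followed by
 * the address, most significant byte first. addr_width must be 3 or 4.
 * Returns the number of bytes written; buf must hold at least 5 bytes.
 */
static size_t build_read_frame(uint8_t *buf, uint32_t addr, int addr_width)
{
	size_t n = 0;
	int i;

	assert(addr_width == 3 || addr_width == 4);
	buf[n++] = SPI_NOR_OP_READ;
	for (i = addr_width - 1; i >= 0; i--)
		buf[n++] = (uint8_t)(addr >> (8 * i));
	return n;
}

/*
 * Simplified model (an assumption, not a datasheet excerpt) of a chip
 * that is still in 4-byte mode: it always consumes four address bytes.
 * If the host sent fewer, the missing byte is whatever is on the bus
 * next, represented here by 'filler'.
 */
static uint32_t decode_as_4byte(const uint8_t *frame, size_t len, uint8_t filler)
{
	uint32_t addr = 0;
	size_t i;

	for (i = 1; i <= 4; i++)
		addr = (addr << 8) | (i < len ? frame[i] : filler);
	return addr;
}
```

Under this model, a boot ROM that builds a 3-byte frame for offset 0x001000 is decoded by the chip as a different, shifted address, so the boot code it fetches is not the code it asked for.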

While it is not directly harmful to perform hacks like the
aforementioned commit on all 4-byte addressing flash, a
properly-designed system should not need the hack -- and in fact,
providing this hack may mask the fact that a given system is indeed
broken. So this patch attempts to apply this unsound hack more narrowly,
providing a strong suggestion to developers and system designers that
this is truly a hack. With luck, system designers can catch their errors
early on in their development cycle, rather than applying this hack long
term. But apparently enough systems are out in the wild that we still
have to provide this hack.

Document a new device tree property to denote systems that do not have a
proper hardware (or software) reset mechanism, and apply the hack (with
a loud warning) only in this case.
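
The gating described above can be sketched roughly as follows. This is a simplified user-space model, not the patch itself: struct flash_model, flash_probe_warn() and flash_restore_on_shutdown() are hypothetical names, the warning text is illustrative, and the broken_flash_reset field merely stands in for the presence of the new device tree property.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified model; this is not the kernel's struct spi_nor. */
struct flash_model {
	int addr_width;          /* 3 or 4: addressing mode the chip is in */
	bool broken_flash_reset; /* stands in for the DT property being set */
};

/*
 * Probe-time notice: complain only when the workaround will actually be
 * armed, i.e. the chip runs in 4-byte mode on a board that declared it
 * has no proper reset. Returns true if a warning was printed.
 */
static bool flash_probe_warn(const struct flash_model *f)
{
	if (f->addr_width == 4 && f->broken_flash_reset) {
		fprintf(stderr, "warning: reset workaround enabled; "
				"unexpected resets may still leave the board unbootable\n");
		return true;
	}
	return false;
}

/*
 * Shutdown/remove hook: return the chip to 3-byte mode only on boards
 * that opted in. Returns true if the workaround was applied.
 */
static bool flash_restore_on_shutdown(struct flash_model *f)
{
	assert(f->addr_width == 3 || f->addr_width == 4);
	if (f->addr_width != 4 || !f->broken_flash_reset)
		return false;
	f->addr_width = 3; /* stands in for the "exit 4-byte mode" command */
	return true;
}
```

The point of the shape is that a board without the property gets neither the warning nor the restore, so a design that depends on the hack is noticed the first time it fails to reboot.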

Signed-off-by: Brian Norris <computersforpeace@gmail.com>
---
Note that I intentionally didn't split the documentation patch. It seems
clearer to do these together IMO, but if it's *really* important to
someone...I can resend

I'm fine with that.

I'll leave Neil some time to review/test/comment on the patch before
queuing it, but it looks good to me.

Thanks.
I can confirm that if I apply this patch, my system won't reboot
properly (as expected), and if I then add

		broken-flash-reset;

to the jedec,spi-nor device, it starts functioning correctly again.

I don't like the pejorative "broken", and it also suggests that a thing
used to work, but something happened to break it - this is not
accurate.
I would prefer something like "reset-not-connected" which is an accurate
description of the state of the hardware.

I also think that having a WARN_ON is an over-reaction.  Certainly a
warning could be appropriate, but just one pr_warn() should be enough.
The "problem" is unlikely in practice, and loudly warning people that an
asteroid might kill them isn't particularly helpful.

I genuinely think that if the system fails to reboot, then Linux is at
fault. I accept that changing Linux to be completely robust might be
more trouble than it is worth, but I don't accept that it is impossible.

But I don't intend to fight either of these battles.

Does that mean you're accepting this change? Brian, any comment on what
Neil said?

I don't see that it is my place to accept or reject the change.
I don't particularly like it, but I hope to never look at this code
again so you shouldn't put too much weight on what I like.

To be honest, I hate being in the middle of this discussion without
having been involved in the first decision to accept such workarounds.
I keep thinking that making boards that do not have reset properly
wired less likely to fail rebooting is a wise decision, but I also
agree with Brian when he says we should inform people that their design
is unreliable.
The main problem I see here is that adding this prop won't help people
figure out what is wrong with their design; it will just help them
work around the problem when they find out, and it might already be too
late to fix the HW design. But maybe it's not what we're trying to do
here. Maybe we just want to warn users that rebooting such boards is a
risky procedure.

Simply rebooting the board is not a risky procedure.
The risk is that if something causes Linux to "crash", it may not reboot
properly.

Thanks,
NeilBrown
