Re: fsl_espi errors on v5.7.15
From: Chris Packham <Chris.Packham@alliedtelesis.co.nz>
Date: 2020-09-06 23:22:37
Subsystem:
spi subsystem, the rest · Maintainers:
Mark Brown, Linus Torvalds
On 5/09/20 5:23 am, Heiner Kallweit wrote: On Fri 4. Sep 2020 at 01:58, Chris Packham <Chris.Packham@alliedtelesis.co.nz<mailto:Chris.Packham@alliedtelesis.co.nz>> wrote: On 1/09/20 6:14 pm, Nicholas Piggin wrote:
Excerpts from Chris Packham's message of September 1, 2020 11:25 am:
quoted
On 1/09/20 12:33 am, Heiner Kallweit wrote:
quoted
quoted
On 30.08.2020 23:59, Chris Packham wrote:
quoted
quoted
quoted
On 31/08/20 9:41 am, Heiner Kallweit wrote:
quoted
quoted
quoted
quoted
On 30.08.2020 23:00, Chris Packham wrote:
quoted
quoted
quoted
quoted
quoted
On 31/08/20 12:30 am, Nicholas Piggin wrote:
quoted
quoted
quoted
quoted
quoted
quoted
Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
quoted
quoted
quoted
quoted
quoted
<snip>
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
I've also now seen the RX FIFO not empty error on the T2080RDB
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
With my current workaround of emptying the RX FIFO. It seems
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
survivable. Interestingly it only ever seems to be 1 extra byte in the
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: tx 70
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: rx 03
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: Extra RX 00
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: tx 05
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: rx 00
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: Extra RX 03
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: tx 05
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: rx 00
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: Extra RX 03
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
From all the Micron SPI-NOR datasheets I've got access to it is
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
possible to continually read the SR/FSR. But I've no idea why it
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
happens some times and not others.
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
So I think I've got a reproduction and I think I've bisected the problem
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
C"). My day is just finishing now so I haven't applied too much scrutiny
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
to this result. Given the various rabbit holes I've been down on this
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
issue already I'd take this information with a good degree of skepticism.
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
OK, so an easy test should be to re-test with a 5.4 kernel.
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
It doesn't have yet the change you're referring to, and the fsl-espi driver
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
is basically the same as in 5.7 (just two small changes in 5.7).
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
There's 6cc0c16d82f88 and maybe also other interrupt related patches
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
around this time that could affect book E, so it's good if that exact
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
patch is confirmed.
quoted
quoted
quoted
quoted
quoted
quoted
quoted
My confirmation is basically that I can induce the issue in a 5.4 kernel
quoted
quoted
quoted
quoted
quoted
quoted
quoted
by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
quoted
quoted
quoted
quoted
quoted
quoted
quoted
5.9-rc2 by reverting that one commit.
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
I both cases it's not exactly a clean cherry-pick/revert so I also
quoted
quoted
quoted
quoted
quoted
quoted
quoted
confirmed the bisection result by building at 3282a3da25bd (which sees
quoted
quoted
quoted
quoted
quoted
quoted
quoted
the issue) and the commit just before (which does not).
quoted
quoted
quoted
quoted
quoted
quoted
Thanks for testing, that confirms it well.
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[snip patch]
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
quoted
quoted
quoted
quoted
quoted
quoted
quoted
didn't report anything (either with or without the change above).
quoted
quoted
quoted
quoted
quoted
quoted
Okay, it was a bit of a shot in the dark. I still can't see what
quoted
quoted
quoted
quoted
quoted
quoted
else has changed.
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
What would cause this, a lost interrupt? A spurious interrupt? Or
quoted
quoted
quoted
quoted
quoted
quoted
higher interrupt latency?
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
I don't think the patch should cause significantly worse latency,
quoted
quoted
quoted
quoted
quoted
quoted
(it's supposed to be a bit better if anything because it doesn't set
quoted
quoted
quoted
quoted
quoted
quoted
up the full interrupt frame). But it's possible.
quoted
quoted
quoted
quoted
quoted
My working theory is that the SPI_DON indication is all about the TX
quoted
quoted
quoted
quoted
quoted
direction an now that the interrupts are faster we're hitting an error
quoted
quoted
quoted
quoted
quoted
because there is still RX activity going on. Heiner disagrees with my
quoted
quoted
quoted
quoted
quoted
interpretation of the SPI_DON indication and the fact that it doesn't
quoted
quoted
quoted
quoted
quoted
happen every time does throw doubt on it.
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
It's right that the eSPI spec can be interpreted that SPI_DON refers to
quoted
quoted
quoted
quoted
TX only. However this wouldn't really make sense, because also for RX
quoted
quoted
quoted
quoted
we program the frame length, and therefore want to be notified once the
quoted
quoted
quoted
quoted
full frame was received. Also practical experience shows that SPI_DON
quoted
quoted
quoted
quoted
is set also after RX-only transfers.
quoted
quoted
quoted
quoted
Typical SPI NOR use case is that you write read command + start address,
quoted
quoted
quoted
quoted
followed by a longer read. If the TX-only interpretation would be right,
quoted
quoted
quoted
quoted
we'd always end up with SPI_DON not being set.
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
I can't really explain the extra RX byte in the fifo. We know how many
quoted
quoted
quoted
quoted
quoted
bytes to expect and we pull that many from the fifo so it's not as if
quoted
quoted
quoted
quoted
quoted
we're missing an interrupt causing us to skip the last byte. I've been
quoted
quoted
quoted
quoted
quoted
looking for some kind of off-by-one calculation but again if it were
quoted
quoted
quoted
quoted
quoted
something like that it'd happen all the time.
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
Maybe it helps to know what value this extra byte in the FIFO has. Is it:
quoted
quoted
quoted
quoted
- a duplicate of the last read byte
quoted
quoted
quoted
quoted
- or the next byte (at <end address> + 1)
quoted
quoted
quoted
quoted
- or a fixed value, e.g. always 0x00 or 0xff
quoted
quoted
quoted
The values were up thread a bit but I'll repeat them here
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: tx 70
quoted
quoted
quoted
fsl_espi ffe110000.spi: rx 03
quoted
quoted
quoted
fsl_espi ffe110000.spi: Extra RX 00
quoted
quoted
quoted
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
quoted
quoted
quoted
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
quoted
quoted
quoted
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
quoted
quoted
quoted
fsl_espi ffe110000.spi: tx 05
quoted
quoted
quoted
fsl_espi ffe110000.spi: rx 00
quoted
quoted
quoted
fsl_espi ffe110000.spi: Extra RX 03
quoted
quoted
quoted
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
quoted
quoted
quoted
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
quoted
quoted
quoted
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
quoted
quoted
quoted
fsl_espi ffe110000.spi: tx 05
quoted
quoted
quoted
fsl_espi ffe110000.spi: rx 00
quoted
quoted
quoted
fsl_espi ffe110000.spi: Extra RX 03
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with
quoted
quoted
quoted
either a READ_SR or a READ_FSR. Never a data read.
quoted
quoted
quoted
quoted
quoted
Just remembered something about SPIE_DON:
quoted
quoted
Transfers are always full duplex, therefore in case of a read the chip
quoted
quoted
sends dummy zero's. Having said that in case of a read SPIE_DON means
quoted
quoted
that the last dummy zero was shifted out.
quoted
quoted
quoted
quoted
READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in.
quoted
quoted
So the issue may have a dependency on the length of the transfer.
quoted
quoted
However I see no good explanation so far. You can try adding a delay of
quoted
quoted
a few miroseconds between the following to commands in fsl_espi_bufs().
quoted
quoted
quoted
quoted
fsl_espi_write_reg(espi, ESPI_SPIM, mask);
quoted
quoted
quoted
quoted
/* Prevent filling the fifo from getting interrupted */
quoted
quoted
spin_lock_irq(&espi->lock);
quoted
quoted
quoted
quoted
Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close.
quoted
I think this might be heading in the right direction. Playing about with
quoted
a delay does seem to make the two symptoms less likely. Although I have
quoted
to set it quite high (i.e. msleep(100)) to completely avoid any
quoted
possibility of seeing either message.
The patch might replay the interrupt a little bit faster, but it would
be a few microseconds at most I think (just from improved code).
Would you be able to ftrace the interrupt handler function and see if you
can see a difference in number or timing of interrupts? I'm at a bit of
a loss.
I tried ftrace but I really wasn't sure what I was looking for. Capturing a "bad" case was pretty tricky. But I think I've identified a fix (I'll send it as a proper patch shortly). The gist is
diff --git a/drivers/spi/spi-fsl-espi.c b/drivers/spi/spi-fsl-espi.c
index 7e7c92cafdbb..cb120b68c0e2 100644
--- a/drivers/spi/spi-fsl-espi.c
+++ b/drivers/spi/spi-fsl-espi.c
@@ -574,13 +574,14 @@ static void fsl_espi_cpu_irq(struct fsl_espi
*espi, u32 events)
static irqreturn_t fsl_espi_irq(s32 irq, void *context_data)
{
struct fsl_espi *espi = context_data;
- u32 events;
+ u32 events, mask;
spin_lock(&espi->lock);
/* Get interrupt events(tx/rx) */
events = fsl_espi_read_reg(espi, ESPI_SPIE);
- if (!events) {
+ mask = fsl_espi_read_reg(espi, ESPI_SPIM);
+ if (!(events & mask)) {
spin_unlock(&espi->lock);
return IRQ_NONE;
}
The SPIE register contains the TXCNT so events is pretty much always
going to have something set. By checking events against what we've
actually requested interrupts for we don't see any spurious events.
Usually we shouldn’t receive interrupts we’re not interested in, except the interrupt is shared.
My theory is that we get an interrupt to the core for RXT and another for DON. With the changes to the powerpc interrupt handling and the fact that fsl_espi_cpu_irq() doesn't actually look at the specific event bits means that sometimes both events are handled in the processing of the first interrupt and the second one trips the SPI_DON not set error.
There's an old comment "SPI bus sometimes got lost interrupts..." which makes me wonder if the interrupt handling change has fixed this original problem.
I still think the change makes sense regardless because the SPIE register has some counter fields in it.
This leads to the question: is the SPI interrupt shared with another device on your system?
It's not shared on either the custom board or the T2080RDB.
Do you see spurious interrupts with the patch under /proc/irq/(irq)/spurious?
Yes it looks like it
[root@linuxbox ~]# cat /proc/irq/53/spurious
count 3126
unhandled 0
last_unhandled 0 ms
[root@linuxbox ~]# /flash/dd_test.sh
[root@linuxbox ~]# cat /proc/irq/53/spurious
count 1016
unhandled 0
last_unhandled 4294746100 ms
[root@linuxbox ~]# /flash/dd_test.sh
[root@linuxbox ~]# cat /proc/irq/53/spurious
count 88391
unhandled 0
last_unhandled 4294746100 ms
[root@linuxbox ~]# /flash/dd_test.sh
[root@linuxbox ~]# cat /proc/irq/53/spurious
count 72459
unhandled 2
last_unhandled 4294758632 ms
I've tested this on the T2080RDB and on our custom hardware and it seems
to resolve the problem.