Re: fsl_espi errors on v5.7.15

From: Chris Packham <Chris.Packham@alliedtelesis.co.nz>
Date: 2020-09-06 23:22:37
Subsystem: spi subsystem, the rest · Maintainers: Mark Brown, Linus Torvalds

On 5/09/20 5:23 am, Heiner Kallweit wrote:


On Fri 4. Sep 2020 at 01:58, Chris Packham <Chris.Packham@alliedtelesis.co.nz<mailto:Chris.Packham@alliedtelesis.co.nz>> wrote:


On 1/09/20 6:14 pm, Nicholas Piggin wrote:

Excerpts from Chris Packham's message of September 1, 2020 11:25 am:

quoted

On 1/09/20 12:33 am, Heiner Kallweit wrote:

quoted

On 30.08.2020 23:59, Chris Packham wrote:

quoted

On 31/08/20 9:41 am, Heiner Kallweit wrote:

quoted
quoted
quoted
quoted
On 30.08.2020 23:00, Chris Packham wrote:

quoted
quoted
quoted
quoted
quoted
On 31/08/20 12:30 am, Nicholas Piggin wrote:

quoted
quoted
quoted
quoted
quoted
quoted
Excerpts from Chris Packham's message of August 28, 2020 8:07 am:

quoted
quoted
quoted
quoted
quoted
<snip>

quoted
quoted
quoted
quoted
quoted

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
I've also now seen the RX FIFO not empty error on the T2080RDB

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
With my current workaround of emptying the RX FIFO. It seems

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
survivable. Interestingly it only ever seems to be 1 extra byte in the

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
RX FIFO and it seems to be after either a READ_SR or a READ_FSR.

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: tx 70

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: rx 03

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: Extra RX 00

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: tx 05

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: rx 00

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: Extra RX 03

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: tx 05

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: rx 00

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
fsl_espi ffe110000.spi: Extra RX 03

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
     From all the Micron SPI-NOR datasheets I've got access to it is

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
possible to continually read the SR/FSR. But I've no idea why it

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
happens some times and not others.

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
So I think I've got a reproduction and I think I've bisected the problem

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
C"). My day is just finishing now so I haven't applied too much scrutiny

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
to this result. Given the various rabbit holes I've been down on this

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
issue already I'd take this information with a good degree of skepticism.

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
OK, so an easy test should be to re-test with a 5.4 kernel.

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
It doesn't have yet the change you're referring to, and the fsl-espi driver

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
is basically the same as in 5.7 (just two small changes in 5.7).

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
There's 6cc0c16d82f88 and maybe also other interrupt related patches

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
around this time that could affect book E, so it's good if that exact

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
patch is confirmed.

quoted
quoted
quoted
quoted
quoted
quoted
quoted
My confirmation is basically that I can induce the issue in a 5.4 kernel

quoted
quoted
quoted
quoted
quoted
quoted
quoted
by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in

quoted
quoted
quoted
quoted
quoted
quoted
quoted
5.9-rc2 by reverting that one commit.

quoted
quoted
quoted
quoted
quoted
quoted
quoted

quoted
quoted
quoted
quoted
quoted
quoted
quoted
I both cases it's not exactly a clean cherry-pick/revert so I also

quoted
quoted
quoted
quoted
quoted
quoted
quoted
confirmed the bisection result by building at 3282a3da25bd (which sees

quoted
quoted
quoted
quoted
quoted
quoted
quoted
the issue) and the commit just before (which does not).

quoted
quoted
quoted
quoted
quoted
quoted
Thanks for testing, that confirms it well.

quoted
quoted
quoted
quoted
quoted
quoted

quoted
quoted
quoted
quoted
quoted
quoted
[snip patch]

quoted
quoted
quoted
quoted
quoted
quoted

quoted
quoted
quoted
quoted
quoted
quoted
quoted
I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG

quoted
quoted
quoted
quoted
quoted
quoted
quoted
didn't report anything (either with or without the change above).

quoted
quoted
quoted
quoted
quoted
quoted
Okay, it was a bit of a shot in the dark. I still can't see what

quoted
quoted
quoted
quoted
quoted
quoted
else has changed.

quoted
quoted
quoted
quoted
quoted
quoted

quoted
quoted
quoted
quoted
quoted
quoted
What would cause this, a lost interrupt? A spurious interrupt? Or

quoted
quoted
quoted
quoted
quoted
quoted
higher interrupt latency?

quoted
quoted
quoted
quoted
quoted
quoted

quoted
quoted
quoted
quoted
quoted
quoted
I don't think the patch should cause significantly worse latency,

quoted
quoted
quoted
quoted
quoted
quoted
(it's supposed to be a bit better if anything because it doesn't set

quoted
quoted
quoted
quoted
quoted
quoted
up the full interrupt frame). But it's possible.

quoted
quoted
quoted
quoted
quoted
My working theory is that the SPI_DON indication is all about the TX

quoted

direction an now that the interrupts are faster we're hitting an error

quoted
quoted
quoted
quoted
quoted
because there is still RX activity going on. Heiner disagrees with my

quoted
quoted
quoted
quoted
quoted
interpretation of the SPI_DON indication and the fact that it doesn't

quoted
quoted
quoted
quoted
quoted
happen every time does throw doubt on it.

quoted
quoted
quoted
quoted
quoted

quoted

It's right that the eSPI spec can be interpreted that SPI_DON refers to

quoted

TX only. However this wouldn't really make sense, because also for RX

quoted

we program the frame length, and therefore want to be notified once the

quoted

full frame was received. Also practical experience shows that SPI_DON

quoted
quoted
quoted
quoted
is set also after RX-only transfers.

quoted

Typical SPI NOR use case is that you write read command + start address,

quoted

followed by a longer read. If the TX-only interpretation would be right,

quoted
quoted
quoted
quoted
we'd always end up with SPI_DON not being set.

quoted
quoted
quoted
quoted

quoted

I can't really explain the extra RX byte in the fifo. We know how many

quoted
quoted
quoted
quoted
quoted
bytes to expect and we pull that many from the fifo so it's not as if

quoted

we're missing an interrupt causing us to skip the last byte. I've been

quoted
quoted
quoted
quoted
quoted
looking for some kind of off-by-one calculation but again if it were

quoted
quoted
quoted
quoted
quoted
something like that it'd happen all the time.

quoted
quoted
quoted
quoted
quoted

quoted

Maybe it helps to know what value this extra byte in the FIFO has. Is it:

quoted
quoted
quoted
quoted
- a duplicate of the last read byte

quoted
quoted
quoted
quoted
- or the next byte (at <end address> + 1)

quoted
quoted
quoted
quoted
- or a fixed value, e.g. always 0x00 or 0xff

quoted

The values were up thread a bit but I'll repeat them here

quoted
quoted
quoted

quoted
quoted
quoted
fsl_espi ffe110000.spi: tx 70

quoted
quoted
quoted
fsl_espi ffe110000.spi: rx 03

quoted
quoted
quoted
fsl_espi ffe110000.spi: Extra RX 00

quoted

fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!

quoted

fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!

quoted

fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32

quoted
quoted
quoted
fsl_espi ffe110000.spi: tx 05

quoted
quoted
quoted
fsl_espi ffe110000.spi: rx 00

quoted
quoted
quoted
fsl_espi ffe110000.spi: Extra RX 03

quoted

fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!

quoted

fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!

quoted

fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32

quoted
quoted
quoted
fsl_espi ffe110000.spi: tx 05

quoted
quoted
quoted
fsl_espi ffe110000.spi: rx 00

quoted
quoted
quoted
fsl_espi ffe110000.spi: Extra RX 03

quoted
quoted
quoted

quoted

The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with

quoted

either a READ_SR or a READ_FSR. Never a data read.

quoted
quoted
quoted

quoted

Just remembered something about SPIE_DON:

quoted

Transfers are always full duplex, therefore in case of a read the chip

quoted

sends dummy zero's. Having said that in case of a read SPIE_DON means

quoted

that the last dummy zero was shifted out.

quoted
quoted

quoted

READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in.

quoted

So the issue may have a dependency on the length of the transfer.

quoted

However I see no good explanation so far. You can try adding a delay of

quoted

a few miroseconds between the following to commands in fsl_espi_bufs().

quoted
quoted

quoted

    fsl_espi_write_reg(espi, ESPI_SPIM, mask);

quoted
quoted

quoted

    /* Prevent filling the fifo from getting interrupted */

quoted

    spin_lock_irq(&espi->lock);

quoted
quoted

quoted

Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close.

quoted

I think this might be heading in the right direction. Playing about with

quoted

a delay does seem to make the two symptoms less likely. Although I have

quoted

to set it quite high (i.e. msleep(100)) to completely avoid any

quoted

possibility of seeing either message.

The patch might replay the interrupt a little bit faster, but it would

be a few microseconds at most I think (just from improved code).

Would you be able to ftrace the interrupt handler function and see if you

can see a difference in number or timing of interrupts? I'm at a bit of

a loss.



I tried ftrace but I really wasn't sure what I was looking for.

Capturing a "bad" case was pretty tricky. But I think I've identified a

fix (I'll send it as a proper patch shortly). The gist is

diff --git a/drivers/spi/spi-fsl-espi.c b/drivers/spi/spi-fsl-espi.c

index 7e7c92cafdbb..cb120b68c0e2 100644

--- a/drivers/spi/spi-fsl-espi.c

+++ b/drivers/spi/spi-fsl-espi.c

@@ -574,13 +574,14 @@ static void fsl_espi_cpu_irq(struct fsl_espi

*espi, u32 events)

  static irqreturn_t fsl_espi_irq(s32 irq, void *context_data)

  {

         struct fsl_espi *espi = context_data;

-       u32 events;

+       u32 events, mask;



         spin_lock(&espi->lock);



         /* Get interrupt events(tx/rx) */

         events = fsl_espi_read_reg(espi, ESPI_SPIE);

-       if (!events) {

+       mask = fsl_espi_read_reg(espi, ESPI_SPIM);

+       if (!(events & mask)) {

                 spin_unlock(&espi->lock);

                 return IRQ_NONE;

         }



The SPIE register contains the TXCNT so events is pretty much always

going to have something set. By checking events against what we've

actually requested interrupts for we don't see any spurious events.


Usually we shouldn’t receive interrupts we’re not interested in, except the interrupt is shared.
My theory is that we get an interrupt to the core for RXT and another for DON. With the changes to the powerpc interrupt handling and the fact that fsl_espi_cpu_irq() doesn't actually look at the specific event bits means that sometimes both events are handled in the processing of the first interrupt and the second one trips the SPI_DON not set error.

There's an old comment "SPI bus sometimes got lost interrupts..." which makes me wonder if the interrupt handling change has fixed this original problem.

I still think the change makes sense regardless because the SPIE register has some counter fields in it.

This leads to the question: is the SPI interrupt shared with another device on your system?
It's not shared on either the custom board or the T2080RDB.
Do you see spurious interrupts with the patch under /proc/irq/(irq)/spurious?

Yes it looks like it

[root@linuxbox ~]# cat /proc/irq/53/spurious
count 3126
unhandled 0
last_unhandled 0 ms

[root@linuxbox ~]# /flash/dd_test.sh

[root@linuxbox ~]# cat /proc/irq/53/spurious
count 1016
unhandled 0
last_unhandled 4294746100 ms

[root@linuxbox ~]# /flash/dd_test.sh

[root@linuxbox ~]# cat /proc/irq/53/spurious
count 88391
unhandled 0
last_unhandled 4294746100 ms

[root@linuxbox ~]# /flash/dd_test.sh

[root@linuxbox ~]# cat /proc/irq/53/spurious
count 72459
unhandled 2
last_unhandled 4294758632 ms



I've tested this on the T2080RDB and on our custom hardware and it seems

to resolve the problem.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help