Thread (31 messages) 31 messages, 4 authors, 2020-09-07

Re: fsl_espi errors on v5.7.15

From: Nicholas Piggin <npiggin@gmail.com>
Date: 2020-09-01 06:14:26
Also in: linuxppc-dev, lkml

Excerpts from Chris Packham's message of September 1, 2020 11:25 am:
On 1/09/20 12:33 am, Heiner Kallweit wrote:
quoted
On 30.08.2020 23:59, Chris Packham wrote:
quoted
On 31/08/20 9:41 am, Heiner Kallweit wrote:
quoted
On 30.08.2020 23:00, Chris Packham wrote:
quoted
On 31/08/20 12:30 am, Nicholas Piggin wrote:
quoted
Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
<snip>
quoted
quoted
quoted
quoted
quoted
quoted
I've also now seen the RX FIFO not empty error on the T2080RDB

fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32

With my current workaround of emptying the RX FIFO. It seems
survivable. Interestingly it only ever seems to be 1 extra byte in the
RX FIFO and it seems to be after either a READ_SR or a READ_FSR.

fsl_espi ffe110000.spi: tx 70
fsl_espi ffe110000.spi: rx 03
fsl_espi ffe110000.spi: Extra RX 00
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
fsl_espi ffe110000.spi: tx 05
fsl_espi ffe110000.spi: rx 00
fsl_espi ffe110000.spi: Extra RX 03
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
fsl_espi ffe110000.spi: tx 05
fsl_espi ffe110000.spi: rx 00
fsl_espi ffe110000.spi: Extra RX 03

    From all the Micron SPI-NOR datasheets I've got access to it is
possible to continually read the SR/FSR. But I've no idea why it
happens some times and not others.
So I think I've got a reproduction and I think I've bisected the problem
to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
C"). My day is just finishing now so I haven't applied too much scrutiny
to this result. Given the various rabbit holes I've been down on this
issue already I'd take this information with a good degree of skepticism.
OK, so an easy test should be to re-test with a 5.4 kernel.
It doesn't have yet the change you're referring to, and the fsl-espi driver
is basically the same as in 5.7 (just two small changes in 5.7).
There's 6cc0c16d82f88 and maybe also other interrupt related patches
around this time that could affect book E, so it's good if that exact
patch is confirmed.
My confirmation is basically that I can induce the issue in a 5.4 kernel
by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
5.9-rc2 by reverting that one commit.

I both cases it's not exactly a clean cherry-pick/revert so I also
confirmed the bisection result by building at 3282a3da25bd (which sees
the issue) and the commit just before (which does not).
Thanks for testing, that confirms it well.

[snip patch]
quoted
I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
didn't report anything (either with or without the change above).
Okay, it was a bit of a shot in the dark. I still can't see what
else has changed.

What would cause this, a lost interrupt? A spurious interrupt? Or
higher interrupt latency?

I don't think the patch should cause significantly worse latency,
(it's supposed to be a bit better if anything because it doesn't set
up the full interrupt frame). But it's possible.
My working theory is that the SPI_DON indication is all about the TX
direction an now that the interrupts are faster we're hitting an error
because there is still RX activity going on. Heiner disagrees with my
interpretation of the SPI_DON indication and the fact that it doesn't
happen every time does throw doubt on it.
It's right that the eSPI spec can be interpreted that SPI_DON refers to
TX only. However this wouldn't really make sense, because also for RX
we program the frame length, and therefore want to be notified once the
full frame was received. Also practical experience shows that SPI_DON
is set also after RX-only transfers.
Typical SPI NOR use case is that you write read command + start address,
followed by a longer read. If the TX-only interpretation would be right,
we'd always end up with SPI_DON not being set.
quoted
I can't really explain the extra RX byte in the fifo. We know how many
bytes to expect and we pull that many from the fifo so it's not as if
we're missing an interrupt causing us to skip the last byte. I've been
looking for some kind of off-by-one calculation but again if it were
something like that it'd happen all the time.
Maybe it helps to know what value this extra byte in the FIFO has. Is it:
- a duplicate of the last read byte
- or the next byte (at <end address> + 1)
- or a fixed value, e.g. always 0x00 or 0xff
The values were up thread a bit but I'll repeat them here

fsl_espi ffe110000.spi: tx 70
fsl_espi ffe110000.spi: rx 03
fsl_espi ffe110000.spi: Extra RX 00
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
fsl_espi ffe110000.spi: tx 05
fsl_espi ffe110000.spi: rx 00
fsl_espi ffe110000.spi: Extra RX 03
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
fsl_espi ffe110000.spi: tx 05
fsl_espi ffe110000.spi: rx 00
fsl_espi ffe110000.spi: Extra RX 03


The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with
either a READ_SR or a READ_FSR. Never a data read.
Just remembered something about SPIE_DON:
Transfers are always full duplex, therefore in case of a read the chip
sends dummy zero's. Having said that in case of a read SPIE_DON means
that the last dummy zero was shifted out.

READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in.
So the issue may have a dependency on the length of the transfer.
However I see no good explanation so far. You can try adding a delay of
a few miroseconds between the following to commands in fsl_espi_bufs().

	fsl_espi_write_reg(espi, ESPI_SPIM, mask);

	/* Prevent filling the fifo from getting interrupted */
	spin_lock_irq(&espi->lock);

Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close.
I think this might be heading in the right direction. Playing about with 
a delay does seem to make the two symptoms less likely. Although I have 
to set it quite high (i.e. msleep(100)) to completely avoid any 
possibility of seeing either message.
The patch might replay the interrupt a little bit faster, but it would 
be a few microseconds at most I think (just from improved code).

Would you be able to ftrace the interrupt handler function and see if you
can see a difference in number or timing of interrupts? I'm at a bit of 
a loss.

Thanks,
Nick
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help