Re: fsl_espi errors on v5.7.15

From: Chris Packham <Chris.Packham@alliedtelesis.co.nz>
Date: 2020-08-27 22:07:25
Also in: linux-spi, lkml

On 27/08/20 7:12 pm, Nicholas Piggin wrote:

Excerpts from Heiner Kallweit's message of August 26, 2020 4:38 pm:

quoted

On 26.08.2020 08:07, Chris Packham wrote:

quoted

On 26/08/20 1:48 pm, Chris Packham wrote:

quoted

On 26/08/20 10:22 am, Chris Packham wrote:

quoted

On 25/08/20 7:22 pm, Heiner Kallweit wrote:

<snip>

quoted

I've been staring at spi-fsl-espi.c for while now and I think I've

quoted

identified a couple of deficiencies that may or may not be related
to my
issue.

First I think the 'Transfer done but SPIE_DON isn't set' message
can be
generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE
register.
We also write back to it to clear the current events. We re-read it in
fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
naturally end up in that situation if we're doing a large read.
Consider
the messages for reading a block of data from a spi-nor chip

    tx = READ_OP + ADDR
    rx = data

We setup the transfer and pump out the tx_buf. The first interrupt
goes
off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
clear ESPI_SPIE and wait for the next interrupt. The next interrupt
fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
continues until we've received all the data and we finish with
ESPI_SPIE
having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
isn't set.

The other deficiency is that we only get an interrupt when the
amount of
data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
FSL_ESPI_RXTHR left to be received we will never pull them out of
the fifo.

SPIM_DON will trigger an interrupt once the last characters have been
transferred, and read the remaining characters from the FIFO.

The T2080RM that I have says the following about the DON bit

"Last character was transmitted. The last character was transmitted
and a new command can be written for the next frame."

That does at least seem to fit with my assertion that it's all about
the TX direction. But the fact that it doesn't happen all the time
throws some doubt on it.

quoted

I think the reason I'm seeing some variability is because of how fast

quoted

(or slow) the interrupts get processed and how fast the spi-nor
chip can
fill the CPUs rx fifo.

To rule out timing issues at high bus frequencies I initially asked
for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
or even less, then timing shouldn't be an issue.

Yes I've currently got spi-max-frequency = <1000000>; in my dts. I
would also expect a slower frequency would fit my "DON is for TX"
narrative.

quoted

Last relevant functional changes have been done almost 4 years ago.
And yours is the first such report I see. So question is what could
be so
special with your setup that it seems you're the only one being
affected.
The scenarios you describe are standard, therefore much more people
should be affected in case of a driver bug.

Agreed. But even on my hardware (which may have a latent issue
despite being in the field for going on 5 years) the issue only
triggers under some fairly specific circumstances.

quoted

You said that kernel config impacts how frequently the issue happens.
Therefore question is what's the diff in kernel config, and how could
the differences be related to SPI.

It did seem to be somewhat random. Things like CONFIG_PREEMPT have an
impact but every time I found something that seemed to be having an
impact I've been able to disprove it. I actually think its about how
busy the system is which may or may not affect when we get round to
processing the interrupts.

I have managed to get the 'Transfer done but SPIE_DON isn't set!' to
occur on the T2080RDB.

I've had to add the following to expose the environment as a mtd
partition

diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi

b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
index ff87e67c70da..fbf95fc1fd68 100644

--- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
+++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi

@@ -116,6 +116,15 @@ flash@0 {
                                 compatible = "micron,n25q512ax3",

"jedec,spi-nor";
                                 reg = <0>;
                                 spi-max-frequency = <10000000>; /*
input clock */
+
+                               partition@u-boot {
+                                        reg = <0x00000000 0x00100000>;
+                                        label = "u-boot";
+                                };
+                                partition@u-boot-env {
+                                        reg = <0x00100000 0x00010000>;
+                                        label = "u-boot-env";
+                                };
                         };
                 };

And I'm using the following script to poke at the environment
(warning if anyone does try this and the bug hits it can render your
u-boot environment invalid).

cat flash/fw_env_test.sh
#!/bin/sh

generate_fw_env_config()
{
   cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize
name ; do
      echo "$dev $size $erasesize $name"
      [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000
$erasesize" >/flash/fw_env.config
   done
}

cycles=10
[ $# -ge 1 ] && cycles=$1

generate_fw_env_config

fw_printenv -c /flash/fw_env.config

dmesg -c >/dev/null
x=0
while [ $x -lt $cycles ]; do
     fw_printenv -c /flash/fw_env.config >/dev/null || break
     fw_setenv -c /flash/fw_env.config foo $RANDOM || break;
     dmesg -c | grep -q fsl_espi && break;
     let x=x+1
done

echo "Ran $x cycles"

I've also now seen the RX FIFO not empty error on the T2080RDB

fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32

With my current workaround of emptying the RX FIFO. It seems
survivable. Interestingly it only ever seems to be 1 extra byte in the
RX FIFO and it seems to be after either a READ_SR or a READ_FSR.

fsl_espi ffe110000.spi: tx 70
fsl_espi ffe110000.spi: rx 03
fsl_espi ffe110000.spi: Extra RX 00
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
fsl_espi ffe110000.spi: tx 05
fsl_espi ffe110000.spi: rx 00
fsl_espi ffe110000.spi: Extra RX 03
fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
fsl_espi ffe110000.spi: tx 05
fsl_espi ffe110000.spi: rx 00
fsl_espi ffe110000.spi: Extra RX 03

 From all the Micron SPI-NOR datasheets I've got access to it is
possible to continually read the SR/FSR. But I've no idea why it
happens some times and not others.

So I think I've got a reproduction and I think I've bisected the problem
to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
C"). My day is just finishing now so I haven't applied too much scrutiny
to this result. Given the various rabbit holes I've been down on this
issue already I'd take this information with a good degree of skepticism.

OK, so an easy test should be to re-test with a 5.4 kernel.
It doesn't have yet the change you're referring to, and the fsl-espi driver
is basically the same as in 5.7 (just two small changes in 5.7).

There's 6cc0c16d82f88 and maybe also other interrupt related patches
around this time that could affect book E, so it's good if that exact
patch is confirmed.

My confirmation is basically that I can induce the issue in a 5.4 kernel 
by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in 
5.9-rc2 by reverting that one commit.

I both cases it's not exactly a clean cherry-pick/revert so I also 
confirmed the bisection result by building at 3282a3da25bd (which sees 
the issue) and the commit just before (which does not).

quoted hunk ↗ jump to hunk

I've been staring at 3282a3da25bd for a while and nothing immediately
stands out. It doesn't look like the low level handlers do anything
special (well 0x900 does ack the decrementer, but so does the masked
handler).

Can you try this patch and also enable CONFIG_PPC_IRQ_SOFT_MASK_DEBUG?

Thanks,
Nick

---

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index bf21ebd36190..10d339042330 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c

@@ -214,7 +214,7 @@ void replay_soft_interrupts(void)
  	struct pt_regs regs;
  
  	ppc_save_regs(&regs);
-	regs.softe = IRQS_ALL_DISABLED;
+	regs.softe = IRQS_ENABLED;
  
  again:
  	if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))

@@ -349,6 +349,7 @@ notrace void arch_local_irq_restore(unsigned long mask)
  		if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
  			WARN_ON_ONCE(!(mfmsr() & MSR_EE));
  		__hard_irq_disable();
+		local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
  	} else {
  		/*
  		 * We should already be hard disabled here. We had bugs

@@ -368,6 +369,7 @@ notrace void arch_local_irq_restore(unsigned long mask)
  		}
  	}
  
+	preempt_disable();
  	irq_soft_mask_set(IRQS_ALL_DISABLED);
  	trace_hardirqs_off();

@@ -377,6 +379,7 @@ notrace void arch_local_irq_restore(unsigned long mask)
  	trace_hardirqs_on();
  	irq_soft_mask_set(IRQS_ENABLED);
  	__hard_irq_enable();
+	preempt_enable();
  }
  EXPORT_SYMBOL(arch_local_irq_restore);

I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG 
didn't report anything (either with or without the change above).

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help