Re: MCE handler gets NIP wrong on MPC8378

From: Radu Rendec <hidden>
Date: 2020-02-20 17:36:51

On 02/20/2020 at 11:25 AM Christophe Leroy [off-list ref] wrote:

On 20/02/2020 at 17:02, Radu Rendec wrote:

On 02/20/2020 at 3:38 AM Christophe Leroy [off-list ref] wrote:

On 02/19/2020 10:39 PM, Radu Rendec wrote:

On 02/19/2020 at 4:21 PM Christophe Leroy [off-list ref] wrote:

Interesting.

0x900 is the address of the timer interrupt.

Would the MCE occur just after the timer interrupt?

I doubt that. I'm using a small test module to artificially trigger the
MCE. Basically it's just this (the full code is in my original post):

          bad_addr_base = ioremap(0xf0000000, 0x100);
          x = ioread32(bad_addr_base);

I find it hard to believe that every time I load the module the lwbrx
instruction that triggers the MCE is executed exactly after the timer
interrupt (or that the timer interrupt always occurs close to the lwbrx
instruction).

Can you try to see how much time there is between your read and the MCE?
The below should allow it: you'll see the first value in r13 and the
second in r14 (mce.c is your test code).

Also, please provide the timebase frequency as reported in /proc/cpuinfo.

I just ran a test: r13 is 0xda8e0f91 and r14 is 0xdaae0f9c.

# cat /proc/cpuinfo
processor       : 0
cpu             : e300c4
clock           : 800.000004MHz
revision        : 1.1 (pvr 8086 1011)
bogomips        : 200.00
timebase        : 100000000

The difference between r14 and r13 is 0x20000b, i.e. 2097163 ticks. Assuming
TB is incremented at the 'timebase' frequency, that is 20.97 milliseconds
(although the e300 manual says TB is "incremented once every four core
input clock cycles").
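
As a quick check of the arithmetic above, here is a minimal sketch in C. It assumes the timebase really ticks at the 'timebase' rate reported in /proc/cpuinfo. The helper names tb_delta and tb_ticks_to_us are made up for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: difference between two 32-bit timebase (TBL) samples.
 * Unsigned subtraction still gives the right result if TBL wrapped once. */
static uint32_t tb_delta(uint32_t start, uint32_t end)
{
	return end - start;
}

/* Hypothetical helper: convert timebase ticks to whole microseconds,
 * assuming the timebase increments at tb_hz (the "timebase" value from
 * /proc/cpuinfo). The result is truncated, not rounded. */
static uint64_t tb_ticks_to_us(uint32_t ticks, uint32_t tb_hz)
{
	assert(tb_hz != 0);
	return (uint64_t)ticks * 1000000u / tb_hz;
}
```

With r13 = 0xda8e0f91 and r14 = 0xdaae0f9c, tb_delta() gives 0x20000b ticks, and tb_ticks_to_us() at 100000000 Hz gives 20971 microseconds. If TB actually ticks at a different rate, only the tb_hz argument changes.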

I wouldn't be surprised if the internal CPU clock were twice the input
clock.

So that's long enough to be sure of getting a timer interrupt during
every bad access.
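
To put a number on that, the sketch below counts how many complete timer periods fit in the measured window. It is only an illustration: the tick rate (CONFIG_HZ) is not stated anywhere in this thread, so hz is an assumed parameter, the tick is assumed to be periodic, and full_tick_periods is a made-up name.

```c
#include <stdint.h>

/* Number of complete timer periods that fit in a window of window_us
 * microseconds, assuming a periodic tick at hz interrupts per second.
 * hz is an assumption: the thread does not give the CONFIG_HZ value. */
static uint32_t full_tick_periods(uint32_t window_us, uint32_t hz)
{
	return (uint32_t)((uint64_t)window_us * hz / 1000000u);
}
```

For a window of 20971 microseconds this gives 2 periods at an assumed HZ of 100, 5 at 250 and 20 at 1000. Any result of at least 1 means a tick must fall somewhere inside the window.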

Now we have to understand why SRR0 contains the address of the timer
exception entry and not the address of the bad access.

The value of SRR1 confirms that it comes from 0x900 as MSR[IR] and [DR]
are cleared when interrupts are enabled.
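
For reference, a small sketch of how the translation bits can be read out of a saved SRR1 value. The bit masks match the 32-bit PowerPC MSR layout (MSR[IR] = 0x20, MSR[DR] = 0x10). The function name and the sample values in the note after the code are hypothetical, since the actual SRR1 value is not quoted in this message.

```c
#include <stdbool.h>
#include <stdint.h>

/* MSR bit masks for 32-bit PowerPC, as saved in SRR1 on exception entry. */
#define MSR_IR 0x00000020u /* instruction address translation enabled */
#define MSR_DR 0x00000010u /* data address translation enabled */

/* Hypothetical helper: true if the interrupted context was running with
 * both instruction and data translation off (real mode), which is the
 * state the CPU is in right after taking an exception such as 0x900. */
static bool srr1_translation_off(uint32_t srr1)
{
	return (srr1 & (MSR_IR | MSR_DR)) == 0;
}
```

A saved value such as 0x00001000 (only MSR[ME] set) would return true, while a typical kernel-mode value such as 0x00009032 would return false because both IR and DR are set.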

Maybe you should file a support case at NXP. They are usually quite
professional at responding.

I already did (quite some time ago), but it started off as "why does the
MCE occur in the first place". That part has already been figured out,
but unfortunately I don't have a viable solution to it. Like you said,
now the focus has shifted to understanding why the SRR0 value is not
what we expect.

I asked them the question about SRR0 as soon as you helped me get back
on track and figured out there's nothing wrong with the Linux MCE
handler and the NIP value comes from SRR0. What they came up with is
basically this paragraph in the e300 core manual (section 5.5.2):

| Note that the e300 core makes no attempt to force recoverability on a
| machine check; however, it does guarantee that the machine check
| interrupt is always taken immediately upon request, with a nonpredicted
| address saved in SRR0, regardless of the current machine state.

... and with an emphasis on "nonpredicted". To be honest, I am a bit
disappointed with their response. I believe that in this context
"nonpredicted" means that the address saved to SRR0 is a "real" address
rather than the result of branch prediction. The support folks were
probably thinking "unpredictable". But that's another word and the
difference is quite subtle :)

I updated the case and added information about the interrupts and the
timing. Let's see what they come up with this time.

Best regards,
Radu
