RE: [PATCH V4] powerpc/85xx: Add machine check handler to fix PCIe erratum on mpc85xx
From: Jia Hongtao-B38951 <hidden>
Date: 2013-03-06 08:29:06
-----Original Message-----
From: Wood Scott-B07421
Sent: Wednesday, March 06, 2013 2:48 AM
To: Jia Hongtao-B38951
Cc: Wood Scott-B07421; Stuart Yoder; linuxppc-dev@lists.ozlabs.org; Kumar Gala
Subject: Re: [PATCH V4] powerpc/85xx: Add machine check handler to fix PCIe erratum on mpc85xx

On 03/05/2013 04:12:30 AM, Jia Hongtao-B38951 wrote:
-----Original Message-----
From: Wood Scott-B07421
Sent: Tuesday, March 05, 2013 7:46 AM
To: Stuart Yoder
Cc: Jia Hongtao-B38951; linuxppc-dev@lists.ozlabs.org; Kumar Gala
Subject: Re: [PATCH V4] powerpc/85xx: Add machine check handler to fix PCIe erratum on mpc85xx

On 03/04/2013 10:16:10 AM, Stuart Yoder wrote:
On Mon, Mar 4, 2013 at 2:40 AM, Jia Hongtao [off-list ref] wrote:

A PCIe erratum of mpc85xx may cause a core hang when a PCIe link goes down. When the link goes down, non-posted transactions issued via the ATMU requiring completion result in an instruction stall. At the same time a machine-check exception is generated to the core to allow further processing by the handler. We implement the handler, which skips the instruction that caused the stall.

Can you explain at a high level how just skipping an instruction solves anything? If you just skip a load/store and continue like nothing is wrong, isn't your system possibly in a really bad state?

If the instruction was a load, we probably at least want to fill the destination register with 0xffffffff or similar.

You discussed this with Liu Shuo about a year ago. Here is the log: "

On 02/01/2012 02:18 AM, shuo.liu@freescale.com wrote:
v3 : Skip the instruction only. Don't access the user space memory in machine check.

It may be the least bad option for now, but be aware that there's a small chance that this will cause a leak of sensitive information (such as a piece of a crypto key that happened to be sitting in the register to be loaded into).

Yes, that's (one reason) why you'd want to fill in a known value. Note the "for now". :-)

-Scott
I think there is no overwhelming reason to fill the destination register with 0xffffffff. There's a small chance that 0xffffffff is treated as regular data rather than as an error indication. Also, setting this register may affect user space under certain circumstances. So I think just ignoring the skipped instruction is an acceptable option for this fix.

-Hongtao.
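
To make the two options in this thread concrete, here is a rough user-space sketch. It is not the code from the patch. struct fake_regs, load_fill_value(), skip_only(), skip_and_fill() and run_example() are hypothetical names made up for illustration, and only three D-form loads (lwz, lhz, lbz) are decoded. A real handler would have to cover more load forms and would work on the kernel's saved register state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for saved CPU state; NOT the kernel's struct pt_regs. */
struct fake_regs {
    uint32_t gpr[32];  /* general-purpose registers r0..r31 */
    uint32_t nip;      /* address of the instruction that stalled */
};

/* Primary opcodes (top 6 bits) of three D-form loads in the Power ISA. */
enum { OP_LWZ = 32, OP_LBZ = 34, OP_LHZ = 40 };

/*
 * If inst is one of the loads above, store an all-ones value of the
 * load's width in *fill and return 1. Otherwise return 0.
 */
int load_fill_value(uint32_t inst, uint32_t *fill)
{
    switch (inst >> 26) {
    case OP_LWZ: *fill = 0xffffffffu; return 1;
    case OP_LHZ: *fill = 0x0000ffffu; return 1;
    case OP_LBZ: *fill = 0x000000ffu; return 1;
    default:     return 0;
    }
}

/* Option 1: skip the instruction only. The destination register keeps
 * whatever value it held before the load. */
void skip_only(struct fake_regs *regs)
{
    regs->nip += 4;  /* PowerPC instructions are 4 bytes long */
}

/* Option 2: skip the instruction and, for a recognised load, overwrite
 * the destination register (RT field, bits 21..25) with a known value. */
void skip_and_fill(struct fake_regs *regs, uint32_t inst)
{
    uint32_t fill;

    if (load_fill_value(inst, &fill))
        regs->gpr[(inst >> 21) & 0x1f] = fill;
    regs->nip += 4;
}

/* Small walk-through of both options on hand-encoded instructions. */
void run_example(void)
{
    const uint32_t lwz_r3 = 0x80640000u;  /* lwz r3,0(r4) */
    const uint32_t stw_r3 = 0x90640000u;  /* stw r3,0(r4) */
    struct fake_regs a = { .nip = 0x1000 };
    struct fake_regs b = { .nip = 0x1000 };

    a.gpr[3] = b.gpr[3] = 0xdeadbeefu;    /* stale value left in r3 */

    skip_only(&a);                        /* r3 still holds the stale value */
    assert(a.nip == 0x1004 && a.gpr[3] == 0xdeadbeefu);

    skip_and_fill(&b, lwz_r3);            /* r3 replaced by all-ones */
    assert(b.nip == 0x1004 && b.gpr[3] == 0xffffffffu);

    skip_and_fill(&b, stw_r3);            /* stores leave registers alone */
    assert(b.nip == 0x1008 && b.gpr[3] == 0xffffffffu);
}
```

With skip_only() the stale 0xdeadbeef stays in r3, which is the leak scenario from the 2012 log. With skip_and_fill() the register holds all-ones, but as noted above the caller still cannot tell that apart from a device that really returned 0xffffffff.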