Thread: 51 messages, 10 authors, 2019-06-19

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

From: Borislav Petkov <bp@alien8.de>
Date: 2019-06-12 12:25:15
Also in: linux-edac, lkml

On Wed, Jun 12, 2019 at 09:57:40PM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2019-06-12 at 08:42 -0300, Mauro Carvalho Chehab wrote:
> > > Yes, we do have different error reporting facilities but I still think
> > > that concentrating all the error information needed in order to do
> > > proper recovery action is the better approach here. And make that part
> > > of the kernel so that it is robust. Userspace can still configure it and
> > > so on.
> >
> > If the error reporting facilities are for the same hardware "group"
> > (like the machine's memory controllers), I agree with you: it makes
> > sense to have a single driver.
> >
> > If they are for completely independent hardware then implementing
> > them as separate drivers would work equally well, with the advantage of
> > making it easier to maintain and making it generic enough to support
> > different vendors using the same IP block.
>
> Right. And if you really want a platform orchestrator for recovery in
> the kernel, it should be a separate one, that consumes data from the
> individual IP block drivers that report the raw errors anyway.

Yap, I think we're in agreement here. I believe the important question
is whether you need to get error information from multiple sources
together in order to do proper recovery or doing it per error source
suffices.
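
To make the two options concrete, here is a minimal userspace sketch in C, not kernel code. Every name in it (err_source, raw_error, per_source_decide, orchestrator_consume) is hypothetical, and the escalation rule (uncorrected errors seen on two distinct sources) is only an assumed example of a decision that needs information from more than one source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: none of these names are kernel or EDAC APIs. */
enum err_source { SRC_MC, SRC_L2_CACHE, SRC_PCIE, SRC_COUNT };
enum severity { SEV_CORRECTED, SEV_UNCORRECTED };
enum action { ACT_LOG, ACT_RETIRE_RESOURCE, ACT_ESCALATE };

/* What an individual IP block driver would report: the raw error only. */
struct raw_error {
    enum err_source src;
    enum severity sev;
};

/* Option 1: recovery decided per error source, from one report alone. */
static enum action per_source_decide(const struct raw_error *e)
{
    return e->sev == SEV_UNCORRECTED ? ACT_RETIRE_RESOURCE : ACT_LOG;
}

/* Option 2: a separate orchestrator that consumes reports from all drivers. */
struct orchestrator {
    unsigned int uncorrected[SRC_COUNT]; /* uncorrected errors seen per source */
};

static enum action orchestrator_consume(struct orchestrator *o,
                                        const struct raw_error *e)
{
    size_t distinct = 0;

    assert(e->src < SRC_COUNT);
    if (e->sev == SEV_UNCORRECTED)
        o->uncorrected[e->src]++;

    /* Count how many distinct sources have reported uncorrected errors. */
    for (size_t i = 0; i < SRC_COUNT; i++)
        if (o->uncorrected[i])
            distinct++;

    /* Assumed policy: two or more affected sources escalate every report. */
    if (distinct >= 2)
        return ACT_ESCALATE;

    /* Otherwise fall back to the per-source decision. */
    return per_source_decide(e);
}
```

In this sketch the orchestrator only differs from the per-source path once it has seen uncorrected errors from a second source. If real use cases never need that kind of correlation, the per-source function alone would suffice.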

And I think the actual use cases could/should dictate our
drivers/orchestrators design.

Thus my question: how are you guys planning on tying all that error info
the drivers report into the whole system design?

> But for the main case that really needs to be in the kernel, which is
> DRAM, the recovery can usually be contained to the MC driver anyway.

Right, if that is enough to handle the error properly.

The memory-failure.c example I gave before is the error reporting
mechanism (x86 MCA) calling into the mm subsystem to poison and isolate
page frames which are known to contain errors. So you have two things
talking to each other.
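
As a rough illustration of "two things talking to each other", the following is a userspace model in C, not the actual code in memory-failure.c. The names (page_model, mm_poison_page, error_source_report), the 8-page memory, and the 4 KiB page size are all assumptions made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model: these are not kernel APIs. */
#define NR_PAGES 8           /* assumed tiny "physical memory" */
#define PAGE_SHIFT_MODEL 12  /* assumed 4 KiB pages */

static_assert(NR_PAGES > 0, "model needs at least one page frame");

struct page_model {
    bool poisoned; /* frame is known to contain an error */
    bool mapped;   /* frame is still available for use */
};

static struct page_model pages[NR_PAGES];

/* Reset the model: all frames healthy and in use. */
static void mm_init_model(void)
{
    for (size_t i = 0; i < NR_PAGES; i++) {
        pages[i].poisoned = false;
        pages[i].mapped = true;
    }
}

/*
 * "mm side": poison and isolate one page frame.
 * Returns 0 on success, -1 if the frame number is out of range.
 */
static int mm_poison_page(unsigned long pfn)
{
    if (pfn >= NR_PAGES)
        return -1;
    pages[pfn].poisoned = true;
    pages[pfn].mapped = false;
    return 0;
}

/*
 * "error reporting side": turn the failing physical address into a
 * page frame number and hand it to the mm side.
 */
static int error_source_report(unsigned long long phys_addr)
{
    unsigned long pfn = (unsigned long)(phys_addr >> PAGE_SHIFT_MODEL);

    return mm_poison_page(pfn);
}
```

The reporting side knows only the failing address, and the mm side knows only how to poison and isolate a frame. The single call between them is the interface this example is about.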

-- 
Regards/Gruss,
    Boris.
