Re: [general question] rare silent data corruption when writing data

From: Michal Soltys <hidden>
Date: 2020-05-11 09:41:25

On 5/10/20 9:12 PM, Sarah Newman wrote:

On 5/10/20 12:05 PM, Sarah Newman wrote:

On 5/7/20 8:44 PM, Chris Murphy wrote:

I would change very little until you track this down, if the goal is
to track it down and get it fixed.

I'm not sure whether LVM thinp is supported on top of LVM raid yet; if
it's not, then I can understand using mdadm raid5 instead of LVM
raid5.


My apologies if this idea was considered and discarded already, but
the bug being hard to reproduce right after reboot and the error being
exactly the size of a page sounds like a use-after-free memory bug or
similar.

A debug kernel build with one or more of these options may find the 
problem:

CONFIG_DEBUG_PAGEALLOC
CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT
CONFIG_PAGE_POISONING + page_poison=1
CONFIG_KASAN

--Sarah

And on further reflection you may as well add these:

CONFIG_DEBUG_OBJECTS
CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT
CONFIG_CRASH_DUMP (kdump)

+ anything else available. Basically turn debugging on all the way.

If you can reproduce reliably with these, then you can try the latest 
kernel with the same options and have some confidence the problem was 
legitimately fixed.
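
Before trying to reproduce, it may be worth confirming that the rebuilt kernel really has the options named above switched on. The following is a minimal sketch, not a definitive tool: the helper names are hypothetical, the option list is only the one from this thread, and the usual locations of the config file (/boot/config-<version> or /proc/config.gz) and of the boot command line (/proc/cmdline) are assumptions that vary by distribution and kernel build.

```python
import re

# Options named in this thread; illustrative, not an exhaustive debug set.
WANTED = [
    "CONFIG_DEBUG_PAGEALLOC",
    "CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT",
    "CONFIG_PAGE_POISONING",
    "CONFIG_KASAN",
    "CONFIG_DEBUG_OBJECTS",
    "CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT",
    "CONFIG_CRASH_DUMP",
]


def enabled_options(config_text):
    """Return the set of CONFIG_* symbols set to y or m in a kernel .config."""
    enabled = set()
    for line in config_text.splitlines():
        # Disabled symbols appear as "# CONFIG_FOO is not set" and do not match.
        m = re.match(r"^(CONFIG_\w+)=([ym])$", line.strip())
        if m:
            enabled.add(m.group(1))
    return enabled


def missing_options(config_text, wanted=WANTED):
    """List the wanted symbols that are not enabled in the given config text."""
    have = enabled_options(config_text)
    return [opt for opt in wanted if opt not in have]


def page_poison_on(cmdline):
    """True if page_poison=1 is present on the kernel command line."""
    return "page_poison=1" in cmdline.split()
```

To use it, read the config file and /proc/cmdline into strings and pass them to missing_options() and page_poison_on(); an empty list from the former means every option listed in WANTED is enabled.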

After compiling the kernel with the above options enabled, and if this
is the underlying issue as you suspect, will it just pop up in dmesg
when I hit this bug, or do I need some extra tools/preparation/etc.?
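
As a rough illustration of what "popping up in dmesg" could look like: these debug options generally report through the kernel log, while kdump additionally needs its own setup (a crashkernel= memory reservation and the userspace kdump tooling). The sketch below scans saved kernel log text for a few report prefixes. The patterns are assumptions based on commonly seen messages and may differ between kernel versions, and the function name is hypothetical.

```python
import re

# Report prefixes commonly seen in the kernel log; treat these as assumptions,
# since the exact wording can change between kernel versions.
SIGNATURES = {
    "KASAN": re.compile(r"BUG: KASAN: "),
    "page poisoning": re.compile(r"pagealloc: (single bit error|memory corruption)"),
    "debug objects": re.compile(r"ODEBUG: "),
    "page fault oops": re.compile(
        r"BUG: unable to handle (page fault|kernel paging request)"
    ),
}


def scan_log(text):
    """Return (signature name, line number, line) for each matching log line."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), 1):
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits.append((name, lineno, line.strip()))
    return hits
```

Feeding it the saved output of dmesg (or the persistent kernel log, since a crash may take the live buffer with it) gives a quick list of lines worth reading in full; an empty result does not prove the bug was not hit.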
