Re: [PATCH v5] mm/gup: check page hwposion status for coredump.
From: David Hildenbrand <hidden>
Date: 2021-03-26 14:24:03
Also in: lkml
On 26.03.21 15:09, David Hildenbrand wrote:
> On 22.03.21 12:33, Aili Yao wrote:
>> When we do coredump for user process signal, this may be one SIGBUS
>> signal with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means this
>> signal is resulted from ECC memory fail like SRAR or SRAO, we expect
>> the memory recovery work is finished correctly, then the
>> get_dump_page() will not return the error page as its process pte is
>> set invalid by memory_failure().
>>
>> But memory_failure() may fail, and the process's related pte may not
>> be correctly set invalid, for current code, we will return the poison
>> page, get it dumped, and then lead to system panic as its in kernel
>> code.
>>
>> So check the hwpoison status in get_dump_page(), and if TRUE, return
>> NULL.
>>
>> There maybe other scenario that is also better to check hwposion
>> status and not to panic, so make a wrapper for this check, Thanks to
>> David's suggestion([off-list ref]).
>>
>> Link: https://lkml.kernel.org/r/20210319104437.6f30e80d@alex-virtual-machine
>> Signed-off-by: Aili Yao <redacted>
>> Cc: David Hildenbrand <redacted>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Naoya Horiguchi <redacted>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Mike Kravetz <redacted>
>> Cc: Aili Yao <redacted>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>> ---
>>  mm/gup.c      |  4 ++++
>>  mm/internal.h | 20 ++++++++++++++++++++
>>  2 files changed, 24 insertions(+)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index e4c224c..6f7e1aa 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -1536,6 +1536,10 @@ struct page *get_dump_page(unsigned long addr)
>>  				      FOLL_FORCE | FOLL_DUMP | FOLL_GET);
>>  	if (locked)
>>  		mmap_read_unlock(mm);

Thinking again, wouldn't we get -EFAULT from __get_user_pages_locked()
when stumbling over a hwpoisoned page?

See __get_user_pages_locked()->__get_user_pages()->faultin_page():

handle_mm_fault()->vm_fault_to_errno(), which translates
VM_FAULT_HWPOISON to -EFAULT, unless FOLL_HWPOISON is set
(-> -EHWPOISON)?
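To make the translation step concrete, here is a rough userspace model of the mapping as I understand it. This is only a sketch under stated assumptions: the MODEL_* bit values and the model_vm_fault_to_errno() name are placeholders invented for illustration, not the kernel's definitions, and the OOM and SIGBUS/SIGSEGV branches come from my recollection of the helper rather than from this thread. Only the hwpoison branch is what matters for the question above.

```c
#include <errno.h>

/* Placeholder bit values, for illustration only. The kernel's real
 * VM_FAULT_* and FOLL_* definitions use different values. */
#define MODEL_VM_FAULT_OOM            0x01u
#define MODEL_VM_FAULT_SIGBUS         0x02u
#define MODEL_VM_FAULT_HWPOISON       0x04u
#define MODEL_VM_FAULT_HWPOISON_LARGE 0x08u
#define MODEL_VM_FAULT_SIGSEGV        0x10u

#define MODEL_FOLL_HWPOISON           0x100u

/* EHWPOISON is Linux-specific; 133 is its generic Linux value and is
 * only used here as a fallback when building on another host. */
#ifndef EHWPOISON
#define EHWPOISON 133
#endif

/* Model of the fault-code-to-errno translation: a hwpoison fault
 * becomes -EFAULT unless the caller asked for FOLL_HWPOISON, in which
 * case it becomes -EHWPOISON. Returns 0 when no error bit is set. */
static int model_vm_fault_to_errno(unsigned int vm_fault,
				   unsigned int foll_flags)
{
	if (vm_fault & MODEL_VM_FAULT_OOM)
		return -ENOMEM;
	if (vm_fault & (MODEL_VM_FAULT_HWPOISON | MODEL_VM_FAULT_HWPOISON_LARGE))
		return (foll_flags & MODEL_FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
	if (vm_fault & (MODEL_VM_FAULT_SIGBUS | MODEL_VM_FAULT_SIGSEGV))
		return -EFAULT;
	return 0;
}
```

Note that this path is only taken when the fault handler actually runs, i.e. when the PTE is not present. get_dump_page() does not pass FOLL_HWPOISON, so in this model it would see -EFAULT for a hwpoison fault.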
Or doesn't that happen as you describe "But memory_failure() may fail,
and the process's related pte may not be correctly set invalid" -- but
why does that happen?

On a similar thought, should get_user_pages() never return a page that
has HWPoison set? E.g., check also for existing PTEs if the page is
hwpoisoned?

@Naoya, Oscar

-- 
Thanks,

David / dhildenb