Re: [PATCH v1 6/6] mm/hwpoison: fix unpoison_memory()
From: Ding Hui <hidden>
Date: 2021-06-19 12:23:45
Also in:
lkml
On 2021/6/18 4:36 下午, HORIGUCHI NAOYA(堀口 直也) wrote:
On Thu, Jun 17, 2021 at 06:00:21PM +0800, Ding Hui wrote:quoted
On 2021/6/14 10:12, Naoya Horiguchi wrote:quoted
From: Naoya Horiguchi <redacted> After recent soft-offline rework, error pages can be taken off from buddy allocator, but the existing unpoison_memory() does not properly undo the operation. Moreover, due to the recent change on __get_hwpoison_page(), get_page_unless_zero() is hardly called for hwpoisoned pages. So __get_hwpoison_page() mostly returns zero (meaning to fail to grab page refcount) and unpoison just clears PG_hwpoison without releasing a refcount. That does not lead to a critical issue like kernel panic, but unpoisoned pages never get back to buddy (leaked permanently), which is not good.As I mention in [1], I'm not sure about the exactly meaning of "broken" in unpoison_memory(). Maybe the misunderstanding is: I think __get_hwpoison_page() mostly returns one for hwpoisoned page. In 06be6ff3d2ec ("mm,hwpoison: rework soft offline for free pages"), page_handle_poison() is introduced, it will add refcount for all soft-offlineed hwpoison page. In memory_failure() for hard-offline,page_ref_inc() called on free page too, and for used page, we do not call put_page() after get_hwpoison_page() != 0. So all hwpoisoned page refcount must be great than zero when unpoison_memory() if regardless of racy.Hi, Ding, Thanks for the comment. I feel that I failed to define the exact issue in unpoison. Maybe I saw and misinterpreted some random error as unpoison's issue during developing other hwpoison patches, so please don't take serious my previous wrong word "broken", sorry about that. Anyway I reconsider how to handle this 6/6, maybe it will be a clear description of the problem, and will be simplified.quoted
Recently I tested loop soft-offline random pages and unpoison them for days, it works fine to me. (with bac9c6fa1f92 patched)Thank you for testing,
Hi Naoya, I'm afraid of my description about testing is ambiguous for others, let me clarify that I ran stress soft-offline test case from mce-test project (https://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git) for days to verify my modify about NR_FREE_PAGES (bac9c6fa1f92), without your current patchset, the case is loop soft-offline random pages and unpoison them, and it works basic fine to me. -- Thanks, -dinghui