Thread (40 messages) 40 messages, 4 authors, 2021-03-17

Re: [PATCH] mm,hwpoison: return -EBUSY when page already poisoned

From: Aili Yao <hidden>
Date: 2021-03-03 15:05:17
Also in: linux-mm, lkml

Possibly related (same subject, not in this thread)

Hi tony:
On Tue, 2 Mar 2021 19:39:53 -0800
"Luck, Tony" [off-list ref] wrote:
quoted
On Fri, Feb 26, 2021 at 10:59:15AM +0800, Aili Yao wrote:  
quoted
Hi naoya, tony:    
quoted
quoted
Idea for what we should do next ... Now that x86 is calling memory_failure()
from user context ... maybe parallel calls for the same page should
be blocked until the first caller completes so we can:
a) know that pages are unmapped (if that happens)
b) all get the same success/fail status      
One memory_failure() call changes the target page's status and
affects all mappings to all affected processes, so I think that
(ideally) we don't have to block other threads (letting them
early return seems fine).  Sometimes memory_failure() fails,
but even in such case, PG_hwpoison is set on the page and other
threads properly get SIGBUSs with this patch, so I think that
we can avoid the worst scenario (like system stall by MCE loop).
    
I agree with naoya's point, if we block for this issue, Does this change the result
that the process should be killed? Or is there something other still need to be considered?    
Ok ... no blocking ... 
I do think about blocking method and the error address issue with sigbus,here is my opinion, maybe helpful:

For blocking, if we block here, there are some undefine work i think should be done.
As we don't know the process B triggering this error again is early-kill or not, so the previous memory_failure() call may 
not add B on kill_list, even if B is on kill_list, the error level for B is not proper set, as B should get an AR SIGBUS;

So we can't just wait, We must have some logic adding the process B to kill list, and as this is an AR error
another change should be done to current code, we need more logic in kill_proc or some other place.

Even if all the work is done right. There is one more serious scenario though, we even don't know the current step the previous memory_failure() is on,
So previous modification may not be usefull at all; When this scenario happens, what we can do?  block or return ?
if finally we return, an error code should be taken back; so we have to go to error process logic and a signal without right address will be sent;

For error address with sigbus, i think this is not an issue resulted by the patch i post, before my patch, the issue is already there.
I don't find a realizable way to get the correct address for same reason --- we don't know whether the page mapping is there or not when
we got to kill_me_maybe(), in some case, we may get it, but there are a lot of parallel issue need to consider, and if failed we have to fallback
to the error brach again, remaining current code may be an easy option;

Any methods or patchs can solve the issue in a better way is OK to me, i want this issue fixed and in more complete way!

-- 
Thanks!
Aili Yao
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help