Thread (5 messages), 2 authors, 2019-06-28

Re: [PATCH] powerpc/rtas: Fix hang in race against concurrent cpu offline

From: Juliet Kim <hidden>
Date: 2019-06-26 21:51:01

On 6/25/19 12:29 PM, Nathan Lynch wrote:
Juliet Kim [off-list ref] writes:
quoted
The commit
(“powerpc/rtas: Fix a potential race between CPU-Offline & Migration”)
attempted to fix a hang in Live Partition Mobility (LPM) by abandoning
the LPM attempt if a race between LPM and concurrent CPU offline was
detected.

However, that fix failed to notify the hypervisor that the LPM attempt
had been abandoned, which results in a system hang.
It is surprising to me that leaving a migration unterminated would cause
Linux to hang. Can you explain more about how that happens?
PHYP will block further requests (next partition migration, DLPAR, etc.) while
it's in the suspending state. That would have a follow-on effect on the HMC and
potentially this and other partitions.
quoted
Fix this by sending a signal to PHYP to cancel the migration, so that PHYP
can stop waiting, and clean up the migration.
This is well-spotted and rtas_ibm_suspend_me() needs to signal
cancellation in several error paths. But I don't agree that this is one
of them: this race is going to be a temporary condition in any
production setting, and retrying would allow the migration to succeed.
If LPM and CPU offline requests conflict with one another, it might be better
to let them fail and let the customer decide which he prefers. IBM i cancels
migration if the other OS components/operations veto migration. It’s consistent
with other OS behavior for LPM. I think all the IBM products should have a
consistent customer experience. Even if the race can be temporary, it still
could happen and can cause livelock.