RE: [PATCH 1/3] hv_utils: Add the support of hibernation
From: Dexuan Cui <decui@microsoft.com>
Date: 2019-09-19 06:34:47
Also in:
lkml
From: Vitaly Kuznetsov <vkuznets@redhat.com> Sent: Monday, September 16, 2019 1:46 AM Dexuan Cui [off-list ref] writes:quoted
quoted
From: Vitaly Kuznetsov <vkuznets@redhat.com> Sent: Thursday, September 12, 2019 9:37 AMquoted
quoted
+static int util_suspend(struct hv_device *dev) +{ + struct hv_util_service *srv = hv_get_drvdata(dev); + + if (srv->util_cancel_work) + srv->util_cancel_work(); + + vmbus_close(dev->channel);And what happens if you receive a real reply from userspace after you close the channel? You've only cancelled timeout work so the driver will not try to reply by itself but this doesn't mean we won't try to write to the channel on legitimate replies from daemons. I think you need to block all userspace requests (hang in kernel until util_resume()). While I'm not sure we can't get away without it but I'd suggest we add a new HVUTIL_SUSPENDED state to the hvutil state machine. VitalyWhen we reach util_suspend(), all the userspace processes have been frozen: see kernel/power/hibernate.c: hibernate() -> freeze_processes(), so here we can not receive a reply from the userspace daemon.Let's add a WARN() or something then as if this ever happens this is going to be realy tricky to catch.
Looking at the path hibernate() -> freeze_processes() -> try_to_freeze_tasks(true) -> freeze_task() -> fake_signal_wake_up(), I'm sure when try_to_freeze_tasks(true) returns 0, all the user-space processes must be frozen in do_signal() -> get_signal() -> try_to_freeze() -> ... -> __refrigerator(). hibernate () -> hibernation_snapshot () -> dpm_suspend() -> ... -> util_suspend() only runs after hibernate() -> freeze_processes(), so I'm pretty sure we're safe here.
quoted
However, it looks there is indeed some tricky corner cases we need to deal with: in util_resume(), before we call vmbus_open(), all the userspace processes are still frozen, and the userspace daemon (e.g. hv_kvp_daemon) can be in any of these states: 1. the driver has not buffered any message for the daemon. This is good. 2. the driver has buffered a message for the daemon, and kvp_transaction.state is HVUTIL_USERSPACE_REQ. Later, the kvp daemon writes the response to the driver, and in kvp_on_msg() kvp_transaction.state is moved to HVUTIL_USERSPACE_RECV, but since cancel_delayed_work_sync(&kvp_timeout_work) is false (the work has been cancelled by util_suspend()), the driver reports nothing to the host, which is good as the VM has undergone a hibernation event and IMO the response may be stale and I believe the host is not expecting this response from the VM at all (the host side application that requested the KVP must have failed or timed out), but the bad thing is: the "state" remains in HVUTIL_USERSPACE_RECV, preventing hv_kvp_onchannelcallback() from reading from the channel ringbuffer. I suggest we work around this race condition by the below patch:--- a/drivers/hv/hv_kvp.c +++ b/drivers/hv/hv_kvp.c@@ -250,8 +250,8 @@ static int kvp_on_msg(void *msg, int len) */ if (cancel_delayed_work_sync(&kvp_timeout_work)) { kvp_respond_to_host(message, error); - hv_poll_channel(kvp_transaction.recv_channel,kvp_poll_wrapper);quoted
} + hv_poll_channel(kvp_transaction.recv_channel, kvp_poll_wrapper); return 0; } How do you like this?Is it safe to call hv_poll_channel() simultaneously on several CPUs? It seems it is as we're doing smp_call_function_single(channel->target_cpu, cb, channel, true);
Looks safe to me. The code has been there for years. :-)
(I'm asking because if it's not, than doing what you suggest will open the following window: timeout function kvp_timeout_func() is already running but the daemon is trying to reply at the same moment).quoted
I can imagine there is still a small chance that the state machine can run out of order, and the kvp daemon may exit due to the return values of read()/write() being -EINVAL, but the chance should be small enough in practice, and IMO the issue even exists in the current code when hibernation is not involved, e.g. kvp_timeout_func() and kvp_on_msg() can run concurrently; if kvp_on_msg() runs slowly due to some reason (e.g. the kvp daemon is stopped as I'm gdb'ing it), kvp_timeout_func() fires and moves the state to HVUTIL_READY; later kvp_on_msg() starts to run and returns -EINVAL, and the kvp daemon will exit(). IMHO here it looks extremely difficult to make things flawless (if that's even possible), so it's necessary to ask the daemons to auto-restart once they exit() unexpectedly. This can be achieved by configuring systemd properly for the kvp/vss/fcopy services.I think we can also teach daemons to ignore -EINVAL or switch to something like -EAGAIN in non-fatal cases.
Maybe the driver should return 0 rather than -EINVAL for the kvp daemon in some scenarios, since kvp is never 100% reliable.
quoted
I'm not sure introducing a HVUTIL_SUSPENDED state would solve all of the corner cases, but I'm sure that would further complicate the current code, at least to me. :-)Maybe, if we don't need it than we don't. Basically, I see the only advantage in having such state: it makes our tricks to support hibernation more visible in the code. Vitaly
Let me think about this. BTW, for vss, maybe the VM should not hibernate if there is a backup ongoing? -- if the file system is frozen by hv_vss_daemon, and the VM hibernates, then when the VM resumes back, it's almost always true that the VM won't receive the host's VSS_OP_THAW request, and the VM will end up in an unusable state. Thanks, -- Dexuan