RE: [PATCH 1/3] hv_utils: Add the support of hibernation

From: Dexuan Cui <decui@microsoft.com>
Date: 2019-09-19 06:34:47
Also in: lkml

From: Vitaly Kuznetsov <vkuznets@redhat.com>
Sent: Monday, September 16, 2019 1:46 AM
Dexuan Cui [off-list ref] writes:

quoted

From: Vitaly Kuznetsov <vkuznets@redhat.com>
Sent: Thursday, September 12, 2019 9:37 AM

quoted

+static int util_suspend(struct hv_device *dev)
+{
+	struct hv_util_service *srv = hv_get_drvdata(dev);
+
+	if (srv->util_cancel_work)
+		srv->util_cancel_work();
+
+	vmbus_close(dev->channel);

And what happens if you receive a real reply from userspace after you
close the channel? You've only cancelled timeout work so the driver
will not try to reply by itself but this doesn't mean we won't try to
write to the channel on legitimate replies from daemons.

I think you need to block all userspace requests (hang in kernel until
util_resume()).

While I'm not sure we can't get away without it but I'd suggest we add a
new HVUTIL_SUSPENDED state to the hvutil state machine.
Vitaly

When we reach util_suspend(), all the userspace processes have been
frozen: see kernel/power/hibernate.c: hibernate() -> freeze_processes(),
so here we can not receive a reply from the userspace daemon.

Let's add a WARN() or something then as if this ever happens this is
going to be realy tricky to catch.

Looking at the path hibernate() -> freeze_processes() -> 
try_to_freeze_tasks(true) -> freeze_task() -> fake_signal_wake_up(), I'm
sure when try_to_freeze_tasks(true) returns 0, all the user-space processes
must be frozen in do_signal() -> get_signal() -> try_to_freeze() -> ... -> 
__refrigerator().

hibernate () -> hibernation_snapshot () -> dpm_suspend() -> ... -> 
util_suspend() only runs after hibernate() -> freeze_processes(), so I'm
pretty sure we're safe here.

quoted

However, it looks there is indeed some tricky corner cases we need to deal
with: in util_resume(), before we call vmbus_open(), all the userspace
processes are still frozen, and the userspace daemon (e.g. hv_kvp_daemon)
can be in any of these states:

1. the driver has not buffered any message for the daemon. This is good.

2. the driver has buffered a message for the daemon, and
kvp_transaction.state is HVUTIL_USERSPACE_REQ. Later, the kvp daemon
writes the response to the driver, and in kvp_on_msg()
kvp_transaction.state is moved to HVUTIL_USERSPACE_RECV, but since
cancel_delayed_work_sync(&kvp_timeout_work) is false (the work has
been cancelled by util_suspend()), the driver reports nothing to the host,
which is good as the VM has undergone a hibernation event and IMO the
response may be stale and I believe the host is not expecting this
response from the VM at all (the host side application that requested the
KVP must have failed or timed out), but the bad thing is: the "state"
remains in HVUTIL_USERSPACE_RECV, preventing
hv_kvp_onchannelcallback() from reading from the channel ringbuffer.

I suggest we work around this race condition by the below patch:

--- a/drivers/hv/hv_kvp.c
+++ b/drivers/hv/hv_kvp.c

@@ -250,8 +250,8 @@ static int kvp_on_msg(void *msg, int len)
         */
        if (cancel_delayed_work_sync(&kvp_timeout_work)) {
                kvp_respond_to_host(message, error);
-               hv_poll_channel(kvp_transaction.recv_channel,

kvp_poll_wrapper);

quoted

        }
+       hv_poll_channel(kvp_transaction.recv_channel, kvp_poll_wrapper);

        return 0;
 }

How do you like this?

Is it safe to call hv_poll_channel() simultaneously on several CPUs? It
seems it is as we're doing

smp_call_function_single(channel->target_cpu, cb, channel, true);

Looks safe to me. The code has been there for years. :-)

(I'm asking because if it's not, than doing what you suggest will open
the following window: timeout function kvp_timeout_func() is already
running but the daemon is trying to reply at the same moment).

quoted

I can imagine there is still a small chance that the state machine can run
out of order, and the kvp daemon may exit due to the return values of
read()/write() being -EINVAL, but the chance should be small enough in
practice, and IMO the issue even exists in the current code when
hibernation is not involved, e.g. kvp_timeout_func() and kvp_on_msg()
can run concurrently; if kvp_on_msg() runs slowly due to some reason
(e.g. the kvp daemon is stopped as I'm gdb'ing it), kvp_timeout_func()
fires and moves the state to HVUTIL_READY; later kvp_on_msg() starts
to run and returns -EINVAL, and the kvp daemon will exit().

IMHO here it looks extremely difficult to make things flawless (if that's
even possible), so it's necessary to ask the daemons to auto-restart once
they exit() unexpectedly. This can be achieved by configuring systemd
properly for the kvp/vss/fcopy services.

I think we can also teach daemons to ignore -EINVAL or switch to
something like -EAGAIN in non-fatal cases.

Maybe the driver should return 0 rather than -EINVAL for the kvp daemon
in some scenarios, since kvp is never 100% reliable.

quoted

I'm not sure introducing a HVUTIL_SUSPENDED state would solve all
of the corner cases, but I'm sure that would further complicate the
current code, at least to me. :-)

Maybe, if we don't need it than we don't. Basically, I see the only
advantage in having such state: it makes our tricks to support
hibernation more visible in the code.
Vitaly

Let me think about this. 

BTW, for vss, maybe the VM should not hibernate if there is a backup 
ongoing? -- if the file system is frozen by hv_vss_daemon, and the VM
hibernates, then when the VM resumes back, it's almost always true that 
the VM won't receive the host's VSS_OP_THAW request, and the VM will
end up in an unusable state.

Thanks,
-- Dexuan

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help