Thread: 14 messages, 3 authors, 2016-08-19

nvme/rdma initiator stuck on reboot

From: Steve Wise <hidden>
Date: 2016-08-18 17:59:29

Btw, in that case the patch is not actually correct, as even workqueue
with a higher concurrency level MAY deadlock under enough memory
pressure.  We'll need separate workqueues to handle this case I think.

Yes?  And the reconnect worker was never completing?  Why is that?  Here are a
few tidbits about iWARP connections:  address resolution == neighbor discovery.
So if the neighbor is unreachable, it will take a few seconds for the OS to give
up and fail the resolution.  If the neigh entry is valid and the peer becomes
unreachable during connection setup, it might take 60 seconds or so for a
connect operation to give up and fail.  So this is probably slowing the
reconnect thread down.  But shouldn't the reconnect thread notice that a delete
is trying to happen and bail out?

I think we should aim for a state machine that can detect this, but
we'll have to see if that will end up in synchronization overkill.

Looking at the state machine, I don't see why the reconnect thread would get
stuck continually rescheduling once the controller was deleted.  The change from
RECONNECTING to DELETING is done by nvme_change_ctrl_state(), called from
__nvme_rdma_del_ctrl().  Once that happens, the thread running the reconnect
logic should stop rescheduling, due to this check in the failure path of
nvme_rdma_reconnect_ctrl_work():

...
requeue:
        /* Make sure we are not resetting/deleting */
        if (ctrl->ctrl.state == NVME_CTRL_RECONNECTING) {
                dev_info(ctrl->ctrl.device,
                        "Failed reconnect attempt, requeueing...\n");
                queue_delayed_work(nvme_rdma_wq, &ctrl->reconnect_work,
                                        ctrl->reconnect_delay * HZ);
        }
...

So something isn't happening like I think it is, I guess.
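
To make that expectation concrete, here is a minimal userspace sketch of the
requeue decision.  It is not the driver code: struct model_ctrl,
model_change_state() and model_reconnect_failed() are hypothetical names, the
three states are a simplified stand-in for the real controller states, and
treating DELETING as terminal is an assumption of the model.  It only covers
the check at the requeue: label; it says nothing about a worker that is still
blocked inside a connect attempt and so has not reached that check yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the controller states. */
enum model_state { CTRL_LIVE, CTRL_RECONNECTING, CTRL_DELETING };

struct model_ctrl {
	enum model_state state;
	int requeue_count;	/* how many times the worker rearmed itself */
};

/*
 * Model of a state change.  Assumption: DELETING is terminal, so no
 * transition out of it is allowed.  Returns true if the state changed.
 */
bool model_change_state(struct model_ctrl *c, enum model_state new_state)
{
	assert(c != NULL);
	if (c->state == CTRL_DELETING)
		return false;
	c->state = new_state;
	return true;
}

/*
 * Model of the failure path of the reconnect worker: requeue only while
 * the controller is still RECONNECTING.  Returns true if it requeued.
 */
bool model_reconnect_failed(struct model_ctrl *c)
{
	assert(c != NULL);
	if (c->state == CTRL_RECONNECTING) {
		c->requeue_count++;
		return true;
	}
	return false;
}
```

In this model, a failed attempt while RECONNECTING requeues once; after a
successful change to DELETING the next failed attempt does not requeue, which
is the behavior I would expect from the real worker.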

Also, even with the alloc_workqueue() change, a reboot during reconnect gets
stuck.  I never see the controllers getting deleted nor the unplug event handler
happening, so the reconnect thread seems to hang the shutdown/reboot...