Re: NVMe induced NULL deref in bt_iter()
From: Ming Lei <hidden>
Date: 2017-07-03 09:40:01
Also in: linux-nvme
On Sun, Jul 02, 2017 at 02:56:56PM +0300, Sagi Grimberg wrote:
> On 02/07/17 13:45, Max Gurtovoy wrote:
>> On 6/30/2017 8:26 PM, Jens Axboe wrote:
>>> Hi Max,
>>
>> Hi Jens,
>>
>>> I remembered you reporting this. I think this is a regression
>>> introduced with the scheduling, since ->rqs[] isn't static anymore.
>>> ->static_rqs[] is, but that's not indexable by the tag we find. So I
>>> think we need to guard those with a NULL check. The actual requests
>>> themselves are static, so we know the memory itself isn't going away.
>>> But if we race with completion, we could find a NULL there, validly.
>>>
>>> Since you could reproduce it, can you try the below?
>>
>> I still can repro the null deref with this patch applied.
>>
>>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>>> index d0be72ccb091..b856b2827157 100644
>>> --- a/block/blk-mq-tag.c
>>> +++ b/block/blk-mq-tag.c
>>> @@ -214,7 +214,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>>  		bitnr += tags->nr_reserved_tags;
>>>  	rq = tags->rqs[bitnr];
>>> -	if (rq->q == hctx->queue)
>>> +	if (rq && rq->q == hctx->queue)
>>>  		iter_data->fn(hctx, rq, iter_data->data, reserved);
>>>  	return true;
>>>  }
>>> @@ -249,8 +249,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>>  	if (!reserved)
>>>  		bitnr += tags->nr_reserved_tags;
>>>  	rq = tags->rqs[bitnr];
>>> -
>>> -	iter_data->fn(rq, iter_data->data, reserved);
>>> +	if (rq)
>>> +		iter_data->fn(rq, iter_data->data, reserved);
>>>  	return true;
>>>  }
>>
>> see the attached file for dmesg output.
>>
>> output of gdb:
>>
>> (gdb) list *(blk_mq_flush_busy_ctxs+0x48)
>> 0xffffffff8127b108 is in blk_mq_flush_busy_ctxs
>> (./include/linux/sbitmap.h:234).
>> 229
>> 230		for (i = 0; i < sb->map_nr; i++) {
>> 231			struct sbitmap_word *word = &sb->map[i];
>> 232			unsigned int off, nr;
>> 233
>> 234			if (!word->word)
>> 235				continue;
>> 236
>> 237			nr = 0;
>> 238			off = i << sb->shift;
>>
>> when I change the "if (!word->word)" to "if (word && !word->word)"
>> I can get null deref at "nr = find_next_bit(&word->word, word->depth,
>> nr);". Seems like somehow word becomes NULL.
>>
>> Adding the linux-nvme guys too.
>> Sagi has mentioned that this can be null only if we remove the tagset
>> while I/O is trying to get a tag and when killing the target we get into
>> error recovery and periodic reconnects, which does _NOT_ include freeing
>> the tagset, so this is probably the admin tagset.
>>
>> Sagi,
>> you've mentioned a patch for centralizing the treatment of the admin
>> tagset to the nvme core. I think I missed this patch, so can you please
>> send a pointer to it and I'll check if it helps?
>
> Hmm,
>
> In the above flow we should not be freeing the tag_set, not on admin
> either. The target keeps removing namespaces and finally removes the
> subsystem, which generates an error recovery flow. What we at least try
> to do is:
>
> 1. mark rdma queues as not live
> 2. stop all the sw queues (admin and io)
> 3. fail inflight I/Os
> 4. restart all sw queues (to fast fail until we recover)
>
> We shouldn't be freeing the tagsets (although we might update them
> when we recover and cpu map changed - which I don't think is happening).
>
> However, I do see a difference between bt_tags_for_each
> and blk_mq_flush_busy_ctxs (checks tags->rqs not being NULL).
>
> Unrelated to this I think we should quiesce/unquiesce the admin_q
> instead of stop/start because it respects the submission path rcu [1].
>
> It might hide the issue, but given that we never free the tagset it
> seems like it's not in nvme-rdma (Max, can you see if this makes the
> issue go away?)
>
> [1]:
> --
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index e3996db22738..094873a4ee38 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -785,7 +785,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
>  	if (ctrl->ctrl.queue_count > 1)
>  		nvme_stop_queues(&ctrl->ctrl);
> -	blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
> +	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
>  	/* We must take care of fastfail/requeue all our inflight requests */
>  	if (ctrl->ctrl.queue_count > 1)
> @@ -798,7 +798,8 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
>  	 * queues are not a live anymore, so restart the queues to fail fast
>  	 * new IO
>  	 */
> -	blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
> +	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
> +	blk_mq_kick_requeue_list(ctrl->ctrl.admin_q);
>  	nvme_start_queues(&ctrl->ctrl);
>  	nvme_rdma_reconnect_or_remove(ctrl);
> @@ -1651,7 +1652,7 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl)
>  	if (test_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags))
>  		nvme_shutdown_ctrl(&ctrl->ctrl);
> -	blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
> +	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
>  	blk_mq_tagset_busy_iter(&ctrl->admin_tag_set, nvme_cancel_request, &ctrl->ctrl);
>  	nvme_rdma_destroy_admin_queue(ctrl);

Yeah, the above change is correct. Whenever requests are canceled in
this way, we should use blk_mq_quiesce_queue().

Thanks,
Ming