Re: [PATCH 2/2] nvmet: fix a race condition between release_queue and io_work

From: Sagi Grimberg <sagi@grimberg.me>
Date: 2021-11-14 10:28:57
Subsystem: nvm express target driver, the rest · Maintainers: Christoph Hellwig, Sagi Grimberg, Chaitanya Kulkarni, Linus Torvalds


On 11/12/21 12:54 PM, Maurizio Lombardi wrote:

quoted hunk ↗ jump to hunk

Hi Sagi,

On Thu, Nov 04, 2021 at 02:59:53PM +0200, Sagi Grimberg wrote:

quoted

Right, after the call to cancel_work_sync we will know that io_work
is not running. Note that it can run as a result of a backend completion
but that is ok and we do want to let it run and return completion to the
host, but the socket should already be shut down for recv, so we cannot
get any other byte from the network.


I did some tests and I found out that kernel_recvmsg() sometimes returns
data even if the socket has been already shut down (maybe it's data it received
before the call to kernel_sock_shutdown() and waiting in some internal buffer?).

So when nvmet_sq_destroy() triggered io_work, recvmsg() still returned data
and the kernel crashed again despite the socket was closed.

Therefore, I think that after we shut down the socket we
should let io_work run and requeue itself until it finishes its job
and no more data is returned by recvmsg(),
one way to achieve this is to repeatedly call flush_work() until it returns
false.

Right now I am testing the patch below and it works perfectly.

Note that when the socket is closed recvmsg() might return 0,
nvmet_tcp_try_recv_data() should return -EAGAIN
in that case otherwise we end up in an infinite loop (io_work
will countinously requeue itself).

diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 2f03a94725ae..7b441071c6b9 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c

@@ -1139,8 +1139,10 @@ static int nvmet_tcp_try_recv_data(struct nvmet_tcp_queue *queue)
  	while (msg_data_left(&cmd->recv_msg)) {
  		ret = sock_recvmsg(cmd->queue->sock, &cmd->recv_msg,
  			cmd->recv_msg.msg_flags);
-		if (ret <= 0)
+		if (ret < 0)
  			return ret;
+		else if (ret == 0)
+			return -EAGAIN;
  
  		cmd->pdu_recv += ret;
  		cmd->rbytes_done += ret;

@@ -1446,8 +1450,10 @@ static void nvmet_tcp_release_queue_work(struct work_struct *w)
  	list_del_init(&queue->queue_list);
  	mutex_unlock(&nvmet_tcp_queue_mutex);
  
+	kernel_sock_shutdown(queue->sock, SHUT_RD);
+
  	nvmet_tcp_restore_socket_callbacks(queue);

I think you should shutdown the socket after you restore the socket
callbacks, and after that cancel_work_sync to deny a self-requeue.

-	flush_work(&queue->io_work);
+	while (flush_work(&queue->io_work));
  
  	nvmet_tcp_uninit_data_in_cmds(queue);
  	nvmet_sq_destroy(&queue->nvme_sq);

How about something like this:
--

diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 6eb0b3153477..6425375faf5b 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c

@@ -1436,7 +1436,9 @@ static void nvmet_tcp_release_queue_work(struct

work_struct *w)
         mutex_unlock(&nvmet_tcp_queue_mutex);

         nvmet_tcp_restore_socket_callbacks(queue);
-       flush_work(&queue->io_work);
+       /* stop accepting incoming data */
+       kernel_sock_shutdown(queue->sock, SHUT_RD);
+       cancel_work_sync(&queue->io_work);

         nvmet_tcp_uninit_data_in_cmds(queue);
         nvmet_sq_destroy(&queue->nvme_sq);
--

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help