Re: [PATCH v2 6/7] nvme-tcp: Support KeyUpdate

From: Alistair Francis <hidden>
Date: 2025-09-17 03:14:42
Also in: linux-doc, linux-nfs, linux-nvme, lkml

On Tue, Sep 16, 2025 at 11:04 PM Hannes Reinecke [off-list ref] wrote:

On 9/5/25 04:46, alistair23@gmail.com wrote:

quoted

From: Alistair Francis <redacted>

If the nvme_tcp_try_send() or nvme_tcp_try_recv() functions return
EKEYEXPIRED then the underlying TLS keys need to be updated. This occurs
on an KeyUpdate event.

If the NVMe Target (TLS server) initiates a KeyUpdate this patch will
allow the NVMe layer to process the KeyUpdate request and forward the
request to userspace. Userspace must then update the key to keep the
connection alive.

This patch allows us to handle the NVMe target sending a KeyUpdate
request without aborting the connection. At this time we don't support
initiating a KeyUpdate.

Link: https://datatracker.ietf.org/doc/html/rfc8446#section-4.6.3
Signed-off-by: Alistair Francis <redacted>
---
v2:
  - Don't change the state
  - Use a helper function for KeyUpdates
  - Continue sending in nvme_tcp_send_all() after a KeyUpdate
  - Remove command message using recvmsg

  drivers/nvme/host/tcp.c | 73 +++++++++++++++++++++++++++++++++++++++--
  1 file changed, 70 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 776047a71436..b6449effc2ac 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c

@@ -171,6 +171,7 @@ struct nvme_tcp_queue {
      bool                    tls_enabled;
      u32                     rcv_crc;
      u32                     snd_crc;
+     key_serial_t            user_session_id;
      __le32                  exp_ddgst;
      __le32                  recv_ddgst;
      struct completion       tls_complete;

@@ -210,6 +211,7 @@ static int nvme_tcp_start_tls(struct nvme_ctrl *nctrl,
                            struct nvme_tcp_queue *queue,
                            key_serial_t pskid,
                            handshake_key_update_type keyupdate);
+static void update_tls_keys(struct nvme_tcp_queue *queue);

  static inline struct nvme_tcp_ctrl *to_tcp_ctrl(struct nvme_ctrl *ctrl)
  {

@@ -393,6 +395,14 @@ static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
      do {
              ret = nvme_tcp_try_send(queue);
      } while (ret > 0);
+
+     if (ret == -EKEYEXPIRED) {
+             update_tls_keys(queue);
+
+             do {
+                     ret = nvme_tcp_try_send(queue);
+             } while (ret > 0);
+     }
  }

  static inline bool nvme_tcp_queue_has_pending(struct nvme_tcp_queue *queue)

@@ -1347,6 +1357,8 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
  done:
      if (ret == -EAGAIN) {
              ret = 0;
+     } else if (ret == -EKEYEXPIRED) {
+             goto out;
      } else if (ret < 0) {
              dev_err(queue->ctrl->ctrl.device,
                      "failed to send request %d\n", ret);

@@ -1371,9 +1383,56 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
      queue->nr_cqe = 0;
      consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
      release_sock(sk);
+
+     /* If we received EINVAL from read_sock then it generally means the
+      * other side sent a command message. So let's try to clear it from
+      * our queue with a recvmsg, otherwise we get stuck in an infinite
+      * loop.
+      */
+     if (consumed == -EINVAL) {
+             char cbuf[CMSG_LEN(sizeof(char))] = {};
+             struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
+             struct bio_vec bvec;
+
+             bvec_set_virt(&bvec, (void *)cbuf, sizeof(cbuf));
+             iov_iter_bvec(&msg.msg_iter, ITER_DEST, &bvec, 1, sizeof(cbuf));
+
+             msg.msg_control = cbuf;
+             msg.msg_controllen = sizeof(cbuf);
+
+             consumed = sock_recvmsg(sock, &msg, msg.msg_flags);
+     }
+
      return consumed == -EAGAIN ? 0 : consumed;
  }

+static void update_tls_keys(struct nvme_tcp_queue *queue)
+{
+     int qid = nvme_tcp_queue_id(queue);
+     int ret;
+
+     dev_dbg(queue->ctrl->ctrl.device,
+             "updating key for queue %d\n", qid);
+
+     cancel_work(&queue->io_work);
+     handshake_req_cancel(queue->sock->sk);
+     handshake_sk_destruct_req(queue->sock->sk);
+

Careful here. The RFC fully expects to have several KeyUpdate requests
pending (eg if both sides decide so initiate a KeyUpdate at the same
time). And cancelling a handshake request would cause tlshd/gnutls
to lose track of the generation counter and generate an invalid
traffic secret.
I would just let it rip and don't bother with other handshake
requests.

Unfortunately that doesn't work as future calls to
`handshake_req_hash_add()` will fail.

I now think that's a bug in `handshake_complete()` and I have a better
fix in the next version.

quoted

+     nvme_stop_keep_alive(&(queue->ctrl->ctrl));
+     flush_work(&(queue->ctrl->ctrl).async_event_work);
+

Oh bugger. Seems like gnutls is generating the KeyUpdate message
itself, and we have to wait for that.

Yes, we have gnutls generate the message.

So much for KeyUpdate being transparent without having to stop I/O...

Can't we fix gnutls to make sending the KeyUpdate message and changing
the IV parameters an atomic operation? That would be a far better

I'm not sure I follow.

ktls-utils will first restore the gnutls session. Then have gnutls
trigger a KeyUpdate.gnutls will send a KeyUpdate and then tell the
kernel the new keys. The kernel cannot send or encrypt any data after
the KeyUpdate has been sent until the keys are updated.

I don't see how we could make it an atomic operation. We have to stop
the traffic between sending a KeyUpdate and updating the keys.
Otherwise we will send invalid data.

interface, as then we would not need to stop I/O and the handshake
process could run fully asynchronous to normal I/O...

quoted

+     ret = nvme_tcp_start_tls(&(queue->ctrl->ctrl),
+                              queue, queue->ctrl->ctrl.tls_pskid,
+                              HANDSHAKE_KEY_UPDATE_TYPE_RECEIVED);
+
+     if (ret < 0) {
+             dev_err(queue->ctrl->ctrl.device,
+                     "failed to update the keys %d\n", ret);
+             nvme_tcp_fail_request(queue->request);
+             nvme_tcp_done_send_req(queue);
+     }
+}
+
  static void nvme_tcp_io_work(struct work_struct *w)
  {
      struct nvme_tcp_queue *queue =

@@ -1389,15 +1448,21 @@ static void nvme_tcp_io_work(struct work_struct *w)
                      mutex_unlock(&queue->send_mutex);
                      if (result > 0)
                              pending = true;
-                     else if (unlikely(result < 0))
+                     else if (unlikely(result < 0)) {
+                             if (result == -EKEYEXPIRED)
+                                     update_tls_keys(queue);

How exactly can we get -EKEYEXPIRED when _sending_?

Good point. You can't with this current patch set. I have patches on
top of this that will generate a KeyUpate as part of the send
operation, which I plan to submit after this series.

So this is a bit of prep work to setup the NVMe layer to handle
sending and receiving KeyUpdate requests. I can drop this change from
the series if that's prefered?

To my understanding that would have required userspace to intercept
here trying (or even sending) a KeyUpdate message, right?

Not necessarily. The TLS layer can trigger a KeyUpdate independent of
userspace. This would happen for example if the sequence count was
about to overflow, which is what I use in my testing. Userspace has no
idea of the current sequence number, so it can't be involved. The
kernel will need to start the KeyUpdate send if the rec_seq is about
to overflow.

So really not something we should see during normal operation.
As mentioned in my previous mail we should rather code the
KeyUpdate process itself here, too.
Namely: Trigger the KeyUpdate via userspace (eg by writing into the
tls_key attribute for the controller), and then have the kernel side
to call out into tlshd to initiate the KeyUpdate 'handshake'.

Yeah, I agree about exposing a way for userspace to trigger an update.
That would only be for testing though, as in normal operation
userspace has no insight into the current connection state. In a
production system the kernel TLS layer will need to initiate a
KeyUpdate.

That way we have identical flow of control for both the sending
and receiving side.

Incidentally: the RFC has some notion about 'request_update' setting
in the KeyUpdate message. Is that something we have to care about at
this level?

It is something we will need to care about. At this stage it isn't
supported as it adds a little bit of complexity, but I should be able
to extend the current approach to support a request_update.

Alistair

Cheers,

Hannes
--
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help