Re: [PATCH v5 1/2] net/handshake: Create a NETLINK service for handling handshake requests
From: Hannes Reinecke <hare@suse.de>
Date: 2023-02-28 15:48:45
On 2/28/23 15:28, Chuck Lever III wrote:
quoted
On Feb 28, 2023, at 1:58 AM, Hannes Reinecke [off-list ref] wrote: On 2/27/23 19:10, Chuck Lever III wrote:quoted
quoted
On Feb 27, 2023, at 12:21 PM, Hannes Reinecke [off-list ref] wrote:quoted
On 2/27/23 16:39, Chuck Lever III wrote:quoted
On Feb 27, 2023, at 10:14 AM, Hannes Reinecke [off-list ref] wrote: Problem here is with using different key materials. As the current handshake can only deal with one key at a time the only chance we have for several possible keys is to retry the handshake with the next key. But out of necessity we have to use the _same_ connection (as tlshd doesn't control the socket). So we cannot close the socket, and hence we can't notify userspace to give up the handshake attempt. Being able to send a signal would be simple; sending SIGHUP to userspace, and wait for the 'done' call. If it doesn't come we can terminate all attempts. But if we get the 'done' call we know it's safe to start with the next attempt.We solve this problem by enabling the kernel to provide all those materials to tlshd in one go.Ah. Right, that would work, too; provide all possible keys to the 'accept' call and let the userspace agent figure out what to do with them. That makes life certainly easier for the kernel side.quoted
I don't think there's a "retry" situation here. Once the handshake has failed, the client peer has to know to try again. That would mean retrying would have to be part of the upper layer protocol. Does an NVMe initiator know it has to drive another handshake if the first one fails, or does it rely on the handshake itself to try all available identities? We don't have a choice but to provide all the keys at once and let the handshake negotiation deal with it. I'm working on DONE passing multiple remote peer IDs back to the kernel now. I don't see why ACCEPT couldn't pass multiple peer IDs the other way.Nope. That's not required. DONE can only ever have one peer id (TLS 1.3 specifies that the client sends a list of identities, the server picks one, and sends that one back to the client). So for DONE we will only ever have 1 peer ID. If we allow for several peer IDs to be present in the client ACCEPT message then we'd need to include the resulting peer ID in the client DONE, too; otherwise we'll need it for the server DONE only. So all in all I think we should be going with the multiple IDs in the ACCEPT call (ie move the key id from being part of the message into an attribute), and have a peer id present in the DONE all for both versions, server and client.To summarize: --- The ACCEPT request (from tlshd) would have just the handler class "Which handler is responding". The kernel uses that to find a handshake request waiting for that type of handler. In our case, "tlshd". The ACCEPT response (from the kernel) would have the socket fd, the handshake parameters, and zero or more peer ID key serial numbers. (Today, just zero or one peer IDs).quoted
There is also an errno status in the ACCEPT response, whichthe kernel can use to indicate things like "no requests in that class were found" or that the request was otherwise improperly formed. --- The DONE request (from tlshd) would have the socket fd (and implicitly, the handler's PID), the session status, and zero or one remote peer ID key serial numbers.quoted
The DONE response (from the kernel) is an ACK. (Today it'smore than that, but that's broken and will be removed). --- For the DONE request, the session status is one of: 0: session established -- see @peerid for authentication status EIO: local error EACCES: handshake rejected For server handshake completion: @peerid contains the remote peer ID if the session was authenticated, or TLS_NO_PEERID if the session was not authenticated. status == EACCES if authentication material was present from both peers but verification failed. For client handshake completion: @peerid contains the remote peer ID if authentication was requested and the session was authenticated status == EACCES if authentication was requested and the session was not authenticated, or if verification failed. (Maybe client could work like the server side, and the kernel consumer would need to figure out if it cares whether there was authentication).Yes, that would be my preference. Always return @peerid for DONE if the TLS session was established.You mean if the TLS session was authenticated. The server won't receive a remote peer identity if the client peer doesn't authenticate.
Ah, yes, forgot about that. (PSK always 'authenticate' as the identity is that used to find the appropriate PSK ...)
quoted
We might also consider returning @peerid with EACCESS to indicate the offending ID.I'll look into that.quoted
quoted
Is that adequate?Yes, it is.What about the narrow set of DONE status values? You've recently wanted to add ENOMEM, ENOKEY, and EINVAL to this set. My experience is that these status values are nearly always obscured before they can get back to the requesting user. Can the kernel make use of ENOMEM, for example? It might be able to retry, I suppose... retrying is not sensible for the server side.
The usual problem: Retry or no retry. Sadly error numbers are no good indicator to that. Maybe we should take the NVMe approach and add a _different_ attribute indicating whether this particular error status should be retried.
quoted
So the only bone of contention is the timeout; as we won't be implementing signals I still think that we should have a 'timeout' attribute. And if only to feed the TLS timeout parameter for gnutls ...I'm still not seeing the case for making it an individual parameter for each handshake request. Maybe a config parameter, if a short timeout is actually needed... even then, maybe a built-in timeout is preferable to yet another tuning knob that can be abused.
The problem I see is that the kernel-side needs to make forward progress eventually, and calling into userspace is a good recipe of violating that principle. Sending a timeout value as a netlink parameter has the advantage the both sides are aware that there _is_ a timeout. The alternative would be an unconditional wait in the kernel, and a very real possibility of a stuck process.
I'd like to see some testing results to determine that a short timeout is the only way to handle corner cases.
Short timeouts are especially useful for testing and debugging; timeout handlers are prone to issues, and hence need a really good bashing to hash out issues. And not having a timeout is also not a good idea, see above. But yeah, in theory we could use a configuration timeout in tlshd. In the end, it's _just_ another netlink attribute, which might (or might not) be present. Which replaces a built-in value. I hadn't thought this to be such an issue ... Cheers, Hannes