Re: [PATCH v5 1/2] net/handshake: Create a NETLINK service for handling... | netdev

Re: [PATCH v5 1/2] net/handshake: Create a NETLINK service for handling handshake requests

From: Hannes Reinecke <hare@suse.de>
Date: 2023-02-28 15:48:45

On 2/28/23 15:28, Chuck Lever III wrote:

quoted

On Feb 28, 2023, at 1:58 AM, Hannes Reinecke [off-list ref] wrote:

On 2/27/23 19:10, Chuck Lever III wrote:

quoted

On Feb 27, 2023, at 12:21 PM, Hannes Reinecke [off-list ref] wrote:

quoted

On 2/27/23 16:39, Chuck Lever III wrote:

quoted

On Feb 27, 2023, at 10:14 AM, Hannes Reinecke [off-list ref] wrote:

Problem here is with using different key materials.
As the current handshake can only deal with one key at a time
the only chance we have for several possible keys is to retry
the handshake with the next key.
But out of necessity we have to use the _same_ connection
(as tlshd doesn't control the socket). So we cannot close
the socket, and hence we can't notify userspace to give up the
handshake attempt.
Being able to send a signal would be simple: send SIGHUP to userspace
and wait for the 'done' call.
If it doesn't come we can terminate all attempts.
But if we get the 'done' call we know it's safe to start with the next attempt.

We solve this problem by enabling the kernel to provide all those
materials to tlshd in one go.

Ah. Right, that would work, too; provide all possible keys to the
'accept' call and let the userspace agent figure out what to do with
them. That certainly makes life easier for the kernel side.

quoted

I don't think there's a "retry" situation here. Once the handshake
has failed, the client peer has to know to try again. That would
mean retrying would have to be part of the upper layer protocol.
Does an NVMe initiator know it has to drive another handshake if
the first one fails, or does it rely on the handshake itself to
try all available identities?
We don't have a choice but to provide all the keys at once and
let the handshake negotiation deal with it.
I'm working on DONE passing multiple remote peer IDs back to the
kernel now. I don't see why ACCEPT couldn't pass multiple peer IDs
the other way.

Nope. That's not required.
DONE can only ever have one peer id (TLS 1.3 specifies that the client
sends a list of identities, the server picks one, and sends that one back
to the client). So for DONE we will only ever have 1 peer ID.
If we allow for several peer IDs to be present in the client ACCEPT message
then we'd need to include the resulting peer ID in the client DONE, too;
otherwise we'll need it for the server DONE only.
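
To illustrate why DONE only ever carries one peer ID: in a TLS 1.3 PSK
handshake the client offers a list of identities and the server answers
with the index of the one it selected. The sketch below models just that
selection step. The function name and the dictionary-based key store are
hypothetical and are not taken from tlshd or the kernel.

```python
from typing import Dict, List, Optional


def select_psk_identity(offered: List[str],
                        known_keys: Dict[str, bytes]) -> Optional[int]:
    """Return the 0-based index of the first offered identity we hold a key for.

    Returns None when no offered identity matches, i.e. no PSK can be used.
    """
    for index, identity in enumerate(offered):
        if identity in known_keys:
            return index
    return None
```

Whatever the length of the offered list, the result is a single identity
(or none), which is the one value a DONE message would need to report.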

So all in all I think we should be going with the multiple IDs in the
ACCEPT call (i.e. move the key id from being part of the message into an
attribute), and have a peer id present in the DONE call for both versions,
server and client.

To summarize:
---
The ACCEPT request (from tlshd) would have just the handler class
"Which handler is responding". The kernel uses that to find a
handshake request waiting for that type of handler. In our case,
"tlshd".
The ACCEPT response (from the kernel) would have the socket fd,
the handshake parameters, and zero or more peer ID key serial
numbers. (Today, just zero or one peer IDs).

There is also an errno status in the ACCEPT response, which
the kernel can use to indicate things like "no requests in that
class were found" or that the request was otherwise improperly
formed.
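
A minimal sketch of the ACCEPT response described above, assuming a plain
in-memory representation rather than real netlink attributes. The class
name, field names, and the helper are hypothetical. Only the contents
(status, socket fd, handshake parameters, zero or more peer ID key serial
numbers) come from the summary.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AcceptResponse:
    status: int                      # errno-style status; 0 means a request was found
    sockfd: int = -1                 # socket fd handed to the handler
    params: Dict[str, str] = field(default_factory=dict)  # handshake parameters
    peer_ids: List[int] = field(default_factory=list)     # key serial numbers


def keys_to_try(resp: AcceptResponse) -> List[int]:
    """Return the key serials the handler may offer, or nothing on error."""
    if resp.status != 0:
        return []
    return list(resp.peer_ids)
```

Modelling the peer IDs as a list covers today's zero-or-one case and the
multiple-ID case with the same shape.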
---
The DONE request (from tlshd) would have the socket fd (and
implicitly, the handler's PID), the session status, and zero
or one remote peer ID key serial numbers.

The DONE response (from the kernel) is an ACK. (Today it's
more than that, but that's broken and will be removed).
---
For the DONE request, the session status is one of:
0: session established -- see @peerid for authentication status
EIO: local error
EACCES: handshake rejected
For server handshake completion:
@peerid contains the remote peer ID if the session was
authenticated, or TLS_NO_PEERID if the session was not
authenticated.
status == EACCES if authentication material was present from
both peers but verification failed.
For client handshake completion:
@peerid contains the remote peer ID if authentication was
requested and the session was authenticated
status == EACCES if authentication was requested and the
session was not authenticated, or if verification failed.
(Maybe client could work like the server side, and the
kernel consumer would need to figure out if it cares
whether there was authentication).
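
The status rules above can be sketched as a small decision function. This
follows the server-side rules, which is also how the client would behave
under the parenthetical suggestion. The value chosen for TLS_NO_PEERID and
the returned strings are placeholders for this example only.

```python
import errno

TLS_NO_PEERID = 0  # placeholder value, assumed for this sketch


def interpret_done(status: int, peerid: int = TLS_NO_PEERID) -> str:
    """Map a DONE (status, peerid) pair to an outcome for the kernel consumer."""
    if status == 0:
        # Session established; peerid says whether the peer authenticated.
        if peerid == TLS_NO_PEERID:
            return "established, unauthenticated"
        return "established, authenticated"
    if status == errno.EACCES:
        return "handshake rejected"
    if status == errno.EIO:
        return "local error"
    return "unexpected status"
```

With this shape the kernel consumer, not the handler, decides whether an
unauthenticated session is acceptable.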

Yes, that would be my preference. Always return @peerid
for DONE if the TLS session was established.

You mean if the TLS session was authenticated. The server
won't receive a remote peer identity if the client peer
doesn't authenticate.

Ah, yes, forgot about that.
(PSK always 'authenticates', as the identity is what is used to
find the appropriate PSK ...)

quoted

We might also consider returning @peerid with EACCES
to indicate the offending ID.

I'll look into that.

quoted

Is that adequate?

Yes, it is.

What about the narrow set of DONE status values? You've
recently wanted to add ENOMEM, ENOKEY, and EINVAL to
this set. My experience is that these status values are
nearly always obscured before they can get back to the
requesting user.

Can the kernel make use of ENOMEM, for example? It might
be able to retry, I suppose... retrying is not sensible
for the server side.

The usual problem: Retry or no retry.
Sadly, error numbers are not a good indicator of that.
Maybe we should take the NVMe approach and add a _different_
attribute indicating whether this particular error status
should be retried.
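
A rough sketch of that idea, assuming a hypothetical boolean attribute
carried alongside the error status in DONE. The class, field, and function
names are made up for illustration. The rule that the server side never
retries reflects the remark quoted above.

```python
import errno
from dataclasses import dataclass


@dataclass
class DoneStatus:
    status: int              # 0 or an errno value
    may_retry: bool = False  # hypothetical separate "retry is sensible" attribute


def should_retry(done: DoneStatus, is_server: bool) -> bool:
    """Decide on a retry from the explicit attribute, not from the errno."""
    if done.status == 0:
        return False  # nothing to retry after success
    if is_server:
        return False  # retrying is not sensible for the server side
    return done.may_retry
```

The errno then only describes what went wrong, while the decision to retry
comes from the side that actually knows whether it can help.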

quoted

So the only bone of contention is the timeout; as we won't
be implementing signals I still think that we should have
a 'timeout' attribute, if only to feed the TLS timeout
parameter for gnutls ...

I'm still not seeing the case for making it an individual
parameter for each handshake request. Maybe a config
parameter, if a short timeout is actually needed... even
then, maybe a built-in timeout is preferable to yet another
tuning knob that can be abused.

The problem I see is that the kernel side needs to make forward
progress eventually, and calling into userspace is a good recipe
for violating that principle.
Sending a timeout value as a netlink parameter has the advantage
that both sides are aware that there _is_ a timeout.
The alternative would be an unconditional wait in the kernel,
and a very real possibility of a stuck process.

I'd like to see some testing results to determine that a
short timeout is the only way to handle corner cases.

Short timeouts are especially useful for testing and debugging;
timeout handlers are prone to issues, and hence need a really good
bashing to hash out issues.
And not having a timeout is also not a good idea, see above.

But yeah, in theory we could use a configuration timeout in tlshd.

In the end, it's _just_ another netlink attribute, which might
(or might not) be present. Which replaces a built-in value.
I hadn't thought this to be such an issue ...
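
A minimal sketch of an optional attribute that replaces a built-in value,
assuming the received attributes have already been parsed into a
dictionary. The attribute key "timeout", the default of 20 seconds, and the
function name are assumptions for this example and do not come from the
patch.

```python
from typing import Mapping

DEFAULT_TIMEOUT_SECS = 20  # hypothetical built-in value


def effective_timeout(attrs: Mapping[str, int]) -> int:
    """Use the 'timeout' attribute when present, else the built-in default."""
    timeout = attrs.get("timeout")
    if timeout is None:
        return DEFAULT_TIMEOUT_SECS
    if timeout <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return timeout
```

Either way a timeout always applies, so neither side can end up waiting
unconditionally.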

Cheers,

Hannes
