Re: question about the performance impact of sec=krb5

From: Rick Macklem <hidden>
Date: 2023-02-13 04:30:41

On Sun, Feb 12, 2023 at 9:47 AM Chuck Lever III [off-list ref] wrote:

quoted

On Feb 12, 2023, at 1:01 AM, Wang Yugui [off-list ref] wrote:

Hi,

question about the performance of sec=krb5.

https://learn.microsoft.com/en-us/azure/azure-netapp-files/performance-impact-kerberos
Performance impact of krb5:
      Average IOPS decreased by 53%
      Average throughput decreased by 53%
      Average latency increased by 3.2 ms

Looking at the numbers in this article... they don't
seem quite right. Here are the others:

quoted

Performance impact of krb5i:
      • Average IOPS decreased by 55%
      • Average throughput decreased by 55%
      • Average latency increased by 0.6 ms
Performance impact of krb5p:
      • Average IOPS decreased by 77%
      • Average throughput decreased by 77%
      • Average latency increased by 1.6 ms

I would expect krb5p to be the worst in terms of
latency. And I would like to see round-trip numbers
reported: what part of the increase in latency is
due to server versus client processing?

This is also remarkable:

quoted

When nconnect is used in Linux, the GSS security context is shared between all the nconnect connections to a particular server. TCP is a reliable transport that supports out-of-order packet delivery to deal with out-of-order packets in a GSS stream, using a sliding window of sequence numbers. When packets not in the sequence window are received, the security context is discarded, and a new security context is negotiated. All messages sent within the now-discarded context are no longer valid, thus requiring the messages to be sent again. A larger number of packets in an nconnect setup causes frequent out-of-window packets, triggering the described behavior. No specific degradation percentages can be stated with this behavior.


So, does this mean that nconnect makes the GSS sequence
window problem worse, or that when a window underrun
occurs it has broader impact because multiple connections
are affected?

Seems like maybe nconnect should set up a unique GSS
context for each xprt. It would be helpful to file a bug.

Here's a snippet from RFC 2203:
   In a successful response, the seq_window field is set to the sequence
   window length supported by the server for this context.  This window
   specifies the maximum number of client requests that may be
   outstanding for this context. The server will accept "seq_window"
   requests at a time, and these may be out of order.  The client may
   use this number to determine the number of threads that can
   simultaneously send requests on this context.

It would be interesting to know what window size NetApp filers specify
in the reply when context initialization completes.
A simple fix might be to get NetApp to increase the window, since they
have observed the problem.
FreeBSD servers use 128.  I have no idea what other servers use.
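
To make the window semantics concrete, here is a minimal sketch of the
server-side check as I read RFC 2203 (section 5.3.3.1). The class and
method names are made up for illustration, and this is not taken from
any real server's code. Note that the RFC has the server silently drop
out-of-window and repeated requests; what a given client or filer does
after that (e.g. discarding the context) is outside this sketch.

```python
class SeqWindow:
    """Sketch of a per-context RPCSEC_GSS sequence window (hypothetical names)."""

    def __init__(self, seq_window=128):
        # 128 is the value FreeBSD servers use per this thread; others may differ.
        self.seq_window = seq_window
        self.highest = 0      # highest sequence number seen so far
        self.seen = set()     # sequence numbers already seen inside the window

    def accept(self, seq_num):
        """Return True if the request would be processed, False if dropped."""
        if seq_num > self.highest:
            # New high-water mark: slide the window forward and forget
            # anything that has fallen off the low end.
            self.highest = seq_num
            low = self.highest - self.seq_window + 1
            self.seen = {s for s in self.seen if s >= low}
            self.seen.add(seq_num)
            return True
        if seq_num <= self.highest - self.seq_window:
            return False      # below the window: dropped
        if seq_num in self.seen:
            return False      # repeated within the window: dropped
        self.seen.add(seq_num)
        return True           # out of order, but still inside the window


# Tiny window so the effect is easy to see.
demo = SeqWindow(seq_window=4)
results = [demo.accept(s) for s in (1, 2, 5, 3, 3, 1)]
# results == [True, True, True, True, False, False]
```

With a window of 4, request 3 arriving after 5 is still accepted, the
replay of 3 is dropped, and 1 is dropped because the window has moved
past it. A larger window tolerates more reordering before drops start.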

rick

quoted

and then in 'man 5 nfs'
sec=krb5  provides cryptographic proof of a user's identity in each RPC request.

Kerberos has performance impacts due to the
cryptographic operations that are performed on even
small fixed-sized sections of each RPC message, when
using sec=krb5 (no 'i' or 'p').

quoted

Is there an option for better performance that checks krb5 only at
mount.nfs4 time, not on file access?

If you mount with NFSv4 and sec=sys from a Linux NFS
client that has a keytab, the client will attempt to
use krb5i for lease management operations (such as
EXCHANGE_ID) but it will continue to use sec=sys for
user authentication. That's not terribly secure.

A better answer would be to make Kerberos faster.
I've done some recent work on improving the overhead
of using message digest algorithms with GSS-API, but
haven't done any specific measurement. I'm sure
there's more room for optimization.

Even better would be to use a transport layer security
service. Amazon has EFS and Oracle Cloud has something
similar, but we're working on a standard approach that
uses TLSv1.3.


--
Chuck Lever
