Thread (6 messages) 6 messages, 3 authors, 2023-02-21

Re: Reoccurring 5 second delays during NFS calls

From: Jeff Layton <jlayton@kernel.org>
Date: 2023-02-07 15:22:05

On Tue, 2023-02-07 at 11:58 +0100, Florian Möller wrote:
Hi all,

we are currently in the process of migrating our file server infrastructure to 
NFS. In our test environments, the following problem has now occurred several 
times in certain situations:

A previously very fast NFS file operation suddenly takes 5 seconds longer - per 
file. This leads to applications running very slowly and severely delayed file 
operations.

Here are the details:

NFS server:
OS: Ubuntu 22.04.1, all patches installed
Kernel: Ubuntu Mainline, Versions
	6.1.7-060107-generic_6.1.7-060107.202301181200
         6.1.8-060108_6.1.8-060108.202301240742
         6.1.9-060109_6.1.9-060109.202302010835
Security options: all Kerberos security options are affected
  (The bug does not seem to occur without Kerberos security.)
Output of exportfs -v:
/export 
gss/krb5p(async,wdelay,hide,crossmnt,no_subtree_check,fsid=0,sec=sys,rw,secure,root_squash,no_all_squash)
/export 
gss/krb5(async,wdelay,hide,crossmnt,no_subtree_check,fsid=0,sec=sys,rw,secure,root_squash,no_all_squash)

I see you're using the -o async export option. Note that you may end up
with corrupt data on a server reboot (see exports(5) manpage).

Assuming you're aware of this and want to keep that anyway, then the
patch I just posted to the mailing list may help you, if the stalls are
coming during CLOSE operations:

https://lore.kernel.org/linux-nfs/9137413986ba9c2e83c030513fa9ae3358f30a85.camel@kernel.org/T/#mcb88f091263d07d8b9c13e6cc5ce0a0413d3f761 (local)
Client 1:
OS: Arch Linux, all patches installed
Kernel: 6.1.9-arch1-2, 6.1.9-arch1-1
Mount-Line: servername:/ on /nfs type nfs4 
(rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=krb5p,clientaddr=XX.XX.XX.XX,local_lock=none,addr=YY.YY.YY.YY,_netdev)
krb5: 1.20.1-1
libevent: 2.1.12-4
nfs-utils: 2.6.2-1
util-linux: 2.38.1-1


Client 2:
OS: openSuSE 15.4, all patches installed
Kernel: 5.14.21-150400.24.41-default
Mount-Line: servername:/ on /nfs type nfs4 
(rw,relatime,sync,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,softerr,softreval,noac,noresvport,proto=tcp,timeo=1,retrans=2,sec=krb5,clientaddr=XX.XX.XX.XX,lookupcache=none,local_lock=none,addr=YY.YY.YY.YY) 

libgssapi3: 7.8.0-bp154.2.4.1
libevent: 2.1.8-2.23
nfs-client: 2.1.1-150100.10.27.1
util-linux: 2.37.2-140400.8.14.1


The error occurs for example if a file is touched twice:

touch testfile && echo "done" && touch testfile && echo "and again"

However, touching a large number of files (about 10000) with (pairwise) 
different filenames works fast.


Here is another example that triggers the error:

1st step: create many files (shell code in Z-shell syntax):

for i in {1..10000}; do
echo "test" > $i.txt
done

This is fast.

2nd step:

for i in {1..10000}; do
echo $i
cat $i.txt
done

This takes 5 seconds per cat(1) call.
After unmounting and mounting, the 2nd step also runs quickly at first. But 
after executing the 2nd step several times in a row, the error occurs again 
(quite soon, after the 2nd or 3rd execution).

We were not able to reproduce the error without a Kerberos security type set.


Attached are a log from the server and from the client. In both cases

rpcdebug -m nfs -s all
rpcdebug -m nfsd -s all
rpcdebug -m rpc -s all

was set.


Best regards,
Florian Möller

-- 
Jeff Layton [off-list ref]
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help