Thread (6 messages) 6 messages, 3 authors, 2023-02-21

Re: Reoccurring 5 second delays during NFS calls

From: Florian Möller <hidden>
Date: 2023-02-21 13:34:14

Hi all,

unfortunately the patch did not help, neither using -o async nor using -o sync. 
We did some more tests (which is the reason for the delay of this reply):

We used a qemu image housing both the NFS server and the client and did some 
kernel debugging.
OS: Ubuntu 22.04.1
Kernel: 5.15.78
Mount line: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,
	namlen=255,hard,proto=tcp,timeo=600,retrans=2,
	sec=krb5p,clientaddr=10.0.0.1,
	local_lock=none,addr=10.0.0.1

We touched a file and then touched the file again. This triggers the delay reliably.

We set breakpoints on all functions starting with nfs4 and on 
"update_open_stateid". The attached file "1sttouch.log" contains a gdb log of 
the first touch.
"2ndtouch.log" shows the gdb output of the second touch. The delay occurs in 
line 116 in update_open_stateid.

We then deleted all breakpoints and set a sole breakpoint on 
update_open_stateid. We touched the file again and used only the "next" command 
of gdb. The gdb output is in "2ndtouch-next.log", the delay occurs in line 8.

Please let us know if you need more information or if you want us to perform 
further tests.

Best regards,
Florian Möller

Am 07.02.23 um 16:21 schrieb Jeff Layton:
On Tue, 2023-02-07 at 11:58 +0100, Florian Möller wrote:
quoted
Hi all,

we are currently in the process of migrating our file server infrastructure to
NFS. In our test environments, the following problem has now occurred several
times in certain situations:

A previously very fast NFS file operation suddenly takes 5 seconds longer - per
file. This leads to applications running very slowly and severely delayed file
operations.

Here are the details:

NFS server:
OS: Ubuntu 22.04.1, all patches installed
Kernel: Ubuntu Mainline, Versions
	6.1.7-060107-generic_6.1.7-060107.202301181200
          6.1.8-060108_6.1.8-060108.202301240742
          6.1.9-060109_6.1.9-060109.202302010835
Security options: all Kerberos security options are affected
   (The bug does not seem to occur without Kerberos security.)
Output of exportfs -v:
/export
gss/krb5p(async,wdelay,hide,crossmnt,no_subtree_check,fsid=0,sec=sys,rw,secure,root_squash,no_all_squash)
/export
gss/krb5(async,wdelay,hide,crossmnt,no_subtree_check,fsid=0,sec=sys,rw,secure,root_squash,no_all_squash)

I see you're using the -o async export option. Note that you may end up
with corrupt data on a server reboot (see exports(5) manpage).

Assuming you're aware of this and want to keep that anyway, then the
patch I just posted to the mailing list may help you, if the stalls are
coming during CLOSE operations:

https://lore.kernel.org/linux-nfs/9137413986ba9c2e83c030513fa9ae3358f30a85.camel@kernel.org/T/#mcb88f091263d07d8b9c13e6cc5ce0a0413d3f761 (local)
quoted
Client 1:
OS: Arch Linux, all patches installed
Kernel: 6.1.9-arch1-2, 6.1.9-arch1-1
Mount-Line: servername:/ on /nfs type nfs4
(rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=krb5p,clientaddr=XX.XX.XX.XX,local_lock=none,addr=YY.YY.YY.YY,_netdev)
krb5: 1.20.1-1
libevent: 2.1.12-4
nfs-utils: 2.6.2-1
util-linux: 2.38.1-1


Client 2:
OS: openSuSE 15.4, all patches installed
Kernel: 5.14.21-150400.24.41-default
Mount-Line: servername:/ on /nfs type nfs4
(rw,relatime,sync,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,softerr,softreval,noac,noresvport,proto=tcp,timeo=1,retrans=2,sec=krb5,clientaddr=XX.XX.XX.XX,lookupcache=none,local_lock=none,addr=YY.YY.YY.YY)

libgssapi3: 7.8.0-bp154.2.4.1
libevent: 2.1.8-2.23
nfs-client: 2.1.1-150100.10.27.1
util-linux: 2.37.2-140400.8.14.1


The error occurs for example if a file is touched twice:

touch testfile && echo "done" && touch testfile && echo "and again"

However, touching a large number of files (about 10000) with (pairwise)
different filenames works fast.


Here is another example that triggers the error:

1st step: create many files (shell code in Z-shell syntax):

for i in {1..10000}; do
echo "test" > $i.txt
done

This is fast.

2nd step:

for i in {1..10000}; do
echo $i
cat $i.txt
done

This takes 5 seconds per cat(1) call.
After unmounting and mounting, the 2nd step also runs quickly at first. But
after executing the 2nd step several times in a row, the error occurs again
(quite soon, after the 2nd or 3rd execution).

We were not able to reproduce the error without a Kerberos security type set.


Attached are a log from the server and from the client. In both cases

rpcdebug -m nfs -s all
rpcdebug -m nfsd -s all
rpcdebug -m rpc -s all

was set.


Best regards,
Florian Möller

-- 
Dr. Florian Möller
Universität Würzburg
Institut für Mathematik
Emil-Fischer-Straße 30
97074 Würzburg, Germany
Tel. +49 931 3185596

Attachments

Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help