Re: Reoccurring 5 second delays during NFS calls
From: Florian Möller <hidden>
Date: 2023-02-21 13:34:14
Hi all, unfortunately the patch did not help, neither using -o async nor using -o sync. We did some more tests (which is the reason for the delay of this reply): We used a qemu image housing both the NFS server and the client and did some kernel debugging. OS: Ubuntu 22.04.1 Kernel: 5.15.78 Mount line: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576, namlen=255,hard,proto=tcp,timeo=600,retrans=2, sec=krb5p,clientaddr=10.0.0.1, local_lock=none,addr=10.0.0.1 We touched a file and then touched the file again. This triggers the delay reliably. We set breakpoints on all functions starting with nfs4 and on "update_open_stateid". The attached file "1sttouch.log" contains a gdb log of the first touch. "2ndtouch.log" shows the gdb output of the second touch. The delay occurs in line 116 in update_open_stateid. We then deleted all breakpoints and set a sole breakpoint on update_open_stateid. We touched the file again and used only the "next" command of gdb. The gdb output is in "2ndtouch-next.log", the delay occurs in line 8. Please let us know if you need more information or if you want us to perform further tests. Best regards, Florian Möller Am 07.02.23 um 16:21 schrieb Jeff Layton:
On Tue, 2023-02-07 at 11:58 +0100, Florian Möller wrote:quoted
Hi all, we are currently in the process of migrating our file server infrastructure to NFS. In our test environments, the following problem has now occurred several times in certain situations: A previously very fast NFS file operation suddenly takes 5 seconds longer - per file. This leads to applications running very slowly and severely delayed file operations. Here are the details: NFS server: OS: Ubuntu 22.04.1, all patches installed Kernel: Ubuntu Mainline, Versions 6.1.7-060107-generic_6.1.7-060107.202301181200 6.1.8-060108_6.1.8-060108.202301240742 6.1.9-060109_6.1.9-060109.202302010835 Security options: all Kerberos security options are affected (The bug does not seem to occur without Kerberos security.) Output of exportfs -v: /export gss/krb5p(async,wdelay,hide,crossmnt,no_subtree_check,fsid=0,sec=sys,rw,secure,root_squash,no_all_squash) /export gss/krb5(async,wdelay,hide,crossmnt,no_subtree_check,fsid=0,sec=sys,rw,secure,root_squash,no_all_squash)I see you're using the -o async export option. Note that you may end up with corrupt data on a server reboot (see exports(5) manpage). Assuming you're aware of this and want to keep that anyway, then the patch I just posted to the mailing list may help you, if the stalls are coming during CLOSE operations: https://lore.kernel.org/linux-nfs/9137413986ba9c2e83c030513fa9ae3358f30a85.camel@kernel.org/T/#mcb88f091263d07d8b9c13e6cc5ce0a0413d3f761 (local)quoted
Client 1: OS: Arch Linux, all patches installed Kernel: 6.1.9-arch1-2, 6.1.9-arch1-1 Mount-Line: servername:/ on /nfs type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=krb5p,clientaddr=XX.XX.XX.XX,local_lock=none,addr=YY.YY.YY.YY,_netdev) krb5: 1.20.1-1 libevent: 2.1.12-4 nfs-utils: 2.6.2-1 util-linux: 2.38.1-1 Client 2: OS: openSuSE 15.4, all patches installed Kernel: 5.14.21-150400.24.41-default Mount-Line: servername:/ on /nfs type nfs4 (rw,relatime,sync,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,softerr,softreval,noac,noresvport,proto=tcp,timeo=1,retrans=2,sec=krb5,clientaddr=XX.XX.XX.XX,lookupcache=none,local_lock=none,addr=YY.YY.YY.YY) libgssapi3: 7.8.0-bp154.2.4.1 libevent: 2.1.8-2.23 nfs-client: 2.1.1-150100.10.27.1 util-linux: 2.37.2-140400.8.14.1 The error occurs for example if a file is touched twice: touch testfile && echo "done" && touch testfile && echo "and again" However, touching a large number of files (about 10000) with (pairwise) different filenames works fast. Here is another example that triggers the error: 1st step: create many files (shell code in Z-shell syntax): for i in {1..10000}; do echo "test" > $i.txt done This is fast. 2nd step: for i in {1..10000}; do echo $i cat $i.txt done This takes 5 seconds per cat(1) call. After unmounting and mounting, the 2nd step also runs quickly at first. But after executing the 2nd step several times in a row, the error occurs again (quite soon, after the 2nd or 3rd execution). We were not able to reproduce the error without a Kerberos security type set. Attached are a log from the server and from the client. In both cases rpcdebug -m nfs -s all rpcdebug -m nfsd -s all rpcdebug -m rpc -s all was set. Best regards, Florian Möller
-- Dr. Florian Möller Universität Würzburg Institut für Mathematik Emil-Fischer-Straße 30 97074 Würzburg, Germany Tel. +49 931 3185596
Attachments
- 1sttouch.log [text/x-log] 7936 bytes · preview
- 2ndtouch.log [text/x-log] 8500 bytes · preview
- 2ndtouch-next.log [text/x-log] 1200 bytes · preview