Re: Trying to reduce NFSv4 timeouts to a few seconds on an established connection

From: Jeff Layton <jlayton@kernel.org>
Date: 2023-01-27 13:33:19

On Thu, 2023-01-26 at 22:08 +0000, Andrew Klaassen wrote:

quoted

From: Andrew Klaassen <redacted>
Sent: Thursday, January 26, 2023 10:32 AM

quoted

From: Andrew Klaassen <redacted>
Sent: Monday, January 23, 2023 11:31 AM

Hello,

There's a specific NFSv4 mount on a specific machine which we'd
like
to timeout and return an error after a few seconds if the server
goes away.

I've confirmed the following on two different kernels, 4.18.0-
348.12.2.el8_5.x86_64 and 6.1.7-200.fc37.x86_64.

I've been able to get both autofs and the mount command to
cooperate,
so that the mount attempt fails after an arbitrary number of
seconds.
This mount command, for example, will fail after 6 seconds, as
expected based on the timeo=20,retrans=2,retry=0 options:

$ time sudo mount -t nfs4 -o
rw,relatime,sync,vers=4.2,rsize=131072,wsize=131072,namlen=255,acr
egmi
n

=0,acregmax=0,acdirmin=0,acdirmax=0,soft,noac,proto=tcp,timeo=20,ret
ra

quoted

n s=2,retry=0,sec=sys thor04:/mnt/thorfs04  /mnt/thor04
mount.nfs4: Connection timed out

real    0m6.084s
user    0m0.007s
sys     0m0.015s

However, if the share is already mounted and the server goes away,
the
timeout is always 2 minutes plus the time I expect based on timeo
and
retrans.  In this case, 2 minutes and 6 seconds:

$ time ls /mnt/thor04
ls: cannot access '/mnt/thor04': Connection timed out

real    2m6.025s
user    0m0.003s
sys     0m0.000s

Watching the outgoing packets in the second case, the pattern is
always the
same:
 - 0.2 seconds between the first two, then doubling each time
until
the two minute mark is exceeded (so the last NFS packet, which is
always the 11th packet, is sent around 1:45 after the first).
 - Then some generic packets that start exactly-ish on the two
minute
mark, 1 second between the first two, then doubling each time. 
(By
this time the NFS command has given up.)

11:10:21.898305 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.],
seq
14452:14652, ack 18561, win 501, options [nop,nop,TS val 834889483
ecr
1589769203], length 200: NFS request xid 3614904256 196 getattr fh
0,2/53
11:10:22.105189 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.],
seq
14452:14652, ack 18561, win 501, options [nop,nop,TS val 834889690
ecr
1589769203], length 200: NFS request xid 3614904256 196 getattr fh
0,2/53
11:10:22.313290 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.],
seq
14452:14652, ack 18561, win 501, options [nop,nop,TS val 834889898
ecr
1589769203], length 200: NFS request xid 3614904256 196 getattr fh
0,2/53
11:10:22.721269 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.],
seq
14452:14652, ack 18561, win 501, options [nop,nop,TS val 834890306
ecr
1589769203], length 200: NFS request xid 3614904256 196 getattr fh
0,2/53
11:10:23.569192 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.],
seq
14452:14652, ack 18561, win 501, options [nop,nop,TS val 834891154
ecr
1589769203], length 200: NFS request xid 3614904256 196 getattr fh
0,2/53
11:10:25.233212 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.],
seq
14452:14652, ack 18561, win 501, options [nop,nop,TS val 834892818
ecr
1589769203], length 200: NFS request xid 3614904256 196 getattr fh
0,2/53
11:10:28.497282 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.],
seq
14452:14652, ack 18561, win 501, options [nop,nop,TS val 834896082
ecr
1589769203], length 200: NFS request xid 3614904256 196 getattr fh
0,2/53
11:10:35.025219 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.],
seq
14452:14652, ack 18561, win 501, options [nop,nop,TS val 834902610
ecr
1589769203], length 200: NFS request xid 3614904256 196 getattr fh
0,2/53
11:10:48.337201 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.],
seq
14452:14652, ack 18561, win 501, options [nop,nop,TS val 834915922
ecr
1589769203], length 200: NFS request xid 3614904256 196 getattr fh
0,2/53
11:11:14.449303 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.],
seq
14452:14652, ack 18561, win 501, options [nop,nop,TS val 834942034
ecr
1589769203], length 200: NFS request xid 3614904256 196 getattr fh
0,2/53
11:12:08.721251 IP 10.30.13.2.916 > 10.31.3.13.2049: Flags [P.],
seq
14452:14652, ack 18561, win 501, options [nop,nop,TS val 834996306
ecr
1589769203], length 200: NFS request xid 3614904256 196 getattr fh
0,2/53
11:12:22.545394 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags [S],
seq
1375256951, win 64240, options [mss 1460,sackOK,TS val 835010130
ecr
0,nop,wscale 7], length 0
11:12:23.570199 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags [S],
seq
1375256951, win 64240, options [mss 1460,sackOK,TS val 835011155
ecr
0,nop,wscale 7], length 0
11:12:25.617284 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags [S],
seq
1375256951, win 64240, options [mss 1460,sackOK,TS val 835013202
ecr
0,nop,wscale 7], length 0
11:12:29.649219 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags [S],
seq
1375256951, win 64240, options [mss 1460,sackOK,TS val 835017234
ecr
0,nop,wscale 7], length 0
11:12:37.905274 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags [S],
seq
1375256951, win 64240, options [mss 1460,sackOK,TS val 835025490
ecr
0,nop,wscale 7], length 0
11:12:54.289212 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags [S],
seq
1375256951, win 64240, options [mss 1460,sackOK,TS val 835041874
ecr
0,nop,wscale 7], length 0
11:13:26.545304 IP 10.30.13.2.942 > 10.31.3.13.2049: Flags [S],
seq
1375256951, win 64240, options [mss 1460,sackOK,TS val 835074130
ecr
0,nop,wscale 7], length 0

I tried changing tcp_retries2 as suggested in another thread from
this list:

# echo 3 > /proc/sys/net/ipv4/tcp_retries2

...but it made no difference on either kernel.  The 2 minute
timeout
also doesn't seem to match with what I'd calculate from the
initial
value of tcp_retries2, which should give a much higher timeout.

The only clue I've been able to find is in the retry=n entry in
the
NFS
manpage:

" For TCP the default is 3 minutes, but system TCP connection
timeouts
will sometimes limit the timeout of each retransmission to around
2

minutes."

quoted

What I'm not able to make sense of:
 - The retry option says that it applies to mount operations, not
read/write operations.  However, in this case I'm seeing the 2
minute
delay on read/write operations but *not* mount operations.
 - A couple of hours of searching didn't lead me to any kernel
settings that would result in a 2 minute timeout.

Does anyone have any clues about a) what's happening and b) how to
get
our desired behaviour of being able to control both mount and
read/write timeouts down to a few seconds?

Thanks.

I thought that changing TCP_RTO_MAX in include/net/tcp.h from 120 to
something smaller and recompiling the kernel would change the 2
minute
timeout, but it had no effect.  I'm going to keep poking through the
kernel
code to see if there's a knob I can turn to change the 2 minute
timeout, so
that I can at least understand where it's coming from.

Any hints as to where I should be looking?

I believe I've made some progress with this today:

 - Calls to rpc_create() from fs/nfs/client.c are sending an
rpc_timeout struct with their args.
 - rpc_create() does *not* pass the timeout on to
xprt_create_transport(), which then can't pass it on to
xs_setup_tcp().
 - xs_setup_tcp(), having no timeout passed to it, uses
xs_tcp_default_timeout instead.
 - changing xs_tcp_default_timeout changes the "ls" timeout behaviour
I described above.

In theory all of this means that the timeout simply needs to be passed
through and used instead of xs_tcp_default_timeout.  I'm going to give
this a try tomorrow.

That's a great root-cause analysis. The interlocking timeouts involved
with NFS and its sockets can be really difficult to unwind.

Is there a way to automate this testcase? That might be nice to have in
xfstests or the nfstest suite.

quoted hunk ↗ jump to hunk

Here's what I'm going to try first; I'm no C programmer, though, so
any advice or corrections you might have would be appreciated.

Thanks.

Andrew

diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 0b0b9f1eed46..1350c1f489f7 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c

@@ -532,6 +532,7 @@ struct rpc_clnt *rpc_create(struct rpc_create_args

*args)
                .addrlen = args->addrsize,
                .servername = args->servername,
                .bc_xprt = args->bc_xprt,
+               .timeout = args->timeout,
        };
        char servername[48];
        struct rpc_clnt *clnt;

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index aaa5b2741b79..adc79d94b59e 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c

@@ -3003,7 +3003,7 @@ static struct rpc_xprt *xs_setup_tcp(struct

xprt_create *args)
        xprt->idle_timeout = XS_IDLE_DISC_TO;

        xprt->ops = &xs_tcp_ops;
-       xprt->timeout = &xs_tcp_default_timeout;
+       xprt->timeout = args->timeout;

        xprt->max_reconnect_timeout = xprt->timeout->to_maxval;
        xprt->connect_timeout = xprt->timeout->to_initval *

Looks like you're probably on the right track. You're missing a few
things:

You'll need to add a "timeout" field to struct xprt_create in
include/linux/sunrpc/xprt.h, and there may be some other places that
either need to set the timeout in that structure, or do something with
that field when it's set.

Once you have something that fixes your reproducer, go ahead and post it
and we can help you work through whatever changes need to me made to
make it work.

Nice work!
-- 
Jeff Layton [off-list ref]

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help