Thread (6 messages) 6 messages, 3 authors, 2021-02-24

Re: [PATCH] nfsd: set RPC_CLNT_CREATE_NO_IDLE_TIMEOUT on callback client

From: Chuck Lever <chuck.lever@oracle.com>
Date: 2021-02-22 21:48:16

Possibly related (same subject, not in this thread)

On Feb 22, 2021, at 4:03 PM, Timo Rothenpieler [off-list ref] wrote:

On 21.02.2021 20:26, Chuck Lever wrote:
quoted
Thanks, Timo. A handful of nits below:
quoted
On Feb 21, 2021, at 1:27 PM, Timo Rothenpieler [off-list ref] wrote:

This tackles an issue where the callback client times out from
inactivity, causing operations like server side copy to never return on
the client side.
I was observing that issue frequently on my RDMA connected clients, it
does not seem to manifest on tcp connected clients.
Indeed, it is curious that the COPY issue does not occur on TCP
connections. You could try using the same tracing technique to
collect some data on TCP to see what is different.
I did pretty much the same procedure, mount, copy, wait 10 minutes, copy again, just with it mounted via TCP instead of RDMA.
(And I made sure to boot the server on an unpatched kernel).
Worked perfectly fine.
The pcap and trace are attached.

To me it looks like it just never hits xprt_disconnect_auto, despite the no timeout flag not being set.
Maybe it gets set implicitly somewhere for TCP? Or there is some keep alive going on for TCP, that never gives it a chance to time out.
net/sunrpc/xprtsock.c:

2999         /* backchannel */
3000         xprt_set_bound(xprt);
3001         xprt->bind_timeout = 0;
3002         xprt->reestablish_timeout = 0;
3003         xprt->idle_timeout = 0;

But no similar setting for net/sunrpc/xprtrdma/svc_rdma_backchannel.c:

253         xprt_set_bound(xprt);
254         xprt_set_connected(xprt);
255         xprt->bind_timeout = RPCRDMA_BIND_TO;
256         xprt->reestablish_timeout = RPCRDMA_INIT_REEST_TO;
257         xprt->idle_timeout = RPCRDMA_IDLE_DISC_TO;

And perhaps these have been wrong since 5d252f90a800 ("svcrdma: Add
class for RDMA backwards direction transport").

Try replacing your previous patch with one that changes all these
backchannel settings to 0. It should have the same effect as using
NO_IDLE_TIMEOUT.

quoted
quoted
However, it does not fix the actual issue of the callback channel
not getting re-established and the client being stuck in the call
forever. It just makes it a lot less likely to occur, as long as no
other circumstances cause the callback channel to be disconnected.
Agreed. I'm hoping Olga or Dai will look further into why recovery
is failing in this case (and whether that missing recovery action
is also observed only on RDMA transports!).
Please add a Signed-off-by: tag. See the "Sign your work" section
in Documentation/process/submitting-patches.rst
<trace.dat.xz><traffic.pcap.xz>
--
Chuck Lever


Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help