Re: [PATCH] SUNRPC: remove the maximum number of retries in call_bind_status

From: Jeff Layton <jlayton@kernel.org>
Date: 2023-04-06 18:11:02

On Thu, 2023-04-06 at 13:33 -0400, Jeff Layton wrote:

On Tue, 2023-03-14 at 09:19 -0700, dai.ngo@oracle.com wrote:

quoted

On 3/8/23 11:03 AM, dai.ngo@oracle.com wrote:

quoted

On 3/8/23 10:50 AM, Chuck Lever III wrote:

quoted

On Mar 8, 2023, at 1:45 PM, Dai Ngo [off-list ref] wrote:

Currently call_bind_status places a hard limit of 3 to the number of
retries on EACCES error. This limit was done to accommodate the 
behavior
of a buggy server that keeps returning garbage when the NLM daemon is
killed on the NFS server. However this change causes problem for other
servers that take a little longer than 9 seconds for the port mapper to


Actually, the EACCES error means that the host doesn't have the port
registered. That could happen if (e.g.) the host had a NFSv3 mount up
with an NLM connection and then crashed and rebooted and didn't remount
it.

quoted

become ready when the NFS server is restarted.

This patch removes this hard coded limit and let the RPC handles
the retry according to whether the export is soft or hard mounted.

To avoid the hang with buggy server, the client can use soft mount for
the export.

Fixes: 0b760113a3a1 ("NLM: Don't hang forever on NLM unlock requests")
Reported-by: Helen Chao <redacted>
Tested-by: Helen Chao <redacted>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>

Helen is the royal queen of ^C  ;-)

Did you try ^C on a mount while it waits for a rebind?

She uses a test script that restarts the NFS server while NLM lock test
is running. The failure is random, sometimes it fails and sometimes it
passes depending on when the LOCK/UNLOCK requests come in so I think
it's hard to time it to do the ^C, but I will ask.

We did the test with ^C and here is what we found.

For synchronous RPC task the signal was delivered to the RPC task and
the task exit with -ERESTARTSYS from __rpc_execute as expected.

For asynchronous RPC task the process that invokes the RPC task to send
the request detected the signal in rpc_wait_for_completion_task and exits
with -ERESTARTSYS. However the async RPC was allowed to continue to run
to completion. So if the async RPC task was retrying an operation and
the NFS server was down, it will retry forever if this is a hard mount
or until the NFS server comes back up.

The question for the list is should we propagate the signal to the async
task via rpc_signal_task to stop its execution or just leave it alone as is.

That is a good question.

I like the patch overall, as it gets rid of a special one-off retry
counter, but I too share some concerns about retrying indefinitely when
an server goes missing.

quoted

Propagating a signal seems like the right thing to do. Looks like
rpcb_getport_done would also need to grow a check for RPC_SIGNALLED ?

It sounds pretty straightforward otherwise.

Erm, except that some of these xprts are in the context of the server.

For instance: server-side lockd sometimes has to send messages to the
clients (e.g. GRANTED callbacks). Suppose we're trying to send a message
to the client, but it has crashed and rebooted...or maybe the client's
address got handed to another host that isn't doing NFS at all so the
NLM service will never come back.

This will mean that those RPCs will retry forever now in this situation.
I'm not sure that's what we want. Maybe we need some way to distinguish
between "user-driven" RPCs and those that aren't?

As a simpler workaround, would it work to just increase the number of
retries here? There's nothing magical about 3 tries. ISTM that having it
retry enough times to cover at least a minute or two would be
acceptable.
-- 
Jeff Layton [off-list ref]

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help