Re: [syzbot] [nfs?] INFO: task hung in nfsd_nl_listener_set_doit

From: Jeff Layton <jlayton@kernel.org>
Date: 2024-10-11 19:15:06
Also in: linux-nfs, lkml

On Fri, 2024-10-11 at 18:18 +0000, Chuck Lever III wrote:

quoted

On Oct 9, 2024, at 4:26 PM, Jeff Layton [off-list ref] wrote:

On Wed, 2024-09-04 at 10:23 -0400, Chuck Lever wrote:

quoted

On Mon, Sep 02, 2024 at 11:57:55AM +1000, NeilBrown wrote:

quoted

On Sun, 01 Sep 2024, syzbot wrote:

quoted

syzbot has found a reproducer for the following issue on:

I had a poke around using the provided disk image and kernel for
exploring.

I think the problem is demonstrated by this stack :

[<0>] rpc_wait_bit_killable+0x1b/0x160
[<0>] __rpc_execute+0x723/0x1460
[<0>] rpc_execute+0x1ec/0x3f0
[<0>] rpc_run_task+0x562/0x6c0
[<0>] rpc_call_sync+0x197/0x2e0
[<0>] rpcb_register+0x36b/0x670
[<0>] svc_unregister+0x208/0x730
[<0>] svc_bind+0x1bb/0x1e0
[<0>] nfsd_create_serv+0x3f0/0x760
[<0>] nfsd_nl_listener_set_doit+0x135/0x1a90
[<0>] genl_rcv_msg+0xb16/0xec0
[<0>] netlink_rcv_skb+0x1e5/0x430

No rpcbind is running on this host so that "svc_unregister" takes a
long time.  Maybe not forever but if a few of these get queued up all
blocking some other thread, then maybe that pushed it over the limit.

The fact that rpcbind is not running might not be relevant as the test
messes up the network.  "ping 127.0.0.1" stops working.

So this bug comes down to "we try to contact rpcbind while holding a
mutex and if that gets no response and no error, then we can hold the
mutex for a long time".

Are we surprised? Do we want to fix this?  Any suggestions how?

In the past, we've tried to address "hanging upcall" issues where
the kernel part of an administrative command needs a user space
service that isn't working or present. (eg mount needing a running
gssd)

If NFSD is using the kernel RPC client for the upcall, then maybe
adding the RPC_TASK_SOFTCONN flag might turn the hang into an
immediate failure.

IMO this should be addressed.

I sent a patch that does the above, but now I'm wondering if we ought
to take another approach. The listener array can be pretty long. What
if we instead were to just drop and reacquire the mutex in the loop at
strategic points? Then we wouldn't squat on the mutex for so long. 

Something like this maybe? It's ugly but it might prevent hung task
warnings, and listener setup isn't a fastpath anyway.

diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index 3adbc05ebaac..5de01fb4c557 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c

@@ -2042,7 +2042,9 @@ int nfsd_nl_listener_set_doit(struct sk_buff *skb, struct genl_info *info)

               set_bit(XPT_CLOSE, &xprt->xpt_flags);
               spin_unlock_bh(&serv->sv_lock);

               svc_xprt_close(xprt);
+
+               /* ensure we don't squat on the mutex for too long */
+               mutex_unlock(&nfsd_mutex);
+               mutex_lock(&nfsd_mutex);
               spin_lock_bh(&serv->sv_lock);
       }

@@ -2082,6 +2084,10 @@ int nfsd_nl_listener_set_doit(struct sk_buff *skb, struct genl_info *info)
               /* always save the latest error */
               if (ret < 0)
                       err = ret;
+
+               /* ensure we don't squat on the mutex for too long */
+               mutex_unlock(&nfsd_mutex);
+               mutex_lock(&nfsd_mutex);
       }

       if (!serv->sv_nrthreads && list_empty(&nn->nfsd_serv->sv_permsocks))

I had a look at the rpcb upcall code a couple of weeks ago.
I'm not convinced that setting SOFTCONN in all cases will
help but unfortunately the reasons for my skepticism have
all but leaked out of my head.

Releasing and re-acquiring the mutex is often a sign of
a deeper problem.

It might look that way, but in this case we're iterating over the list
in the netlink call. There's no reason we need to hold the nfsd_mutex
over that entire process. Probably we can reduce the scope a bit
further and make this look a little less sketchy. I'll send a revised
patch.

I think you're in the right vicinity
but I'd like to better understand the actual cause of
the delay. The listener list shouldn't be all that long,
but maybe it has a unintentional loop in it?

I don't think that's possible given that this is in a netlink request,
but I'm no expert there.

I wish we had a reproducer for these issues.

The old interface only allowed one listener to be set at a time. The
new one allows a list, and I don't think it's bounded in any way. You
could send down hundreds of listeners at once, and if they all end up
hanging for a little while trying to talk to rpcbind, then that could
easily cause the hung task warning.

-- 
Jeff Layton [off-list ref]

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help