Re: Spurious instability with NFSoRDMA under moderate load

From: Chuck Lever III <chuck.lever@oracle.com>
Date: 2021-08-11 00:19:52
Also in: linux-nfs

Possibly related (same subject, not in this thread)

2021-08-10 · Re: Spurious instability with NFSoRDMA under moderate load · Timo Rothenpieler <hidden>
2021-08-10 · Re: Spurious instability with NFSoRDMA under moderate load · Chuck Lever III <chuck.lever@oracle.com>
2021-08-10 · Re: Spurious instability with NFSoRDMA under moderate load · Timo Rothenpieler <hidden>
2021-06-21 · Re: Spurious instability with NFSoRDMA under moderate load · Chuck Lever III <chuck.lever@oracle.com>
2021-06-21 · Re: Spurious instability with NFSoRDMA under moderate load · Timo Rothenpieler <hidden>

On Aug 10, 2021, at 8:07 PM, Timo Rothenpieler [off-list ref] wrote:

On 10.08.2021 23:40, Timo Rothenpieler wrote:

quoted

On 10.08.2021 19:17, Chuck Lever III wrote:

What I see in this data is that the server is reporting

    SEQ4_STATUS_CB_PATH_DOWN

and the client is attempting to recover (repeatedly) using
BIND_CONN_TO_SESSION. But apparently the recovery didn't
actually work, because the server continues to report a
callback path problem.

[1712389.125641] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
[1712389.129264] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!

[1712389.171953] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
[1712389.178361] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!

[1712389.195606] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
[1712389.203891] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!

I guess it's time to switch to tracing on the server side
to see if you can nail down why the server's callback
requests are failing. On your NFS server, run:

  # trace-cmd record -e nfsd -e sunrpc -e rpcgss -e rpcrdma

at roughly the same point during your test that you captured
the previous client-side trace.

I wonder if reverting 6820bf77864d on the server, to have an easier way to reproduce this state, would be worth it.
Cause it seems like the actual underlying issue is the inability of the NFS server (and/or client) to reestablish the backchannel if it gets disconnected for whatever reason?
Right now I already rebooted the client, and everything is working again, so I'll have to wait a potentially long time for this to happen again otherwise.

I actually ended up doing that, and sure enough, on 5.12.19, with 6820bf77864d reverted, it ends up getting stuck very quickly.
Using xfs_io with copy_range this time, to have a more isolated test case.

Though the trace logs I'm getting from that look different, see attached traces and rpcdebug enabled dmesg from both client and server.
There is no appearance of "nfs41_handle_sequence_flag_errors" or "nfs4_bind_conn_to_session" whatsoever, and it does not seem like it's trying to recover the timed out callback channel at all.

Thanks, I’ll take a look at the new information tomorrow.

So I'm not sure if the other issue that spuriously happens is related.
But there definitely is an issue with it not re-establishing the backchannel, so fixing one might fix the other as well.
<rpcdebug.tar.xz>

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help