Thread (2 messages) 2 messages, 2 authors, 2021-08-10

Re: Spurious instability with NFSoRDMA under moderate load

From: Chuck Lever III <chuck.lever@oracle.com>
Date: 2021-08-10 17:17:50
Also in: linux-nfs

Possibly related (same subject, not in this thread)

On Aug 10, 2021, at 10:06 AM, Timo Rothenpieler [off-list ref] wrote:

On 10.08.2021 14:49, Timo Rothenpieler wrote:
quoted
Ok, so this just happened again for the first time since upgrading to 5.12.
Exact same thing, except this time no error cqe was dumped simultaneously (It still appeared in dmesg, but over a week before the issue showed up). So I'll assume it's unrelated to this issue.
I had no issues while running 5.12.12 and below. Recently (14 days ago or so) updated to 5.12.19, and now it's happening again.
Unfortunately, with how rarely this issue happens, this can either be a regression between those two versions, or it was still present all along and just never triggered for several months.
Makes me wonder if this is somehow related to the problem described in "NFS server regression in kernel 5.13 (tested w/ 5.13.9)". But the pattern of the issue does not look all that similar, given that for me, the hangs never recover, and I have RDMA in the mix.
Here's some trace and debug data I collected while it was in a state where everything using copy_range would get stuck.
The file(s) in question is named "tmp-copy-*", with a bunch of numbers at the end.

This also seems to be entirely a client side issue, since other NFS clients on the same server work just fine the whole time, and a reboot of that one affected client makes it work normally again, without touching the server.

In this instance right now, I was even unable to cleanly reboot the client machine, since it got completely stuck, seemingly unable to unmount the NFS shares, getting stuck the same way other operations would.

What confuses me the most is that while it is in this messed up state, it generally still works fine, and it's only a few very specific operations (only one I know of for sure is copy_range) get stuck.
The processes also aren't unkillable. a SIGTERM or SIGKILL will end them, and also flushes NFS. Captured that in the dmesg_output2.
What I see in this data is that the server is reporting

   SEQ4_STATUS_CB_PATH_DOWN

and the client is attempting to recover (repeatedly) using
BIND_CONN_TO_SESSION. But apparently the recovery didn't
actually work, because the server continues to report a
callback path problem.

[1712389.125641] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
[1712389.129264] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!

[1712389.171953] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
[1712389.178361] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!

[1712389.195606] nfs41_handle_sequence_flag_errors: "10.110.10.200" (client ID 6765f8600a675814) flags=0x00000001
[1712389.203891] nfs4_bind_conn_to_session: bind_conn_to_session was successful for server 10.110.10.200!

I guess it's time to switch to tracing on the server side
to see if you can nail down why the server's callback
requests are failing. On your NFS server, run:

 # trace-cmd record -e nfsd -e sunrpc -e rpcgss -e rpcrdma

at roughly the same point during your test that you captured
the previous client-side trace.

--
Chuck Lever


Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help