Thread (45 messages), 6 authors, 2023-03-02

Re: Zombie / Orphan open files

From: Jeff Layton <jlayton@kernel.org>
Date: 2023-01-31 22:28:28

On Tue, 2023-01-31 at 17:14 -0500, Olga Kornievskaia wrote:
On Tue, Jan 31, 2023 at 2:55 PM Andrew J. Romero [off-list ref] wrote:
What you are describing sounds like a bug in a system (be it client or
server). There is state that the client thought it closed but that the
server is still keeping.
Hi Olga

Based on my simple test-script experiment,
here's a summary of what I believe is happening:

1. An interactive user starts a process that opens a file or multiple files

2. A disruption that prevents
   NFS-client <-> NFS-server communication
   occurs while the file is open. This could be due to
   having the file open a long time or due to opening the file
   too close to the time of disruption.

( I believe the most common "disruption" is
  credential expiration )

3. The user's process terminates before the disruption
   is cleared (or, stated another way, the disruption is not cleared
   until after the user process terminates).

   At the time the user process terminates, the process
   cannot tell the server to close the server-side file state.

   After the process terminates, nothing will ever tell the server
   to close the files. The now-zombie open files will continue to
   consume server-side resources.

   In environments with many users, the problem is significant.

My reasons for posting are not to have your team help troubleshoot my
specific issue (that would be quite rude). They are:

- Determine if my NAS vendor might be accidentally
  not doing something they should be.
  (I now don't really think this is the case.)
It's hard to say who's at fault here without having some more info
like tracepoints or network traces.
- Determine if this is a known behavior common to all NFS implementations
  (Linux, etc.) and, if so, have your team determine if this is a problem
  that should be addressed in the spec and the implementations.
What you describe, having different views of state on the client
and server, is not a known common behaviour.

I have tried it on my Kerberos setup.
I got a 5-minute ticket.
As a user, I opened a file in a process that then went to sleep.
My user credentials expired (after 5 minutes). I verified that by
doing an "ls" on a mounted filesystem, which resulted in a permission
denied error.
Then I killed the application that had the file open. This resulted
in an NFS CLOSE being sent to the server using the machine's gss
context (which is the default behaviour of the linux client regardless
of whether or not the user's credentials are valid).

Basically, as far as I can tell, a linux client can handle cleaning up
state when the user's credentials have expired.
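
A test application of the kind described above could look roughly like
the following. This is only a sketch: the helper name hold_open() is
hypothetical, and the path passed to it is assumed to be a file on the
Kerberos-secured NFS mount. The idea is to open the file, sleep past
the ticket lifetime, and then either let it close normally or kill the
process so the kernel releases the file on its behalf.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: open `path` read-only, keep it open for
 * `seconds`, then close it.
 * Returns 0 on success, -1 if open() or close() fails.
 */
int hold_open(const char *path, unsigned int seconds)
{
    int fd = open(path, O_RDONLY);

    if (fd < 0) {
        perror("open");
        return -1;
    }

    /* sleep() returns the unslept time if a handled signal interrupts it. */
    while (seconds > 0)
        seconds = sleep(seconds);

    if (close(fd) < 0) {
        perror("close");
        return -1;
    }
    return 0;
}
```

Calling hold_open() from main() with a sleep longer than the ticket
lifetime (for example 600 seconds against a 5-minute ticket), and
killing the process while it sleeps, should exercise the same path;
a network trace on the client would then show whether a CLOSE goes out.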
That's pretty much what I expected from looking at the code. I think
this is done via the call to nfs4_state_protect. That calls:

        if (test_bit(sp4_mode, &clp->cl_sp4_flags)) {
                msg->rpc_cred = rpc_machine_cred();
                ...
        }

Could it be that cl_sp4_flags doesn't have NFS_SP4_MACH_CRED_CLEANUP set
on his clients? AFAICT, that comes from the server. It also looks like
cl_sp4_flags may not get set on an NFSv4.0 mount.
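
To make the question concrete, here is a small userspace model of that
decision. It is not the kernel code: the struct, enum names and bit
number are made up for illustration, and it assumes that when the flag
is clear the RPC simply keeps the user credential it already had.

```c
#include <stdbool.h>

/* Hypothetical bit number; the real kernel value is not assumed here. */
enum { MODEL_SP4_MACH_CRED_CLEANUP = 1 };

enum cred_kind { CRED_USER, CRED_MACHINE };

/* Stand-in for the per-client flags word (cl_sp4_flags in the kernel). */
struct client_model {
    unsigned long sp4_flags;
};

/* Userspace equivalent of testing a single bit in a flags word. */
static bool flag_is_set(unsigned long flags, unsigned int bit)
{
    return ((flags >> bit) & 1UL) != 0;
}

/*
 * Pick the credential for a cleanup operation such as CLOSE:
 * the machine credential if the state-protection bit is set,
 * otherwise the (possibly expired) user credential.
 */
enum cred_kind pick_cleanup_cred(const struct client_model *clp,
                                 unsigned int sp4_mode)
{
    if (flag_is_set(clp->sp4_flags, sp4_mode))
        return CRED_MACHINE;
    return CRED_USER;
}
```

In this model a client whose flags word was never populated always
falls through to CRED_USER, which is the case where an expired ticket
would leave the CLOSE with no usable credential.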

Olga, can you test that with a v4.0 mount?
-- 
Jeff Layton [off-list ref]