Thread: 45 messages, 6 authors, 2023-03-02

RE: Zombie / Orphan open files

From: Andrew J. Romero <hidden>
Date: 2023-01-31 19:32:48

-----Original Message-----
From: Chuck Lever III <chuck.lever@oracle.com>

Almost. The protocol requires:

After the client reboots, when it opens its first file, the client
does a SETCLIENTID or EXCHANGE_ID to establish its lease on the
server. All OPEN and LOCK state is managed under the umbrella of
that lease (and that includes all files that client is managing).
The client keeps the lease alive by renewing the lease every minute.

If the client reboots (ie, does a subsequent SETCLIENTID or
EXCHANGE_ID with a new boot verifier), the server has to purge all
open file state for that client.

If the client fails to renew its lease, the server is free to do
what it wants -- it can purge the client's lease immediately, or
it can wait until conflicting opens or locks come from other
clients and then purge some or all of that client's lease.

If the client can't or doesn't CLOSE that file, it will remain
on the server until the client tells it (implicitly by not
renewing or explicitly with a fresh ID) that the state is no
longer needed; or until the server reboots and the client does
not re-establish the OPEN state.

So, in general, this is true:
  - A lease is not "issued" for every file opened
  - A lease is not "issued" for every user running on an NFS-client host
  - In general, one lease is issued / managed for each NFS-client host
(If this is true, my server vendor is probably not forgetting to do
something they should be doing.)
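
As a rough illustration of the model summarized above, here is a minimal
sketch of the bookkeeping, assuming one lease per client host with all
OPEN state hanging off it. The LeaseTable class, its method names, and the
90-second lease time are hypothetical and only for illustration; this is
not how any particular server implements it.

```python
import time


class LeaseTable:
    """Toy model of per-client lease bookkeeping (not real server code)."""

    def __init__(self, lease_seconds=90):
        # Assumed lease time for illustration; real servers differ.
        self.lease_seconds = lease_seconds
        # client_id -> {"verifier": ..., "opens": set(), "renewed": ...}
        self.clients = {}

    def set_client_id(self, client_id, boot_verifier, now=None):
        """Model SETCLIENTID / EXCHANGE_ID from a client host."""
        now = time.monotonic() if now is None else now
        entry = self.clients.get(client_id)
        if entry is None or entry["verifier"] != boot_verifier:
            # New client, or same client with a new boot verifier:
            # all open state from the previous incarnation is purged.
            entry = {"verifier": boot_verifier, "opens": set(), "renewed": now}
            self.clients[client_id] = entry
        else:
            entry["renewed"] = now
        return entry

    def open(self, client_id, path):
        # Every OPEN is recorded under the host's single lease.
        self.clients[client_id]["opens"].add(path)

    def close(self, client_id, path):
        self.clients[client_id]["opens"].discard(path)

    def renew(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        self.clients[client_id]["renewed"] = now

    def purge_expired(self, now=None):
        """One policy the server may choose: purge as soon as a lease lapses."""
        now = time.monotonic() if now is None else now
        expired = [cid for cid, entry in self.clients.items()
                   if now - entry["renewed"] > self.lease_seconds]
        for cid in expired:
            del self.clients[cid]
        return expired


# One host, one lease, many opens under it.
demo = LeaseTable()
demo.set_client_id("hostA", boot_verifier=1, now=0)
for n in range(2000):
    demo.open("hostA", "TestFile-%d" % n)
demo.renew("hostA", now=60)
print(len(demo.clients), len(demo.clients["hostA"]["opens"]))
```

In this model, as long as anything on the host keeps renewing the lease,
opens that were never CLOSEd stay on the server, regardless of which user
created them.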

But again, we need some way to confirm exactly how this is
happening. Can you post your script, or capture client-server
network traffic while the script does its thing?

The script is about as simple as "hello world":

import os.path
import time

def main():

    # first numeric suffix for this run's file names
    StartID = int(input("Enter Start ID: "))

    TestDir = os.path.normcase('/nashome/r/romero/stuckopentest/dataout')

    # keep every file object referenced so nothing is closed implicitly
    FPlist = []

    # open 2000 files and leave them open
    for x in range(StartID, StartID + 2000):

        TestFilePath = os.path.join(TestDir, "TestFile-" + str(x))
        print(TestFilePath)

        # open the file and append the file object to the list
        FPlist.append(open(TestFilePath, "w"))

    # sleep for longer than the Kerberos ticket lifetime;
    # the 2000 files will be "stuck open" on the server
    time.sleep(60*60*24)


main()
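
For comparison, a control run that closes every file explicitly can help
separate "the client never sent CLOSE" from other causes. The sketch below
is one way to do that; hold_then_close and its parameters are hypothetical
names, not part of the original script. It assumes that closing the file
on the client normally leads to a CLOSE on the server, which may not hold
in every configuration (delegations, for example, can defer it).

```python
import contextlib
import os
import time


def hold_then_close(test_dir, start_id, count, hold_seconds):
    """Open `count` files, hold them open, then close each one explicitly."""
    handles = []
    with contextlib.ExitStack() as stack:
        for x in range(start_id, start_id + count):
            path = os.path.join(test_dir, "TestFile-" + str(x))
            # ExitStack closes every registered file when the block exits,
            # even if a later open() raises.
            handles.append(stack.enter_context(open(path, "w")))
        time.sleep(hold_seconds)
    # All files are closed at this point.
    return handles
```

Running it with the same directory and a hold time shorter than the
ticket lifetime gives a baseline to compare against on the server.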


NOTE:

I don't expect people on this list to debug my issue.

My reasons for posting:

- Determine if my NAS vendor might be accidentally
  not doing something they should be.
  (I now don't really think this is the case.)


- Determine if this is a known behavior common to all NFS implementations
  (Linux, etc.) and, if so, have you determine whether this is a problem
  that should be addressed in the spec and the implementations.
























