RE: Zombie / Orphan open files
From: Andrew J. Romero <hidden>
Date: 2023-01-31 19:32:48
-----Original Message-----
From: Chuck Lever III <chuck.lever@oracle.com>

Almost. The protocol requires:

After the client reboots, when it opens its first file, the client does a SETCLIENTID or EXCHANGE_ID to establish its lease on the server. All OPEN and LOCK state is managed under the umbrella of that lease (and that includes all files that client is managing).

The client keeps the lease alive by renewing the lease every minute. If the client reboots (ie, does a subsequent SETCLIENTID or EXCHANGE_ID with a new boot verifier), the server has to purge all open file state for that client.

If the client fails to renew its lease, the server is free to do what it wants -- it can purge the client's lease immediately, or it can wait until conflicting opens or locks come from other clients and then purge some or all of that client's lease.

If the client can't or doesn't CLOSE that file, it will remain on the server until the client tells it (implicitly by not renewing or explicitly with a fresh ID) that the state is no longer needed; or until the server reboots and the client does not re-establish the OPEN state.
So, in general, this is true:

- A lease is not "issued" for every file opened
- A lease is not "issued" for every user running on an NFS-client host
- In general, one lease is issued / managed for each NFS-client host

(If this is true, my server vendor is probably not forgetting to do something they should be doing.)
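To check my own understanding of the above, here is a rough toy model of the bookkeeping as I read it. It is only a sketch, not real server code: the class, its method names, and the 90-second lease time are all made up for illustration, and the immediate purge of expired leases is just one of the policies the server is allowed to choose.

```python
# Toy model of per-host lease bookkeeping. Everything here is hypothetical:
# ToyServer, its methods, and LEASE_SECONDS are illustration only.

LEASE_SECONDS = 90  # assumed lease duration, not taken from any real server


class ToyServer:
    def __init__(self):
        # One entry per client host: {"verifier", "expires", "opens"}
        self.clients = {}

    def establish(self, host, boot_verifier, now):
        """Stand-in for SETCLIENTID / EXCHANGE_ID."""
        entry = self.clients.get(host)
        if entry is None or entry["verifier"] != boot_verifier:
            # New host, or same host with a new boot verifier (a reboot):
            # all old open state for that host is dropped.
            entry = {"verifier": boot_verifier, "opens": set()}
            self.clients[host] = entry
        entry["expires"] = now + LEASE_SECONDS

    def renew(self, host, now):
        self.clients[host]["expires"] = now + LEASE_SECONDS

    def open(self, host, path):
        self.clients[host]["opens"].add(path)

    def close(self, host, path):
        self.clients[host]["opens"].discard(path)

    def purge_expired(self, now):
        """One possible policy: drop expired leases immediately."""
        expired = [h for h, e in self.clients.items() if e["expires"] < now]
        for host in expired:
            del self.clients[host]

    def open_count(self, host):
        entry = self.clients.get(host)
        return len(entry["opens"]) if entry else 0


if __name__ == "__main__":
    server = ToyServer()
    server.establish("client-a", boot_verifier=1, now=0)
    for i in range(2000):
        server.open("client-a", "TestFile-%d" % i)

    # The host keeps renewing, so the opens stay even if nobody closes them.
    server.renew("client-a", now=60)
    server.purge_expired(now=100)
    print(server.open_count("client-a"))  # 2000

    # A reboot shows up as a new boot verifier and clears the old state.
    server.establish("client-a", boot_verifier=2, now=120)
    print(server.open_count("client-a"))  # 0
```

In this model the open state hangs off the host's lease rather than off any one user or file, so as long as the host keeps renewing, files that are never closed simply stay open on the server.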
But again, we need some way to confirm exactly how this is happening. Can you post your script, or capture client-server network traffic while the script does its thing?
The script is about as simple as "hello world":

import sys
import fileinput
import os.path
import re
import time

def main():
    StartID=int(raw_input("Enter Start ID: "))
    TestDir=os.path.normcase('/nashome/r/romero/stuckopentest/dataout')
    FPlist=[]

    # open 2000 files and leave them open
    for x in range(StartID, StartID+2000):
        TestFilePath=os.path.join(TestDir, "TestFile-" + str(x))
        print(TestFilePath)
        # open file append file pointer to list
        FPlist.append(open(TestFilePath,"w"))

    # sleep for greater than Krb ticket life time
    # 2000 files will be "stuck open" on the server
    time.sleep(60*60*24)

main()

NOTE:
I don't expect people on this list to debug my issue.
My reasons for posting:
- Determine if my NAS vendor might be accidentally
  not doing something they should be.
  (I now don't really think this is the case.)
- Determine if this is a known behavior common to all NFS implementations
  (Linux, etc.) and, if so, have you determine whether this is a problem that should be addressed
  in the spec and the implementations.