Re: Zombie / Orphan open files

From: Chuck Lever III <chuck.lever@oracle.com>
Date: 2023-01-31 18:06:13

On Jan 31, 2023, at 11:59 AM, Andrew J. Romero [off-list ref] wrote:

-----Original Message-----
From: Chuck Lever III <chuck.lever@oracle.com>

On Jan 31, 2023, at 9:42 AM, Andrew J. Romero [off-list ref] wrote:

In a large campus environment, usage of the relevant memory pool will eventually get so
high that a server-side reboot will be needed.

The above is sticking with me a bit.

Rebooting the server should force clients to re-establish state.

Are they not re-establishing open file state for users whose
ticket has expired?

I would think each client would re-establish
state for those open files anyway, and the server would be in the
same overcommitted state it was in before it rebooted.


When the number of opens gets close to the limit that would result in
a disruptive NFSv4 service interruption (currently the limit is 128K open files),
I do the reboot (actually, I transfer the affected NFS serving resource
from one NAS cluster node to the other; based on experience, this
is like a 99.9% "non-disruptive reboot" of the affected NFS serving resource).

Before the resource transfer there will be ~126K open files
(from the NAS perspective).
0.1 seconds after the resource transfer there will be
close to zero files open. Within a few seconds there will
be ~2000, and within a few minutes there will be ~2100.
During the rest of the day I only see a slow rise in the average number
of opens, to maybe 2200. (My take is that ~2100 files were "active opens" before
and after the resource transfer; the rest of the 126K opens were zombies
that the clients were no longer using.)

That's not the way state recovery works. Clients will reopen only
the files that are still in use. If the clients don't open the
"zombie" files again, then I'm fairly certain the applications
have already closed those files.

In other words, the server might have an internal resource leak
instead.
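
The recovery argument above can be sketched as a toy model. This is only an illustration, not how any real NFS client or server is implemented: the class names, the `server_releases_state` flag, and the file counts are all hypothetical. It shows that if clients reclaim only the files they still hold open, any opens that vanish across a restart were already closed from the client's point of view, so the surplus must have been state the server failed to release (for whatever reason).

```python
# Toy model only. All names and numbers are hypothetical illustrations.

class ToyServer:
    def __init__(self):
        # (client, path) pairs the server believes are open
        self.open_state = set()

    def open(self, client, path):
        self.open_state.add((client, path))

    def close(self, client, path):
        self.open_state.discard((client, path))

    def restart(self):
        # A reboot or failover discards open state; clients must reclaim it.
        self.open_state.clear()


class ToyClient:
    def __init__(self, name, server):
        self.name = name
        self.server = server
        # Files that applications on this client still hold open
        self.in_use = set()

    def open(self, path):
        self.in_use.add(path)
        self.server.open(self.name, path)

    def close(self, path, server_releases_state=True):
        # The application is finished with the file either way.
        self.in_use.discard(path)
        # False models a server-side leak of unspecified cause.
        if server_releases_state:
            self.server.close(self.name, path)

    def recover(self):
        # State recovery: reopen only what is still in use locally.
        for path in self.in_use:
            self.server.open(self.name, path)


def simulate(total=1000, still_in_use=20):
    server = ToyServer()
    client = ToyClient("client-a", server)
    for i in range(total):
        client.open(f"/export/file{i}")
    # Applications close most files, but the server keeps the state.
    for i in range(still_in_use, total):
        client.close(f"/export/file{i}", server_releases_state=False)
    before = len(server.open_state)
    server.restart()
    client.recover()
    after = len(server.open_state)
    return before, after


before, after = simulate()
print(f"opens before restart: {before}, after recovery: {after}")
```

In this sketch the count drops to the number of files still in use, which is the same shape as the ~126K to ~2100 drop reported above.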

In 4-6 months
the number of opens from the NAS perspective will slowly
creep back up to the limit.
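
For a sense of scale, the figures quoted in this thread imply a rough leak rate. The sketch below assumes linear growth from about 2200 opens to about 126K opens over 4 to 6 months of 30 days each; the linear growth and the 30-day month are assumptions for illustration, not measurements from the NAS.

```python
def leak_rate_per_day(start_opens, end_opens, months, days_per_month=30):
    """Average growth in open count per day, assuming linear growth."""
    return (end_opens - start_opens) / (months * days_per_month)


# Figures taken from the thread: ~2200 opens after settling, ~126K before transfer.
fast = leak_rate_per_day(2200, 126_000, months=4)
slow = leak_rate_per_day(2200, 126_000, months=6)
print(f"roughly {slow:.0f} to {fast:.0f} leaked opens per day")
```

Under these assumptions the leak is on the order of several hundred to about a thousand opens per day, which may help when looking for a matching pattern in client activity.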

We will need to have a better understanding of where the leaks
actually come from. You have provided one way that an open leak
can happen, but that way doesn't line up with the evidence you
have here. So I agree that something is amiss, but more analysis
is necessary.

What release of the Linux kernel is your NAS device running?


--
Chuck Lever
