Re: [PATCH v3 2/4] nfsd: rework refcounting in filecache

From: Chuck Lever III <chuck.lever@oracle.com>
Date: 2022-10-28 21:23:54

On Oct 28, 2022, at 5:03 PM, Jeff Layton [off-list ref] wrote:

On Fri, 2022-10-28 at 20:39 +0000, Chuck Lever III wrote:

quoted

On Oct 28, 2022, at 4:13 PM, Jeff Layton [off-list ref] wrote:

On Fri, 2022-10-28 at 19:49 +0000, Chuck Lever III wrote:

quoted

On Oct 28, 2022, at 2:57 PM, Jeff Layton [off-list ref] wrote:

The filecache refcounting is a bit non-standard for something searchable
by RCU, in that we maintain a sentinel reference while it's hashed. This
in turn requires that we have to do things differently in the "put"
depending on whether its hashed, which we believe to have led to races.

There are other problems in here too. nfsd_file_close_inode_sync can end
up freeing an nfsd_file while there are still outstanding references to
it, and there are a number of subtle ToC/ToU races.

Rework the code so that the refcount is what drives the lifecycle. When
the refcount goes to zero, then unhash and rcu free the object.

With this change, the LRU carries a reference. Take special care to
deal with it when removing an entry from the list.

I can see a way of making this patch a lot cleaner. It looks like there's
a fair bit of renaming and moving of functions -- that can go in clean
up patches before doing the heavy lifting.

Is this something that's really needed? I'm already basically rewriting
this code. Reshuffling the old code around first will take a lot of time
and we'll still end up with the same result.

I did exactly this for the nfs4_file rhash changes. It took just a couple
of hours. The outcome is that you can see exactly, in the final patch in
that series, how the file_hashtbl -> rhltable substitution is done.

Making sure each of the changes is more or less mechanical and obvious
is a good way to ensure no-one is doing something incorrect. That's why
folks like to use cocchinelle.

Trust me, it will be much easier to figure out in a year when we have
new bugs in this code if we split up this commit just a little.

Sigh. It seems pointless to rearrange code that is going to be replaced,
but I'll do it. It'll probably be next week though.

quoted

I'm still not sold on the idea of a synchronous flush in nfsd_file_free().

I think that we need to call this there to ensure that writeback errors
are handled. I worry that if try to do this piecemeal, we could end up
missing errors when they fall off the LRU.

quoted

That feels like a deadlock waiting to happen and quite difficult to
reproduce because I/O there is rarely needed. It could help to put a
might_sleep() in nfsd_file_fsync(), at least, but I would prefer not to
drive I/O in that path at all.

I don't quite grok the potential for a deadlock here. nfsd_file_free
already has to deal with blocking activities due to it effective doing a
close(). This is just another one. That's why nfsd_file_put has a
might_sleep in it (to warn its callers).

Currently nfsd_file_put() calls nfsd_file_flush(), which calls
vfs_fsync(). That can't be called while holding a spinlock.

nfsd_file_free (and hence, nfsd_file_put) can never be called with a
spinlock held. That's true even before this set. Both functions can
block.

Dead horse: in the current code base, nfsd_file_free() can be called
via nfsd_file_close_inode_sync(), which is an API external to
filecache.c. But, I agree now that both functions can block.

quoted

What's the deadlock scenario you envision?

OK, filp_close() does call f_op->flush(). So we have this call
here and there aren't problems today. I still say this is a
problem waiting to occur, but I guess I can live with it.

If filp_close() already calls f_op->flush(), why do we need an
explicit vfs_fsync() there?

->flush doesn't do anything on some filesystems, and while it does
return an error code today, it should have been a void return function.
The error from it can't be counted on.

OK. The goal is detecting writeback errors, and ->flush is not a
reliable way to do that.

vfs_fsync is what ensures that everything gets written back and returns
info about writeback errors. I don't see a way around calling it at
least once before we close a file, if we want to keep up the "trick" of
resetting the verifier when we see errors.

Fair enough, but maybe it's not worth a complexity and performance
impact. Writeback errors are supposed to be rare compared to I/O.

If we can find another way (especially, a more reliable way) to
monitor for writeback errors, that might be an improvement over
using vfs_fsync, which seems heavyweight no matter how you cut it.

IMO, the goal ought to be to ensure that we don't end up having to do
any writeback when we get to GC'ing it, and that's what patch 4/4 should
do.

Understood. I don't disagree with the goal, since filp_close()
won't be able to avoid a potentially costly ->flush in some cases.

The issue is even a SYNC_NONE flush has appreciable cost.


--
Chuck Lever

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help