Thread (26 messages) 26 messages, 3 authors, 2022-11-01

Re: [PATCH v3 3/4] nfsd: close race between unhashing and LRU addition

From: Chuck Lever III <chuck.lever@oracle.com>
Date: 2022-10-31 02:51:25

On Oct 30, 2022, at 5:45 PM, NeilBrown [off-list ref] wrote:

On Sat, 29 Oct 2022, Chuck Lever III wrote:
quoted
quoted
On Oct 28, 2022, at 2:57 PM, Jeff Layton [off-list ref] wrote:

The list_lru_add and list_lru_del functions use list_empty checks to see
whether the object is already on the LRU. That's fine in most cases, but
we occasionally repurpose nf_lru after unhashing. It's possible for an
LRU removal to remove it from a different list altogether if we lose a
race.
Can that issue be resolved by simply adding a "struct list_head nf_dispose"
field? That might be more straightforward than adding conditional logic.

quoted
I've never seen that happen. lru field re-use is actually used in other
places in the kernel. Shouldn't we try to find and fix such races?

Wasn't the idea to reduce the complexity of nfsd_file_put ?
I think the nfsd filecache is different from those "other places"
because of nfsd_file_close_inode() and related code.  If I understand
correctly, nfsd can remove a file from the filecache while it is still
in use.
Not sure about that; I think nfsd_file_close_inode() is invoked only
when a file is deleted. I could be remembering incorrectly, but that
seems like a really difficult race to hit.

In this case it needs to be unhashed and removed from the lru -
and then added to a dispose list - while it might still be active for
some IO request.

Prior to Commit 8d0d254b15cc ("nfsd: fix nfsd_file_unhash_and_dispose")
unhash_and_dispose() wouldn't add to the dispose list unless the
refcount was one.  I'm not sure how racy this test was, but it would
mean that it is unlikely for an nfsd_file to be added to the dispose list
if it was still in use.

After that commit it seems more likely that a nfsd_file will be added to
a dispose list while it is in use.
After it's linked to a dispose list via nf_lru, list_lru_add won't put
it on the LRU -- it becomes a no-op because nf_lru is now !empty. I
think we would have seen LRU corruption pretty quickly. Re-reading
Jeff's patch description, that might not be the problem he's trying
to address here.

But, it would be easy to do some reality testing. I think you could
add a WARN_ON or tracepoint in nfsd_file_free() or somewhere in the
dispose-list path to catch an in-use nfsd_file?

As we add/remove things to a dispose list without a lock, we need to be
careful that no other thread will add the nfsd_file to an lru at the
same time.  refcounts will no longer provide that protection.  Maybe we
should restore the refcount protection, or maybe we need a bit as Jeff
has added here.
I'm not opposed to defensive changes, in general. This one seems to be
adding significant complexity without a clear hazard. I'd like to have
a better understanding of exactly what misbehavior is being addressed.


--
Chuck Lever


Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help