Re: [PATCH v5 3/5] nfsd: rework refcounting in filecache

From: Chuck Lever III <chuck.lever@oracle.com>
Date: 2023-01-18 18:42:12

On Jan 18, 2023, at 12:18 PM, Jeff Layton [off-list ref] wrote:

On Wed, 2023-01-18 at 16:48 +0000, Shachar Kagan wrote:

quoted

On Wend, 2023-01-18 at 18:45 +0000, Chuck Lever III wrote:

quoted

On Tue, 2023-01-17 at 15:22 +0000, Chuck Lever III wrote:

quoted

On Jan 17, 2023, at 10:16 AM, Jason Gunthorpe [off-list ref] wrote:

On Tue, Nov 01, 2022 at 10:46:45AM -0400, Jeff Layton wrote:

quoted

The filecache refcounting is a bit non-standard for something 
searchable by RCU, in that we maintain a sentinel reference while 
it's hashed. This in turn requires that we have to do things differently in the "put"
depending on whether its hashed, which we believe to have led to races.

There are other problems in here too. nfsd_file_close_inode_sync 
can end up freeing an nfsd_file while there are still outstanding 
references to it, and there are a number of subtle ToC/ToU races.

Rework the code so that the refcount is what drives the lifecycle. 
When the refcount goes to zero, then unhash and rcu free the object.

With this change, the LRU carries a reference. Take special care 
to deal with it when removing an entry from the list.

Signed-off-by: Jeff Layton <jlayton@kernel.org>

Our test team is getting crashes that bisection pointed at this 
patch. It seems like there are multiple parallel crash reports so 
the whole thing is a mess to read:

[  875.548965] BUG: kernel NULL pointer dereference, address: 
00000000000000d0 [  875.548968] ------------[ cut here ]------------ 
[  875.548972] refcount_t: underflow; use-after-free.
[  875.548992] WARNING: CPU: 4 PID: 12145 at lib/refcount.c:28 
refcount_warn_saturate+0xd8/0xe0 [  875.549851] #PF: supervisor read 
access in kernel mode [  875.550158] Modules linked in:
[  875.550752] #PF: error_code(0x0000) - not-present page [  
875.551269]  nfsd [  875.551878] PGD 0 [  875.552069]  iptable_raw [  
875.552677] P4D 0 [  875.552824]  bonding mlx5_vfio_pci [  
875.553095] [  875.553255]  rdma_ucm ipip [  875.553525] Oops: 0000 
[#1] SMP [  875.553733]  tunnel4 [  875.553941] CPU: 0 PID: 12147 
Comm: nfsd Not tainted 6.1.0-rc7_ac3a2585f018 #1 [  875.554109]  
ip_gre ib_umad [  875.554517] Hardware name: QEMU Standard PC (Q35 + 
ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 
04/01/2014 [  875.554656]  nf_tables vfio_pci [  875.555508] RIP: 
0010:vfs_setlease+0x27/0x70 [  875.555695]  vfio_pci_core 
vfio_virqfd [  875.557015] Code: ff ff 90 0f 1f 44 00 00 41 54 49 89 
d4 55 48 89 fd 48 83 ec 10 48 85 d2 74 06 48 83 fe 02 75 1f 48 8b 45 
28 4c 89 e2 48 89 ef <48> 8b 80 d0 00 00 00 48 85 c0 74 2c 48 83 c4 
10 5d 41 5c ff e0 48 [  875.557209]  vfio_iommu_type1 [  875.557406] 
RSP: 0018:ffff88810378bdb0 EFLAGS: 00010246 [  875.557634]  mlx5_ib 
[  875.558446] [  875.558628]  vfio [  875.558862] RAX: 
0000000000000000 RBX: ffff88824866c000 RCX: ffff88810378bdd8 [  
875.559006]  ib_uverbs [  875.559092] RDX: 0000000000000000 RSI: 
0000000000000002 RDI: ffff88812560a200 [  875.559218]  ib_ipoib [  
875.559557] RBP: ffff88812560a200 R08: ffff8881da5ecf00 R09: 
ffffffff824064e0 [  875.559704]  mlx5_core [  875.560021] R10: 
0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [  
875.560165]  ip6_gre [  875.560488] R13: ffff8881da5ecf00 R14: 
ffff888110e62028 R15: ffff888110e621a0 [  875.560634]  gre [  
875.560959] FS:  0000000000000000(0000) GS:ffff88852c800000(0000) 
knlGS:0000000000000000 [  875.561108]  ip6_tunnel [  875.561432] CS:  
0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [  875.561554]  tunnel6 
[  875.561928] CR2: 00000000000000d0 CR3: 00000001ca27d001 CR4: 
0000000000372eb0 [  875.562084]  geneve [  875.562349] DR0: 
0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [  
875.562493]  nfnetlink_cttimeout [  875.562822] DR3: 
0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [  
875.562962]  openvswitch [  875.563292] Call Trace:
[  875.563298]  <TASK>
[  875.563503]  nsh
[  875.563839]  destroy_unhashed_deleg+0x58/0xc0 [nfsd]

We are aware of this failure mode. Actually this started well before 
that particular commit.

Our problem has been that no one has been able to provide a reliable 
reproducer, so we can't figure out why it's happening. If you have a 
way to reproduce this failure reliably, can you capture a vmcore or 
enable KASAN and get a little more information?

It's possible that this crash may be related to the problem that was fixed here:

  commit 0b3a551fa58b4da941efeb209b3770868e2eddd7
  Author: Jeff Layton [off-list ref]
  Date:   Thu Jan 5 14:55:56 2023 -0500

      nfsd: fix handling of cached open files in nfsd4_open codepath

Unfortunately, that hasn't trickled into v6.1 kernels so far.

This commit is in my working tree, but this commit doesn't fix the issue since I still face the crash.
We are working on v6.2-rc3

Thanks for testing it. That patch fixes a real bug, just not the one
you're hitting apparently.

If you're comfortable working with bleeding edge kernels, you may just
want to pull in Chuck's for-rc and for-next branches.

Stephen and I renamed these yesterday to nfsd-fixes and nfsd-next,
respectively.

nfsd-fixes was pulled into v6.2-rc yesterday, fyi.

Those have a few
other patches that I wouldn't expect to change this, but might still be
worth testing to see.

If it's happening regularly, you could also try disabling leases on the
machine, at the expense of some performance. We suspect this is related
to delegation handling, but we just haven't been able to nail it down
yet. If you do that, and it seems to fix it for you, let us know as that
would be an interesting datapoint.

Thanks!
-- 
Jeff Layton [off-list ref]

--
Chuck Lever

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help