Re: [PATCH] [fuse] alloc_page nofs avoid deadlock

From: Miklos Szeredi <miklos@szeredi.hu>
Date: 2021-09-24 07:52:49
Also in: lkml

On Fri, 24 Sept 2021 at 05:52, Ed Tsai [off-list ref] wrote:

On Wed, 2021-08-18 at 17:24 +0800, Miklos Szeredi wrote:

On Tue, 13 Jul 2021 at 04:42, Ed Tsai [off-list ref] wrote:

On Tue, 2021-06-08 at 17:30 +0200, Miklos Szeredi wrote:

On Thu, 3 Jun 2021 at 14:52, chenguanyou <chenguanyou9338@gmail.com> wrote:

ABA deadlock

PID: 17172 TASK: ffffffc0c162c000 CPU: 6 COMMAND: "Thread-21"
0 [ffffff802d16b400] __switch_to at ffffff8008086a4c
1 [ffffff802d16b470] __schedule at ffffff80091ffe58
2 [ffffff802d16b4d0] schedule at ffffff8009200348
3 [ffffff802d16b4f0] bit_wait at ffffff8009201098
4 [ffffff802d16b510] __wait_on_bit at ffffff8009200a34
5 [ffffff802d16b5b0] inode_wait_for_writeback at ffffff800830e1e8
6 [ffffff802d16b5e0] evict at ffffff80082fb15c
7 [ffffff802d16b620] iput at ffffff80082f9270
8 [ffffff802d16b680] dentry_unlink_inode at ffffff80082f4c90
9 [ffffff802d16b6a0] __dentry_kill at ffffff80082f1710
10 [ffffff802d16b6d0] shrink_dentry_list at ffffff80082f1c34
11 [ffffff802d16b750] prune_dcache_sb at ffffff80082f18a8
12 [ffffff802d16b770] super_cache_scan at ffffff80082d55ac
13 [ffffff802d16b860] shrink_slab at ffffff8008266170
14 [ffffff802d16b900] shrink_node at ffffff800826b420
15 [ffffff802d16b980] do_try_to_free_pages at ffffff8008268460
16 [ffffff802d16ba60] try_to_free_pages at ffffff80082680d0
17 [ffffff802d16bbe0] __alloc_pages_nodemask at ffffff8008256514
18 [ffffff802d16bc60] fuse_copy_fill at ffffff8008438268
19 [ffffff802d16bd00] fuse_dev_do_read at ffffff8008437654
20 [ffffff802d16bdc0] fuse_dev_splice_read at ffffff8008436f40
21 [ffffff802d16be60] sys_splice at ffffff8008315d18
22 [ffffff802d16bff0] __sys_trace at ffffff8008084014

PID: 9652 TASK: ffffffc0c9ce0000 CPU: 4 COMMAND: "kworker/u16:8"
0 [ffffff802e793650] __switch_to at ffffff8008086a4c
1 [ffffff802e7936c0] __schedule at ffffff80091ffe58
2 [ffffff802e793720] schedule at ffffff8009200348
3 [ffffff802e793770] __fuse_request_send at ffffff8008435760
4 [ffffff802e7937b0] fuse_simple_request at ffffff8008435b14
5 [ffffff802e793930] fuse_flush_times at ffffff800843a7a0
6 [ffffff802e793950] fuse_write_inode at ffffff800843e4dc
7 [ffffff802e793980] __writeback_single_inode at ffffff8008312740
8 [ffffff802e793aa0] writeback_sb_inodes at ffffff80083117e4
9 [ffffff802e793b00] __writeback_inodes_wb at ffffff8008311d98
10 [ffffff802e793c00] wb_writeback at ffffff8008310cfc
11 [ffffff802e793d00] wb_workfn at ffffff800830e4a8
12 [ffffff802e793d90] process_one_work at ffffff80080e4fac
13 [ffffff802e793e00] worker_thread at ffffff80080e5670
14 [ffffff802e793e60] kthread at ffffff80080eb650

The issue is real.

The fix, however, is not the right one.  The fundamental problem is
that fuse_write_inode() blocks on a request to userspace.

This is the same issue that fuse_writepage/fuse_writepages face.  In
that case the solution was to copy the page contents to a temporary
buffer and return immediately as if the writeback already completed.

Something similar needs to be done here: send the FUSE_SETATTR request
asynchronously and return immediately from fuse_write_inode().  The
tricky part is to make sure that multiple time updates for the same
inode aren't mixed up...

Thanks,
Miklos
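
A minimal user-space sketch of the asynchronous idea described above.
It is not kernel code and is not taken from any posted patch.  It
assumes one possible way of keeping time updates from being mixed up:
a per-inode generation counter that is snapshotted when a request is
queued and compared again at completion.  All names (struct
time_state, time_update, time_flush_async, time_flush_done) are
hypothetical and do not exist in fs/fuse.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of per-inode timestamp state; not a fs/fuse structure. */
struct time_state {
	int64_t  mtime;      /* latest in-memory timestamp */
	uint64_t gen;        /* bumped on every timestamp update */
	bool     dirty;      /* an update has not reached the daemon yet */
	bool     in_flight;  /* one asynchronous request is outstanding */
	uint64_t sent_gen;   /* generation captured by the in-flight request */
	int64_t  sent_mtime; /* timestamp snapshot carried by that request */
};

/* A write or utimes-style operation changes the timestamp. */
static void time_update(struct time_state *s, int64_t mtime)
{
	s->mtime = mtime;
	s->gen++;
	s->dirty = true;
}

/*
 * Writeback side: snapshot the state and return without waiting.
 * Returns true if a request was queued, false if there is nothing to
 * send or a request is already outstanding.
 */
static bool time_flush_async(struct time_state *s)
{
	if (!s->dirty || s->in_flight)
		return false;
	s->sent_gen = s->gen;
	s->sent_mtime = s->mtime;
	s->in_flight = true;
	return true;
}

/*
 * Completion side: clear the dirty flag only if the request succeeded
 * and no newer update arrived while it was in flight.
 * Returns true if the caller should queue another flush.
 */
static bool time_flush_done(struct time_state *s, bool ok)
{
	assert(s->in_flight);
	s->in_flight = false;
	if (ok && s->sent_gen == s->gen)
		s->dirty = false;
	return s->dirty;
}
```

If a newer update arrives while a request is in flight,
time_flush_done() reports that the inode is still dirty, so the caller
queues another request rather than clearing state the daemon never
saw.  A failed request leaves the inode dirty in the same way.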

Dear Szeredi,

The writeback thread calls fuse_write_inode() and waits for the user
daemon to complete this write-inode request. The user daemon will
alloc_page() after taking this request, and a deadlock could happen
when we try to shrink the dentry list under memory pressure.

We (Mediatek) are glad to work on this issue for mainline and also
LTS. So another problem is that we should not change the protocol or
feature for stable kernels.

Using GFP_NOFS | __GFP_HIGHMEM can really avoid this by skipping the
dentry shrinker. It works but degrades the alloc_page success rate. In
a more fundamental way, we could cache the contents and return
immediately. But how do we ensure the request will be done
successfully, e.g., always retry if it fails from the daemon?

Key is where the dirty metadata is flushed.  To prevent deadlock
it must not be flushed from memory reclaim, so must make sure that it
is flushed on close(2) and munmap(2) and not dirtied after that.

I'm working on this currently and hope to get it ready for the next
merge window.

Thanks,
Miklos

Hi Miklos,

I'm not sure whether it has already been resolved in mainline.
If it is still WIP, please cc me on future emails.

Hi,

This is taking a bit longer, unfortunately, but I already have
something in testing and currently cleaning it up for review.  Hope to
post a series today or early next week.

Thanks,
Miklos
