Re: [RFC PATCH] fs: ext4: don't trap kswapd and allocating tasks on ext4 inode IO

From: Jan Kara <jack@suse.cz>
Date: 2017-06-12 15:19:34
Also in: linux-fsdevel, lkml

On Mon 12-06-17 10:37:27, Johannes Weiner wrote:

On Mon, Jun 12, 2017 at 10:09:57AM +0200, Jan Kara wrote:

quoted

On Tue 16-05-17 18:03:37, Jan Kara wrote:

quoted

On Tue 16-05-17 11:41:05, Johannes Weiner wrote:

quoted

On Tue, May 16, 2017 at 04:36:45PM +0200, Jan Kara wrote:

quoted

On Mon 15-05-17 11:46:34, Johannes Weiner wrote:

quoted

We have observed across several workloads situations where kswapd and
direct reclaimers get stuck in the inode shrinker of the ext4 / mount,
causing allocation latencies across tasks in the system, while there
are dozens of gigabytes of clean page cache covering multiple disks.

The stack traces of such an instance looks like this:

[<ffffffff812b3225>] jbd2_log_wait_commit+0x95/0x110
[<ffffffff812b4f29>] jbd2_complete_transaction+0x59/0x90
[<ffffffff812668da>] ext4_evict_inode+0x2da/0x480
[<ffffffff811f2230>] evict+0xc0/0x190
[<ffffffff811f2339>] dispose_list+0x39/0x50
[<ffffffff811f323b>] prune_icache_sb+0x4b/0x60
[<ffffffff811dba71>] super_cache_scan+0x141/0x190
[<ffffffff8116e755>] shrink_slab+0x235/0x440
[<ffffffff81172b48>] shrink_zone+0x268/0x2d0
[<ffffffff81172f04>] do_try_to_free_pages+0x164/0x410
[<ffffffff81173265>] try_to_free_pages+0xb5/0x160
[<ffffffff811656b6>] __alloc_pages_nodemask+0x636/0xb30
[<ffffffff811acac8>] alloc_pages_current+0x88/0x120
[<ffffffff816d4e46>] skb_page_frag_refill+0xc6/0xf0
[<ffffffff816d4e8d>] sk_page_frag_refill+0x1d/0x80
[<ffffffff8173f86b>] tcp_sendmsg+0x28b/0xb10
[<ffffffff81769727>] inet_sendmsg+0x67/0xa0
[<ffffffff816d0488>] sock_sendmsg+0x38/0x50
[<ffffffff816d0518>] sock_write_iter+0x78/0xd0
[<ffffffff811d774e>] do_iter_readv_writev+0x5e/0xa0
[<ffffffff811d8468>] do_readv_writev+0x178/0x210
[<ffffffff811d871c>] vfs_writev+0x3c/0x50
[<ffffffff811d8782>] do_writev+0x52/0xd0
[<ffffffff811d9810>] SyS_writev+0x10/0x20
[<ffffffff81002910>] do_syscall_64+0x50/0xa0
[<ffffffff817eed3c>] return_from_SYSCALL_64+0x0/0x6a
[<ffffffffffffffff>] 0xffffffffffffffff

The inode shrinker has provisions to skip any inodes that require
writeback, to avoid tarpitting the entire system behind a single
object when there are many other pools to recycle memory from. But
that logic doesn't cover the situation where an ext4 inode is clean
but journaled and tied to a commit that yet needs to hit the platter.

Add a superblock operation that lets the generic inode shrinker query
the filesystem whether evicting a given inode will require any IO; add
an ext4 implementation that checks whether the journal is caught up to
the commit id associated with the inode.

Fixes: 2d859db3e4a8 ("ext4: fix data corruption in inodes with journalled data")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

OK. I have to say I'm somewhat surprised you use data journalling on some
of your files / filesystems but whatever - maybe these are long symlink
after all which would make sense.

The filesystem is actually mounted data=ordered and we didn't catch
anyone in userspace enabling journaling on individual inodes. So we
assumed this must be from symlinks.

OK.

quoted

And I'm actually doubly surprised you can see these stack traces as
these days inode_lru_isolate() checks inode->i_data.nrpages and
uncommitted pages cannot be evicted from pagecache
(ext4_releasepage() will refuse to free them) so I don't see how
such inode can get to dispose_list(). But maybe the inode doesn't
really have any pages and i_datasync_tid just happens to be set to
the current transaction because it is initialized that way and we
are evicting inode that was recently read from disk.

Hm, we're running 4.6, but that already has the nrpages check in
inode_lru_isolate(). There couldn't be any pages in those inodes by
the time the shrinker gets to them.

quoted

Anyway if you add: "&& inode->i_data.nrpages" to the test in
ext4_evict_inode() do the stalls go away?

Want me to still test this?

Can you try attached patch? I'd like to confirm the theory before merging
this... Thanks!

Ping? Any result with this patch?

Sorry for not getting back earlier.

I was waiting for the internal reporter of this issue to test it, but
they have since mitigated the issue by reducing the memory footprint
of their application; bad memory health caused other problems as well.

Since this is in a production environment, they're reluctant to muck
with it now that things are working.

However, with or without a reproducer, it seems to make sense to not
serialize evict() on commits when we don't have to from a correctness
point of view. Would you be opposed to just merging the patch anyway?

No, I guess the patch makes sense even without a reproducer. I'll send it
to Ted for inclusion.

								Honza

-- 
Jan Kara [off-list ref]
SUSE Labs, CR

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help