Re: Linux-next parallel cp workload hang

From: Xiong Zhou <hidden>
Date: 2016-05-18 08:31:50
Also in: linux-fsdevel, lkml

Hi,

On Wed, May 18, 2016 at 03:56:34PM +1000, Dave Chinner wrote:

On Wed, May 18, 2016 at 09:46:15AM +0800, Xiong Zhou wrote:

quoted

Hi,

Parallel cp workload (xfstests generic/273) hangs like blow.
It's reproducible with a small chance, less the 1/100 i think.

Have hit this in linux-next 20160504 0506 0510 trees, testing on
xfs with loop or block device. Ext4 survived several rounds
of testing.

Linux next 20160510 tree hangs within 500 rounds testing several
times. The same tree with vfs parallel lookup patchset reverted
survived 900 rounds testing. Reverted commits are attached.  >

What hardware?

A HP prototype host.

quoted

Bisecting in this patchset ided this commit:

3b0a3c1ac1598722fc289da19219d14f2a37b31f is the first bad commit
commit 3b0a3c1ac1598722fc289da19219d14f2a37b31f
Author: Al Viro [off-list ref]
Date:   Wed Apr 20 23:42:46 2016 -0400

    simple local filesystems: switch to ->iterate_shared()
    
    no changes needed (XFS isn't simple, but it has the same parallelism
    in the interesting parts exercised from CXFS).
    
With this commit reverted on top of Linux next 0510 tree, 5000+ rounds
of testing passed.

Although 2000 rounds testing had been conducted before good/bad
verdict, i'm not 100 percent sure about all this, since it's
so hard to hit, and i am not that lucky..

Bisect log and full blocked state process dump log are also attached.

Furthermore, this was first hit when testing fs dax on nvdimm,
however it's reproducible without dax mount option, and also
reproducible on loop device, just seems harder to hit.

Thanks,
Xiong

[0.771475] INFO: task cp:49033 blocked for more than 120 seconds.
[0.794263]       Not tainted 4.6.0-rc6-next-20160504 #5
[0.812515] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[0.841801] cp              D ffff880b4e977928     0 49033  49014
0x00000080
[0.868923]  ffff880b4e977928 ffff880ba275d380 ffff880b8d712b80
ffff880b4e978000
[0.897504]  7fffffffffffffff 0000000000000002 0000000000000000
ffff880b8d712b80
[0.925234]  ffff880b4e977940 ffffffff816cbc25 ffff88035a1dabb0
ffff880b4e9779e8
[0.953237] Call Trace:
[0.958314]  [<ffffffff816cbc25>] schedule+0x35/0x80
[0.974854]  [<ffffffff816ceb71>] schedule_timeout+0x231/0x2d0
[0.995728]  [<ffffffff810d2e2d>] ? down_trylock+0x2d/0x40
[1.015351]  [<ffffffffa03fa6c2>] ? xfs_iext_bno_to_ext+0xa2/0x190 [xfs]
[1.040182]  [<ffffffff816cdb89>] __down_common+0xaa/0x104
[1.059021]  [<ffffffffa04076d2>] ? _xfs_buf_find+0x162/0x340 [xfs]
[1.081357]  [<ffffffff816cdc00>] __down+0x1d/0x1f
[1.097166]  [<ffffffff810d2e81>] down+0x41/0x50
[1.112869]  [<ffffffffa04074bc>] xfs_buf_lock+0x3c/0xf0 [xfs]
[1.134504]  [<ffffffffa04076d2>] _xfs_buf_find+0x162/0x340 [xfs]
[1.156871]  [<ffffffffa04078da>] xfs_buf_get_map+0x2a/0x270 [xfs]

So what's holding that directory data buffer lock? It should only be
held if there is either IO in progress, or a modification of the
buffer in progress that is blocked somewhere else.

quoted

[1.180010]  [<ffffffffa04084ed>] xfs_buf_read_map+0x2d/0x180 [xfs]
[1.203538]  [<ffffffffa0435f21>] xfs_trans_read_buf_map+0xf1/0x300 [xfs]
[1.229194]  [<ffffffffa03e75b1>] xfs_da_read_buf+0xd1/0x100 [xfs]
[1.251948]  [<ffffffffa03ed5e6>] xfs_dir3_data_read+0x26/0x60 [xfs]
[1.275736]  [<ffffffffa040923e>] xfs_dir2_leaf_readbuf.isra.12+0x1be/0x4a0 [xfs]
[1.305094]  [<ffffffff816cdc92>] ? down_read+0x12/0x30
[1.323787]  [<ffffffffa04187a4>] ? xfs_ilock+0xe4/0x110 [xfs]
[1.345114]  [<ffffffffa040965b>] xfs_dir2_leaf_getdents+0x13b/0x3d0 [xfs]
[1.371818]  [<ffffffffa0409f26>] xfs_readdir+0x1a6/0x1c0 [xfs]

So we should be holding the ilock in shared mode here...

quoted

[1.393471]  [<ffffffffa040bcbb>] xfs_file_readdir+0x2b/0x30 [xfs]
[1.416874]  [<ffffffff8122cec3>] iterate_dir+0x173/0x190
[1.436709]  [<ffffffff81003276>] ? do_audit_syscall_entry+0x66/0x70
[1.460951]  [<ffffffff8122d338>] SyS_getdents+0x98/0x120
[1.480566]  [<ffffffff8122cee0>] ? iterate_dir+0x190/0x190
[1.500909]  [<ffffffff81003b12>] do_syscall_64+0x62/0x110
[1.520847]  [<ffffffff816cf9a1>] entry_SYSCALL64_slow_path+0x25/0x25
[1.545372] INFO: task cp:49040 blocked for more than 120 seconds.
[1.568933]       Not tainted 4.6.0-rc6-next-20160504 #5
[1.587943] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[1.618544] cp              D ffff880b91463b00     0 49040  49016
0x00000080
[1.645502]  ffff880b91463b00 ffff880464d5c140 ffff88029b475700
ffff880b91464000
[1.674145]  ffff880411c42610 ffffffffffffffff ffff880411c42628
ffff8802c10bc610
[1.702834]  ffff880b91463b18 ffffffff816cbc25 ffff88029b475700
ffff880b91463b88
[1.731501] Call Trace:
[1.736866]  [<ffffffff816cbc25>] schedule+0x35/0x80
[1.754119]  [<ffffffff816ce3c2>] rwsem_down_read_failed+0xf2/0x140
[1.777411]  [<ffffffffa0418800>] ? xfs_ilock_data_map_shared+0x30/0x40
[xfs]
[1.805090]  [<ffffffff81355e08>] call_rwsem_down_read_failed+0x18/0x30
[1.830482]  [<ffffffff816cdca0>] down_read+0x20/0x30
[1.848505]  [<ffffffffa04187a4>] xfs_ilock+0xe4/0x110 [xfs]
[1.869293]  [<ffffffffa0418800>] xfs_ilock_data_map_shared+0x30/0x40

And it this is an attempt to lock the inode shared, so if that is
failing while there's another shared holder, than means there's an
exclusive waiter queued up (i.e. read iheld -> write blocked -> read
blocked).


So looking at dump-g273xfs0510:

[  845.727907] INFO: task cp:40126 blocked for more than 120 seconds.
[  845.751175]       Not tainted 4.6.0-rc7-next-20160510 #9
[  845.770011] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  845.799967] cp              D ffff88046b64bcb8     0 40126  40106 0x00000080
[  845.826636]  ffff88046b64bcb8 ffffffffa0458d36 ffff8804685c2b80 0000000002408240
[  845.855232]  ffff88046b64c000 ffff880426b08450 ffff880426b08450 ffff880426b08468
[  845.883836]  ffff880426b08458 ffff88046b64bcd0 ffffffff816ce9c5 ffff8804685c2b80
[  845.911854] Call Trace:
[  845.917076]  [<ffffffffa0458d36>] ? kmem_zone_alloc+0x96/0x120 [xfs]
[  845.940302]  [<ffffffff816ce9c5>] schedule+0x35/0x80
[  845.957118]  [<ffffffff816d1534>] rwsem_down_write_failed+0x114/0x230
[  845.981242]  [<ffffffffa045a294>] ? xlog_grant_head_check+0x54/0x100 [xfs]
[  846.007673]  [<ffffffffa04492e4>] ? xfs_vn_update_time+0xd4/0x160 [xfs]
[  846.032009]  [<ffffffff81358217>] call_rwsem_down_write_failed+0x17/0x30
[  846.056909]  [<ffffffff816d0a7d>] down_write+0x2d/0x40
[  846.074580]  [<ffffffffa044b78f>] xfs_ilock+0xcf/0x110 [xfs]
[  846.094681]  [<ffffffffa04492e4>] xfs_vn_update_time+0xd4/0x160 [xfs]
[  846.118308]  [<ffffffff81238648>] touch_atime+0xa8/0xd0
[  846.136385]  [<ffffffff8122f132>] iterate_dir+0x112/0x190
[  846.155995]  [<ffffffff8122f608>] SyS_getdents+0x98/0x120
[  846.175819]  [<ffffffff8122f1b0>] ? iterate_dir+0x190/0x190
[  846.195808]  [<ffffffff81003b12>] do_syscall_64+0x62/0x110
[  846.215019]  [<ffffffff816d2761>] entry_SYSCALL64_slow_path+0x25/0x25


Timestamp updates are transactional, requiring exclusive access to
the directory ilock. That's why we have a train-smash of down_read()
calls blocking in these dumps - they are queued up behind this call.
This is expected under workloads where a couple of hundred threads
are all reading the same directory and trying to open files in it.

So why isn't the current read holder making progress? Why is it
blocked on a directory data buffer lock when the only accesses
should be read only?

There's nothing in either of the stack traces that tells me this. I
suspect that there may be some kind of buffer readahead vs buffer
read interaction that results in leaking a buffer lock, but it is
not obvious what that may be yet.

Can you reproduce this with CONFIG_XFS_DEBUG=y set? if you can, and
it doesn't trigger any warnings or asserts, can you then try to
reproduce it while tracing the following events:

	xfs_buf_lock
	xfs_buf_lock_done
	xfs_buf_trylock
	xfs_buf_unlock

So we might be able to see if there's an unexpected buffer
locking/state pattern occurring when the hang occurs?

Yes, i've reproduced this with both CONFIG_XFS_DEBUG=y and the tracers
on. There are some trace output after hang for a while. Tracing output,
dmesg, block stack dump and config are attached. BTW, reproduced on
mainline tree this time, to commit 0b7962a.

Also, if you run on slower storage, does the hang get harder or
easier to hit?

Harder to me.

From easy to hard, or hard to harder, they are nvdimm device with dax

mount option , then nvdimm device without dax mount option, lood device.

Thanks,
Xiong

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

Attachments

g273-mainline.tar.gz [application/gzip] 139695 bytes

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help