Re: [RFC][PATCH 0/8] xfstests: rework large filesystem testing

From: Dave Chinner <david@fromorbit.com>
Date: 2011-11-22 09:28:43
Also in: linux-fsdevel, linux-xfs

On Mon, Nov 21, 2011 at 07:10:45AM -0500, Theodore Tso wrote:

On Nov 21, 2011, at 6:31 AM, Dave Chinner wrote:

quoted

ext4, however, still has serious issues with this - either we take
the mkfs.ext4 time hit to initialise all the block groups, or we
take it during the preallocation.  IOWs, the "don't do work at mkfs
but do it after mount" hack^Wtradeoff simply does not work for
testing large filesystems in this manner.  While it is possible to
run large filesystem tests on ext4 using this mechanism, it is
extremely painful to do so.

For testing, we can disable the "do it after the mount " aspect
of  ext4 by using the mount option "noinit_itable".   We basically
only need to zero the inode table to make sure e2fsck doesn't
confuse old inode tables as new ones in the event that the block

It's not the deferred inode table initialisation that is the problem
for the preallocation immediately after a mkfs and mount - it's
initialising block groups that is the problem:

363806.042907] SysRq : Show Blocked State
[363806.044586]   task                        PC stack   pid father
[363806.046400] xfs_io          D ffff8801099aed08     0  7264   7064 0x00000000
[363806.046400]  ffff880117e33868 0000000000000086 0000000000000000 ffffffffb13a2903
[363806.046400]  ffff8801099ae980 ffff880117e33fd8 ffff880117e33fd8 ffff880117e33fd8
[363806.046400]  ffff88011afb44c0 ffff8801099ae980 ffff880117e33868 00000001810b59ed
[363806.046400] Call Trace:
[363806.046400]  [<ffffffff8118eec0>] ? __wait_on_buffer+0x30/0x30
[363806.046400]  [<ffffffff81aab3af>] schedule+0x3f/0x60
[363806.046400]  [<ffffffff81aab45f>] io_schedule+0x8f/0xd0
[363806.046400]  [<ffffffff8118eece>] sleep_on_buffer+0xe/0x20
[363806.046400]  [<ffffffff81aabc2f>] __wait_on_bit+0x5f/0x90
[363806.046400]  [<ffffffff8167e177>] ? generic_make_request+0xc7/0x100
[363806.046400]  [<ffffffff8118eec0>] ? __wait_on_buffer+0x30/0x30
[363806.046400]  [<ffffffff81aabcdc>] out_of_line_wait_on_bit+0x7c/0x90
[363806.046400]  [<ffffffff810ac360>] ? autoremove_wake_function+0x40/0x40
[363806.046400]  [<ffffffff8118eebe>] __wait_on_buffer+0x2e/0x30
[363806.046400]  [<ffffffff812824c3>] ext4_mb_init_cache+0x223/0x9c0
[363806.046400]  [<ffffffff81118583>] ? add_to_page_cache_locked+0xb3/0x100
[363806.046400]  [<ffffffff81282dae>] ext4_mb_init_group+0x14e/0x210
[363806.046400]  [<ffffffff812832d9>] ext4_mb_load_buddy+0x339/0x350
[363806.046400]  [<ffffffff8128465b>] ext4_mb_find_by_goal+0x6b/0x2b0
[363806.046400]  [<ffffffff81285034>] ext4_mb_regular_allocator+0x64/0x430
[363806.046400]  [<ffffffff81286d8d>] ext4_mb_new_blocks+0x40d/0x560
[363806.046400]  [<ffffffff81aad1ee>] ? _raw_spin_lock+0xe/0x20
[363806.046400]  [<ffffffff81aad1ee>] ? _raw_spin_lock+0xe/0x20
[363806.046400]  [<ffffffff8127c6a1>] ext4_ext_map_blocks+0xfa1/0x1d10
[363806.046400]  [<ffffffff8129a6aa>] ? jbd2__journal_start+0xca/0x110
[363806.046400]  [<ffffffff81252535>] ext4_map_blocks+0x1b5/0x280
[363806.046400]  [<ffffffff8127ddf5>] ext4_fallocate+0x1c5/0x530
[363806.046400]  [<ffffffff8115e992>] do_fallocate+0xf2/0x160
[363806.046400]  [<ffffffff8115ea4b>] sys_fallocate+0x4b/0x70
[363806.046400]  [<ffffffff81ab5082>] system_call_fastpath+0x16/0x1b

this initialisation runs at about 50MB/s for some periods of the
preallocation. Sample from iostat -d -x -m 5:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vdc               0.00  1352.30   46.91  178.64     0.18    51.44   468.74     4.38   19.42   18.57   19.64   4.00  90.30

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vdc               0.00  1405.40   47.20  184.40     0.18    50.97   452.34     5.99   25.84   18.31   27.77   3.91  90.56

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vdc               0.00  4302.40   38.60  377.40     0.15    57.49   283.79    31.68   76.17   23.50   81.55   2.20  91.68

shows it is close to IO bound. This in on a 12 disk RAID-0 array w/
a 512MB BBWC. That indicates that most of the IO being done is
random. perf top shows that that the limited amount of CPU time
being spent is distributed like this:

            samples  pcnt function                      DSO
             _______ _____ _____________________________ _________________

               83.00  6.7% ext4_init_block_bitmap        [kernel.kallsyms]
               82.00  6.6% crc16                         [kernel.kallsyms]
               73.00  5.9% __find_get_block              [kernel.kallsyms]
               65.00  5.2% ext4_num_overhead_clusters    [kernel.kallsyms]
               62.00  5.0% ext4_set_bits                 [kernel.kallsyms]
               56.00  4.5% ext4_ext_find_extent          [kernel.kallsyms]
               55.00  4.4% ext4_mark_iloc_dirty          [kernel.kallsyms]
               53.00  4.3% jbd2_journal_add_journal_head [kernel.kallsyms]
               50.00  4.0% do_get_write_access           [kernel.kallsyms]
               45.00  3.6% mb_find_order_for_block       [kernel.kallsyms]
               41.00  3.3% ext4_ext_map_blocks           [kernel.kallsyms]
               34.00  2.7% jbd2_journal_cancel_revoke    [kernel.kallsyms]
               28.00  2.3% jbd2_journal_dirty_metadata   [kernel.kallsyms]
               27.00  2.2% jbd2_journal_put_journal_head [kernel.kallsyms]

The rest of the time, there is no IO and the preallocation is
is severely CPU bound. Top shows:

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7523 root      20   0 10848  792  636 R   99  0.0   0:23.39 xfs_io

and perf top -p <pid of xfs_io> shows:

            samples  pcnt function                    DSO
             _______ _____ ___________________________ _________________

            13840.00 89.2% ext4_mb_good_group          [kernel.kallsyms]
             1218.00  7.8% ext4_mb_regular_allocator   [kernel.kallsyms]
              148.00  1.0% mb_find_order_for_block     [kernel.kallsyms]
               85.00  0.5% find_next_zero_bit          [kernel.kallsyms]
               78.00  0.5% radix_tree_lookup_element   [kernel.kallsyms]
               54.00  0.3% find_get_page               [kernel.kallsyms]
               53.00  0.3% mb_find_extent.constprop.31 [kernel.kallsyms]
               12.00  0.1% mb_find_buddy               [kernel.kallsyms]
               10.00  0.1% ext4_mb_load_buddy          [kernel.kallsyms]

which, if I read the code correctly, is CPU bound searching for a
block group to allocate from.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help