Re: Regarding ext4 extent allocation strategy

From: Shyam Prasad N <hidden>
Date: 2021-07-13 12:57:52

On Tue, Jul 13, 2021 at 5:09 PM Theodore Y. Ts'o [off-list ref] wrote:

On Tue, Jul 13, 2021 at 12:22:14PM +0530, Shyam Prasad N wrote:

Our team at Microsoft, which works on the Linux SMB3 client kernel
filesystem, has recently been exploring the use of fscache on top of
ext4 for caching network filesystem data for some customer
workloads.

However, the maintainer of fscache (David Howells) recently warned us
that a few other extent based filesystem developers pointed out a
theoretical bug in the current implementation of fscache/cachefiles.
It currently does not maintain separate metadata for the cached data
it holds, but instead uses the sparseness of the underlying filesystem
to track the ranges of data that are cached.
The bug that has been pointed out is that the underlying filesystem
could bridge holes between data ranges with zeroes, or punch holes in
data ranges that contain zeroes. (@David please add if I missed
something).
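
To make the concern concrete, here is a minimal sketch of what "using
sparseness to track cached ranges" can look like from user space. It is
an illustration only, not the cachefiles implementation: the function
name data_ranges is made up for this example, and it assumes a Linux
system where lseek() supports SEEK_DATA and SEEK_HOLE.

```python
import errno
import os


def data_ranges(path):
    """Return [(start, end), ...] for regions the filesystem reports as data.

    Hypothetical helper for illustration; offsets are in bytes.
    """
    ranges = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:  # no data at or after offset
                    break
                raise
            # There is always an implicit hole at end of file.
            end = os.lseek(fd, start, os.SEEK_HOLE)
            ranges.append((start, end))
            offset = end
    finally:
        os.close(fd)
    return ranges
```

If the filesystem fills a hole with zeroes, or turns a run of zeroes
into a hole, the ranges reported by a scan like this no longer match
the ranges that were actually written, which is the failure mode
described above.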

David has already begun working on the fix to this by maintaining the
metadata of the cached ranges in fscache itself.
However, since it could take some time for this fix to be approved and
then backported by various distros, I'd like to understand if there is
a potential problem in using fscache on top of ext4 without the fix.
If ext4 doesn't do any such optimizations on the data ranges, or has a
way to disable such optimizations, I think we'll be okay to use the
older versions of fscache even without the fix mentioned above.

Yes, the tuning knob you are looking for is:

What:           /sys/fs/ext4/<disk>/extent_max_zeroout_kb
Date:           August 2012
Contact:        "Theodore Ts'o" [off-list ref]
Description:
                The maximum number of kilobytes which will be zeroed
                out in preference to creating a new uninitialized
                extent when manipulating an inode's extent tree.  Note
                that using a larger value will increase the
                variability of time necessary to complete a random
                write operation (since a 4k random write might turn
                into a much larger write due to the zeroout
                operation).

(From Documentation/ABI/testing/sysfs-fs-ext4)

The basic idea here is that with a random workload on HDDs, the
cost of a 16k random write is not much more than the cost of a
4k random write; that is, the cost of HDD seeks dominates.
There is also a cost in having many additional entries in the extent
tree.  So if we have a fallocated region, e.g.:

    +-------------+---+---+---+----------+---+---+---------+
... + Uninit (U)  | W | U | W |   Uninit | W | U | Written | ...
    +-------------+---+---+---+----------+---+---+---------+

It's more efficient to have the extent tree look like this

    +-------------+-----------+----------+---+---+---------+
... + Uninit (U)  |  Written  |   Uninit | W | U | Written | ...
    +-------------+-----------+----------+---+---+---------+

And simply write zeros to the short "U" between the first two "W"
extents in the first figure.
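
As a rough illustration of that trade-off, the toy model below shows
how a threshold decides whether a small uninitialized leftover is
zero-filled and folded into the written extent or kept as its own
extent. This is a deliberately simplified sketch, not the ext4
algorithm: the function name and the per-piece rule are assumptions
made for this example, and the kernel's actual decision logic is more
involved.

```python
def write_into_uninit(extent_kb, write_start_kb, write_len_kb, max_zeroout_kb):
    """Toy model of a write landing inside one uninitialized extent.

    Returns [(state, length_kb), ...] where state is "U" (uninitialized)
    or "W" (written). All names and rules here are illustrative.
    """
    left = write_start_kb
    right = extent_kb - write_start_kb - write_len_kb
    pieces = []
    for state, length in (("U", left), ("W", write_len_kb), ("U", right)):
        if length == 0:
            continue
        # Assumed rule: a leftover no larger than the threshold is
        # zero-filled, so it becomes part of the written data.
        if state == "U" and length <= max_zeroout_kb:
            state = "W"
        # Coalesce with the previous piece when the states match.
        if pieces and pieces[-1][0] == state:
            pieces[-1] = (state, pieces[-1][1] + length)
        else:
            pieces.append((state, length))
    return pieces


# A 4k write at offset 8k inside a 64k uninitialized extent:
print(write_into_uninit(64, 8, 4, 32))  # [('W', 12), ('U', 52)]
print(write_into_uninit(64, 8, 4, 0))   # [('U', 8), ('W', 4), ('U', 52)]
```

With a threshold of 0 the model never zero-fills anything, so the
written range stays exactly what the application wrote, at the price
of more extents.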

The default value of extent_max_zeroout_kb is 32k.  This optimization
can be disabled by setting extent_max_zeroout_kb to 0.  The downside
of this is a potential degradation of a random write workload (using,
for example, the fio benchmark program) on that file system.

Cheers,

                                                - Ted

Hi Ted,

Thanks for pointing this out. We'll look into the use of this option.

Is this parameter also respected when a hole is punched in the
middle of an allocated data extent? That is, is there still a
possibility that a punched hole does not translate to splitting the
data extent, even when extent_max_zeroout_kb is set to 0?

-- 
Regards,
Shyam
