Re: Regarding ext4 extent allocation strategy
From: Shyam Prasad N <hidden>
Date: 2021-07-13 12:57:52
On Tue, Jul 13, 2021 at 5:09 PM Theodore Y. Ts'o [off-list ref] wrote:
> On Tue, Jul 13, 2021 at 12:22:14PM +0530, Shyam Prasad N wrote:
>> Our team in Microsoft, which works on the Linux SMB3 client kernel
>> filesystem, has recently been exploring the use of fscache on top of
>> ext4 for caching the network filesystem data for some customer
>> workloads.
>>
>> However, the maintainer of fscache (David Howells) recently warned us
>> that a few other extent-based filesystem developers pointed out a
>> theoretical bug in the current implementation of fscache/cachefiles.
>> It currently does not maintain separate metadata for the cached data
>> it holds, but instead uses the sparseness of the underlying filesystem
>> to track the ranges of the data that is being cached.
>> The bug that has been pointed out with this is that the underlying
>> filesystems could bridge holes between data ranges with zeroes or
>> punch holes in data ranges that contain zeroes. (@David please add if
>> I missed something).
>>
>> David has already begun working on the fix to this by maintaining the
>> metadata of the cached ranges in fscache itself.
>> However, since it could take some time for this fix to be approved and
>> then backported by various distros, I'd like to understand if there is
>> a potential problem in using fscache on top of ext4 without the fix.
>> If ext4 doesn't do any such optimizations on the data ranges, or has a
>> way to disable such optimizations, I think we'll be okay to use the
>> older versions of fscache even without the fix mentioned above.
>
> Yes, the tuning knob you are looking for is:
>
> What:         /sys/fs/ext4/<disk>/extent_max_zeroout_kb
> Date:         August 2012
> Contact:      "Theodore Ts'o" [off-list ref]
> Description:
>               The maximum number of kilobytes which will be zeroed
>               out in preference to creating a new uninitialized
>               extent when manipulating an inode's extent tree.  Note
>               that using a larger value will increase the
>               variability of time necessary to complete a random
>               write operation (since a 4k random write might turn
>               into a much larger write due to the zeroout
>               operation).
>
> (From Documentation/ABI/testing/sysfs-fs-ext4)
>
> The basic idea here is that with a random workload, with HDDs, the
> cost of writing a 16k random write is not much more than the time to
> write a 4k random write; that is, the cost of HDD seeks dominates.
> There is also a cost in having many additional entries in the extent
> tree. So if we have a fallocated region, e.g.:
>
> +-------------+---+---+---+----------+---+---+---------+ ...
> + Uninit (U)  | W | U | W |  Uninit  | W | U | Written | ...
> +-------------+---+---+---+----------+---+---+---------+
>
> It's more efficient to have the extent tree look like this:
>
> +-------------+-----------+----------+---+---+---------+ ...
> + Uninit (U)  |  Written  |  Uninit  | W | U | Written | ...
> +-------------+-----------+----------+---+---+---------+
>
> And just simply write zeros to the first "U" in the above figure.
>
> The default value of extent_max_zeroout_kb is 32k. This optimization
> can be disabled by setting extent_max_zeroout_kb to 0. The downside
> of this is a potential degradation of a random write workload (using
> for example the fio benchmark program) on that file system.
>
> Cheers,
>
>                                         - Ted
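For reference, reading and changing the knob Ted describes can be sketched in Python as below. This is only a minimal sketch: the helper names are made up for illustration, the device name "sdb1" in the usage note is a placeholder for whatever appears under /sys/fs/ext4 on the machine in question, and the sysfs_root parameter exists only so the helpers can be exercised against a scratch directory.

```python
from pathlib import Path

# Standard sysfs location of the ext4 per-filesystem tunables.
SYSFS_EXT4 = Path("/sys/fs/ext4")


def zeroout_knob(disk, sysfs_root=SYSFS_EXT4):
    """Path of extent_max_zeroout_kb for one mounted ext4 filesystem."""
    return Path(sysfs_root) / disk / "extent_max_zeroout_kb"


def get_max_zeroout_kb(disk, sysfs_root=SYSFS_EXT4):
    """Return the current zeroout limit in kilobytes."""
    return int(zeroout_knob(disk, sysfs_root).read_text().strip())


def set_max_zeroout_kb(disk, kb, sysfs_root=SYSFS_EXT4):
    """Write a new zeroout limit; 0 disables the zeroout optimization."""
    if kb < 0:
        raise ValueError("extent_max_zeroout_kb cannot be negative")
    zeroout_knob(disk, sysfs_root).write_text(f"{kb}\n")
```

Running set_max_zeroout_kb("sdb1", 0) as root would disable the optimization for that filesystem. Sysfs settings of this kind are generally not persistent, so expect to reapply the value after each mount.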
Hi Ted,

Thanks for pointing this out. We'll look into the use of this option.

Is this parameter also respected when a hole is punched in the middle
of an allocated data extent? That is, is there still a possibility that
a punched hole does not translate to splitting the data extent, even
when extent_max_zeroout_kb is set to 0?

-- 
Regards,
Shyam
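The sparseness discussed in this thread can be observed from user space with lseek() and SEEK_DATA/SEEK_HOLE. The sketch below (the function name data_ranges is an assumption for illustration, not anything from fscache or ext4) lists the byte ranges a filesystem currently reports as data. It shows only what the filesystem exposes at that moment and says nothing about what ext4 guarantees; a filesystem without hole support may report the whole file as one data range.

```python
import errno
import os


def data_ranges(path):
    """Return [(start, end), ...] byte ranges the filesystem reports as data."""
    ranges = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                # Next offset at or after `offset` that holds data.
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:
                    break  # nothing but a hole from here to end of file
                raise
            # End of this data run: the next hole, or end of file.
            end = os.lseek(fd, start, os.SEEK_HOLE)
            ranges.append((start, end))
            offset = end
    finally:
        os.close(fd)
    return ranges
```

Comparing the output before and after a write or a hole punch on a test file is one way to see whether holes were bridged or kept, under whatever extent_max_zeroout_kb setting is in effect.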