Re: [PATCH] mke2fs: Add extended option for prezeroed storage devices
From: Sarthak Kukreti <hidden>
Date: 2021-09-27 10:43:58
Thanks for reviewing the patch, Andreas!

On Tue, Sep 21, 2021 at 2:39 PM Andreas Dilger [off-list ref] wrote:
On Sep 20, 2021, at 9:42 PM, Sarthak Kukreti [off-list ref] wrote:
From: Sarthak Kukreti <redacted>
...
quoted
Additionally, on thinly provisioned storage devices (like Ceph, dm-thin),... and newly-created sparse loopback files
Thanks for pointing that out, added to the commit message in v2. ...
quoted
Testing on ChromeOS (running Linux kernel 4.19) with dm-thin and 200GB thin logical volumes using 'mke2fs -t ext4 <dev>':
- Time taken by mke2fs drops from 1.07s to 0.08s.
- Avoiding zeroing out the inode table and journal reduces the initial metadata space allocation from 0.48% to 0.01%.
- Lazy inode table zeroing results in a further 1.45% of logical volume space getting allocated for inode tables, even if no file data is added to the filesystem. With assume_storage_prezeroed, the metadata allocation remains at 0.01%.

This seems beneficial, but I'm wondering if this could also be done automatically when TRIM/DISCARD is used by mke2fs to erase a device? One safe option to do this automatically would be to start by *reading* the disk blocks and check if they are all zero, and only switch to zero-block writes if any block is found with non-zero data. That would avoid the extra space usage from zero-block writes in the above cases, and also work for the huge majority of users that won't know the "assume_storage_prezeroed" option even exists, though it won't necessarily reduce the runtime.
I agree with Ted (quoting a reply on a forked thread below) that reading all inode table blocks on the device will slow down mke2fs a lot, depending on the storage medium and size. Maybe it can be done instead at first mount, in conjunction with lazy_itable_init, i.e. ext4 reads the block and only issues a zero-out if the block is not already zero? Even so, an explicit hint would be compatible with this approach: it avoids (unnecessarily) reading through all the inode table blocks as long as the hint was passed at creation time.

On Wed, Sep 22, 2021 at 8:57 PM Theodore Ts'o [off-list ref] wrote:
The problem is mke2fs really does need to care about the performance of discard or write same. Users want mke2fs to be fast, especially during the distro installation process. That's why we implemented the lazy inode table initialization feature in the first place. So reading each block from the inode table to see if it's zero might be slow, and so we might be better off just doing the lazy itable init instead.
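To make the trade-off discussed above concrete, here is a rough sketch of the "read first, zero only if needed" idea combined with an explicit hint that skips the reads entirely. This is not code from the patch or from e2fsprogs: the helper name zero_blocks_if_needed() and its parameters are made up for illustration, and it uses plain POSIX pread()/pwrite() on a file descriptor rather than the libext2fs I/O layer.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Return true if every byte in buf is zero. */
static bool buffer_is_zero(const unsigned char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (buf[i] != 0)
			return false;
	return true;
}

/*
 * Hypothetical helper (not part of e2fsprogs): make sure blocks
 * [first, first + count) of fd read back as zero.
 *
 * With assume_prezeroed set, the caller's hint is trusted and no I/O
 * is issued at all. Otherwise each block is read and rewritten with
 * zeroes only if it contains non-zero data.
 *
 * Returns the number of blocks rewritten, or -1 on error.
 */
static long zero_blocks_if_needed(int fd, off_t first, long count,
				  size_t blksz, bool assume_prezeroed)
{
	unsigned char *buf, *zeros;
	long written = 0;

	assert(blksz > 0);
	if (assume_prezeroed)
		return 0;

	buf = malloc(blksz);
	zeros = calloc(1, blksz);
	if (!buf || !zeros) {
		free(buf);
		free(zeros);
		return -1;
	}

	for (long i = 0; i < count; i++) {
		off_t off = (first + i) * (off_t)blksz;

		if (pread(fd, buf, blksz, off) != (ssize_t)blksz) {
			written = -1;
			break;
		}
		if (buffer_is_zero(buf, blksz))
			continue;	/* already zero: no write, no allocation */
		if (pwrite(fd, zeros, blksz, off) != (ssize_t)blksz) {
			written = -1;
			break;
		}
		written++;
	}

	free(buf);
	free(zeros);
	return written;
}
```

Without the hint, the sketch still reads every block, which is the runtime cost raised above; it only saves the writes (and the thin-pool allocation they would trigger). With the hint, both the reads and the writes are skipped.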
...
quoted
+	if (assume_storage_prezeroed) {
+	  if (verbose)
+	    printf("%s",
+	           _("Assuming the storage device is prezeroed "
+	           "- skipping inode table and journal wipe\n"));
+
+	  lazy_itable_init = 1;
+	  itable_zeroed = 1;
+	  zero_hugefile = 0;
+	  journal_flags |= EXT2_MKJOURNAL_LAZYINIT;
+	}

Indentation appears to be broken here - only 2 spaces instead of a tab.

This is also missing any kind of test case. Since a large number of the e2fsck test cases are using loopback filesystems created on a sparse file, this would both be good test cases, as well as reducing time/space used during testing.
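For reference, the property of sparse files that makes them a good fit for this option can be shown with a small sketch: a file extended with ftruncate() reads back as zeroes without any data being written. The helper names make_sparse_image() and allocated_bytes() are hypothetical and are not taken from the e2fsprogs test suite; the 512-byte unit for st_blocks is what Linux uses and is an assumption elsewhere.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an unlinked temporary file of `size`
 * bytes without writing any data. The holes read back as zeroes.
 * Returns an open file descriptor, or -1 on error.
 */
static int make_sparse_image(off_t size)
{
	char path[] = "/tmp/sparse-img-XXXXXX";
	int fd;

	assert(size >= 0);
	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);		/* file disappears once fd is closed */
	if (ftruncate(fd, size) != 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Bytes actually allocated for the file, or -1 on error.
 * Assumes st_blocks counts 512-byte units, as on Linux.
 */
static long long allocated_bytes(int fd)
{
	struct stat st;

	if (fstat(fd, &st) != 0)
		return -1;
	return (long long)st.st_blocks * 512;
}
```

On filesystems that support holes, allocated_bytes() for such an image should stay far below its apparent size until something writes to it, whereas a file filled with dd is allocated in full up front.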
Oops, thanks for catching that! Fixed in v2 and I added a test case for this option. I was playing around with adding the option as a default to tests/mke2fs.conf.in; that didn't affect the overall test run time much (a lot of the tests seem to be dd'ing entire files and not using sparse files).

Best,
Sarthak