Thread: 19 messages, 9 authors, 2021-01-19

Re: fallocate(FALLOC_FL_ZERO_RANGE_BUT_REALLY) to avoid unwritten extents?

From: Dave Chinner <david@fromorbit.com>
Date: 2021-01-08 20:33:43
Also in: linux-block, linux-fsdevel, linux-xfs

On Wed, Jan 06, 2021 at 03:40:09PM -0800, Andres Freund wrote:
> Hi,
>
> On 2021-01-07 09:52:01 +1100, Dave Chinner wrote:
> > On Tue, Dec 29, 2020 at 10:28:19PM -0800, Andres Freund wrote:
> > > Which brings me to $subject:
> > >
> > > Would it make sense to add a variant of FALLOC_FL_ZERO_RANGE that
> > > doesn't convert extents into unwritten extents, but instead uses
> > > blkdev_issue_zeroout() if supported?  Mostly interested in xfs/ext4
> > > myself, but ...
> > We have explicit requests from users (think initialising large VM
> > images) that FALLOC_FL_ZERO_RANGE must never fall back to writing
> > zeroes manually.
> That behaviour makes a lot of sense for quite a few use cases - I wasn't
> trying to make it sound like it should not be available. Nor that
> FALLOC_FL_ZERO_RANGE should behave differently.
>
> > IOWs, while you might want FALLOC_FL_ZERO_RANGE to explicitly write
> > zeros, we have users who explicitly don't want it to do this.
> Right - which is why I was asking for a variant of FALLOC_FL_ZERO_RANGE
> (jokingly named FALLOC_FL_ZERO_RANGE_BUT_REALLY in the subject), rather
> than changing the behaviour.
>
> > Perhaps we should add FALLOC_FL_CONVERT_RANGE, which tells the
> > filesystem to convert an unwritten range of zeros to a written range
> > by manually writing zeros. i.e. you do FALLOC_FL_ZERO_RANGE to zero
> > the range and fill holes using metadata manipulation, followed by
> > FALLOC_FL_WRITE_RANGE to then convert the "metadata zeros" to real
> > written zeros.
> Yep, something like that would do the trick. Perhaps
> FALLOC_FL_MATERIALIZE_RANGE?

[ FWIW, I really dislike the "RANGE" part of fallocate flag names.
It's redundant (fallocate always operates on a range!) and just
makes names unnecessarily longer. ]

I used "convert range" as the name explicitly because it has
specific meaning for extent space manipulation. i.e. we "convert"
extents from one state to another. "write range" also has
explicit meaning, in that it will convert extents from unwritten to
written data.

In comparison, "materialise" is something undefined, and could be
easily thought to take something ephemeral (such as a hole) and turn
it into something real (an allocated extent). We wouldn't want this
operation to allocate space, so I think "materialise" is just too
much magic to encode into an API for an explicit, well defined
state change.

We also have people asking for ZERO_RANGE to just flip existing
extents from written to unwritten (rather than the punch/preallocate
we do now). This is also a "convert" operation, just in the other
direction (from data to zeros rather than from zeros to data).

The observation I'm making here is that these "convert" operations
will both make SEEK_HOLE/SEEK_DATA behave differently for the
underlying data. Preallocated space is considered a HOLE, written
zeros are considered DATA. So we do expose the ability to check that
a "convert" operation has actually changed the state of the
underlying extents in either direction...
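For illustration, a minimal userspace sketch of that check, assuming Linux and a filesystem that reports unwritten extents as holes. The helper name range_is_all_data() is hypothetical, not an existing API. Filesystems are allowed to report everything up to EOF as data, so a "1" result only proves the range holds written data on filesystems that actually track holes.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>      /* open(), used by callers of this helper */
#include <sys/types.h>
#include <unistd.h>

/* Linux values, in case the headers were pulled in without _GNU_SOURCE. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#define SEEK_HOLE 4
#endif

/*
 * Hypothetical helper: returns 1 if [off, off + len) is reported entirely
 * as DATA, 0 if any part of it is reported as a HOLE (which is how the
 * mail above says preallocated space shows up), and -1 on error.
 * Note that lseek() moves the file offset of fd as a side effect.
 */
int range_is_all_data(int fd, off_t off, off_t len)
{
    off_t data = lseek(fd, off, SEEK_DATA);
    if (data < 0)
        return errno == ENXIO ? 0 : -1;  /* ENXIO: no data at or after off */
    if (data != off)
        return 0;                        /* the range starts in a hole */

    off_t hole = lseek(fd, off, SEEK_HOLE);
    if (hole < 0)
        return -1;
    return hole >= off + len;            /* next hole starts past the range */
}
```

Calling it before and after a conversion on the same range would show whether the extent state actually flipped: 0 then 1 for zeros-to-data, 1 then 0 for data-to-zeros.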

CONVERT_TO_DATA/CONVERT_TO_ZERO as an operational pair whose
behaviour is visible and easily testable via SEEK_HOLE/SEEK_DATA
makes a lot more sense to me. Defining them to fail fast if
unwritten extents are not supported by the filesystem (i.e. they
should -never- physically write anything) would also allow
applications to fall back to ZERO_RANGE, which can explicitly write
zeros, on filesystems that don't support unwritten extents when
CONVERT_TO_ZERO fails...
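
A rough sketch of what that application-side fallback could look like. CONVERT_TO_ZERO does not exist, so the mode bit for it is passed in by the caller as a purely hypothetical value; zero_with_fallback() and write_zeros() are made-up names as well. The final pwrite() stage is an extra assumption for filesystems that do not implement ZERO_RANGE either, and is not part of the proposal above.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Value from linux/falloc.h, in case the headers did not expose it. */
#ifndef FALLOC_FL_ZERO_RANGE
#define FALLOC_FL_ZERO_RANGE 0x10
#endif

/* True if the last failed call reported "operation not supported". */
static int unsupported(void)
{
    return errno == EOPNOTSUPP || errno == ENOSYS;
}

/* Last resort: physically write zeros over [off, off + len). */
static int write_zeros(int fd, off_t off, off_t len)
{
    static const char zeros[65536];

    while (len > 0) {
        size_t chunk = len < (off_t)sizeof(zeros) ? (size_t)len : sizeof(zeros);
        ssize_t n = pwrite(fd, zeros, chunk, off);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0) {            /* no progress; avoid looping forever */
            errno = EIO;
            return -1;
        }
        off += n;
        len -= n;
    }
    return 0;
}

/*
 * convert_mode is the HYPOTHETICAL "convert to zero" fallocate mode bit.
 * If the kernel or filesystem rejects it as unsupported, fall back to
 * ZERO_RANGE, and if that is unsupported too, write the zeros by hand.
 * Returns 0 on success, -1 with errno set on any other failure.
 */
int zero_with_fallback(int fd, int convert_mode, off_t off, off_t len)
{
    if (fallocate(fd, convert_mode, off, len) == 0)
        return 0;
    if (!unsupported())
        return -1;

    if (fallocate(fd, FALLOC_FL_ZERO_RANGE, off, len) == 0)
        return 0;
    if (!unsupported())
        return -1;

    return write_zeros(fd, off, len);
}
```

The fail-fast behaviour is what makes this chain cheap: the first call either does a pure metadata change or returns immediately, so the caller always knows which stage did the work.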

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com