Thread: 36 messages, 8 authors, 2022-12-29

Re: [dm-devel] [PATCH RFC 0/8] Introduce provisioning primitives for thinly provisioned storage

From: Sarthak Kukreti <hidden>
Date: 2022-09-17 19:46:54
Also in: dm-devel, linux-ext4, lkml

On Fri, Sep 16, 2022 at 8:03 PM Darrick J. Wong [off-list ref] wrote:
On Thu, Sep 15, 2022 at 09:48:18AM -0700, Sarthak Kukreti wrote:
quoted
From: Sarthak Kukreti <redacted>

Hi,

This patch series is an RFC of a mechanism to pass through provision
requests on stacked thinly provisioned storage devices/filesystems.
[Reflowed text]
quoted
The Linux kernel provides several mechanisms to set up thinly
provisioned block storage abstractions (e.g. dm-thin, loop devices over
sparse files), either directly as block devices or backing storage for
filesystems. Currently, short of writing data to either the device or
filesystem, there is no way for users to pre-allocate space for use in
such storage setups. Consider the following use-cases:

1) Suspend-to-disk and resume from a dm-thin device: In order to
ensure that the underlying thinpool metadata is not modified during
the suspend mechanism, the dm-thin device needs to be fully
provisioned.
2) If a filesystem uses a loop device over a sparse file, fallocate()
on the filesystem will allocate blocks for files but the underlying
sparse file will remain sparse (no space is reserved in it).
3) Another example is a virtual machine using a sparse file/dm-thin as a
storage device; by default, allocations within the VM boundaries will
not affect the host.
4) Several storage standards support mechanisms for thin provisioning
on real hardware devices. For example:
  a. The NVMe spec 1.0b section 2.1.1 loosely talks about thin
  provisioning: "When the THINP bit in the NSFEAT field of the
  Identify Namespace data structure is set to ‘1’, the controller ...
  shall track the number of allocated blocks in the Namespace
  Utilization field"
  b. The SCSI Block Commands - 4 reference has a section on "Thin
  provisioned logical units",
  c. UFS 3.0 spec section 13.3.3 references "Thin provisioning".

In all of the above situations, currently the only way for
pre-allocating space is to issue writes (or use
WRITE_ZEROES/WRITE_SAME). However, that does not scale well with
larger pre-allocation sizes.
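
As a rough userspace illustration of the gap described in use case 2, the sketch below compares a file's logical size with the space actually backing it. The helper names (allocated_units, logical_size, make_sparse_file, preallocate) and the /tmp path are made up for this example; only fstat(), ftruncate(), mkstemp() and fallocate() with mode 0 are real interfaces. It shows a single layer only: whether preallocating here also reserves space in a lower, thinly provisioned layer is exactly what the series is about.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Space actually allocated to the file, in 512-byte units, or -1 on error. */
static long long allocated_units(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return (long long)st.st_blocks;
}

/* Logical file size in bytes, or -1 on error. */
static long long logical_size(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return (long long)st.st_size;
}

/*
 * Create an already-unlinked sparse file of `size` bytes (hypothetical
 * helper). ftruncate() only sets the logical size; no data is written.
 */
static int make_sparse_file(off_t size)
{
    char path[] = "/tmp/sparse-demo-XXXXXX";
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, size) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Ask the filesystem to back [0, size) with blocks (fallocate mode 0). */
static int preallocate(int fd, off_t size)
{
    return fallocate(fd, 0, 0, size);
}
```

On most filesystems allocated_units() stays far below the logical size for a freshly truncated file and grows once preallocate() succeeds; a filesystem that does not implement fallocate() will return an error instead.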

This patchset introduces primitives to support block-level
provisioning (note: the term 'provisioning' is used to prevent
overloading the term 'allocations/pre-allocations') requests across
filesystems and block devices. This allows fallocate() and file
creation requests to reserve space across stacked layers of block
devices and filesystems. Currently, the patchset covers a prototype on
the device-mapper targets, loop device and ext4, but the same
mechanism can be extended to other filesystems/block devices as well
as extended for use with devices in 4 a-c.
If you call REQ_OP_PROVISION on an unmapped LBA range of a block device
and then try to read the provisioned blocks, what do you get?  Zeroes?
Random stale disk contents?

I think I saw elsewhere in the thread that any mapped LBAs within the
provisioning range are left alone (i.e. not zeroed) so I'll proceed on
that basis.
For block devices, I'd say it's definitely possible to get stale data, depending
on the implementation of the allocation layer; for example, with a dm-thin pool
set up via LVM2 tools, the default is to zero out blocks on allocation.
But that's configurable and can be turned off to improve performance.

Similarly, for actual devices that end up supporting thin provisioning, unless
the specification absolutely mandates that an LBA contains zeroes post
allocation, some implementations will definitely miss out on that (probably
similar to the semantics of discard_zeroes_data today). I'm operating under
the assumption that it's possible to get stale data from LBAs allocated using
provision requests at the block layer and trying to see if we can create a
safe default operating model from that.
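
To make the stale-data case concrete, here is a toy userspace model of a thin device. It is not dm-thin code: every name in it (struct toy_thin, toy_provision, toy_discard, zero_on_alloc, the sizes) is invented for illustration. It assumes provisioning maps only the unmapped blocks in a range and leaves mapped ones alone, and that zeroing on allocation is an optional setting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define VIRT_BLOCKS 16  /* size advertised to the user (overcommitted) */
#define POOL_BLOCKS 8   /* physical blocks actually available */
#define BLOCK_SIZE  4   /* bytes per block, tiny for readability */
#define UNMAPPED    (-1)

struct toy_thin {
    unsigned char pool[POOL_BLOCKS][BLOCK_SIZE]; /* physical storage */
    bool used[POOL_BLOCKS];                      /* physical block in use? */
    int map[VIRT_BLOCKS];                        /* virtual -> physical */
    bool zero_on_alloc;                          /* zero blocks when mapping */
};

static void toy_init(struct toy_thin *t, bool zero_on_alloc)
{
    memset(t->pool, 0, sizeof(t->pool));
    memset(t->used, 0, sizeof(t->used));
    for (size_t v = 0; v < VIRT_BLOCKS; v++)
        t->map[v] = UNMAPPED;
    t->zero_on_alloc = zero_on_alloc;
}

/* Map every unmapped block in the range; mapped blocks are left alone.
 * Returns 0 on success, -1 on a bad range or when the pool runs out. */
static int toy_provision(struct toy_thin *t, size_t start, size_t count)
{
    if (start > VIRT_BLOCKS || count > VIRT_BLOCKS - start)
        return -1;
    for (size_t v = start; v < start + count; v++) {
        int p = UNMAPPED;

        if (t->map[v] != UNMAPPED)
            continue;
        for (int i = 0; i < POOL_BLOCKS; i++) {
            if (!t->used[i]) {
                p = i;
                break;
            }
        }
        if (p == UNMAPPED)
            return -1; /* pool exhausted */
        t->used[p] = true;
        t->map[v] = p;
        if (t->zero_on_alloc)
            memset(t->pool[p], 0, BLOCK_SIZE);
    }
    return 0;
}

/* Unmap the range and free the physical blocks; old bytes stay in the pool. */
static void toy_discard(struct toy_thin *t, size_t start, size_t count)
{
    for (size_t v = start; v < VIRT_BLOCKS && v < start + count; v++) {
        if (t->map[v] == UNMAPPED)
            continue;
        t->used[t->map[v]] = false;
        t->map[v] = UNMAPPED;
    }
}

/* Fill one virtual block with `byte`, allocating it on first write. */
static int toy_write(struct toy_thin *t, size_t v, unsigned char byte)
{
    if (v >= VIRT_BLOCKS || toy_provision(t, v, 1) != 0)
        return -1;
    memset(t->pool[t->map[v]], byte, BLOCK_SIZE);
    return 0;
}

/* First byte of a virtual block; unmapped blocks read as zero. */
static int toy_read(const struct toy_thin *t, size_t v)
{
    if (v >= VIRT_BLOCKS)
        return -1;
    return t->map[v] == UNMAPPED ? 0 : t->pool[t->map[v]][0];
}
```

In this model, writing a block, discarding it, and then provisioning a different virtual block can hand back the old bytes when zero_on_alloc is off, which is the behaviour a safe default has to account for.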
quoted
Patch 1 introduces REQ_OP_PROVISION as a new request type. The
provision request acts like the inverse of a discard request; instead
of notifying lower layers that the block range will no longer be used,
provision acts as a request to lower layers to provision disk space
for the given block range. Real hardware storage devices will
currently disable the provisioning capability but for the standards
listed in 4a.-c., REQ_OP_PROVISION can be overloaded for use as the
provisioning primitive for future devices.

Patch 2 implements REQ_OP_PROVISION handling for some of the
device-mapper targets. This additionally adds support for
pre-allocating space for thinly provisioned logical volumes via
fallocate().

Patch 3 implements the handling for virtio-blk.

Patch 4 introduces an fallocate() mode (FALLOC_FL_PROVISION) that
sends a provision request to the underlying block device (and beyond).
This acts as the primary mechanism for file-level provisioning.
Personally, I think it's well within the definition of fallocate mode==0
(aka preallocate) for XFS to call REQ_OP_PROVISION on the blocks that it
preallocates?  XFS always sets the unwritten flag on the file mapping,
so it doesn't matter if the device provisions space without zeroing the
contents.

That said, if devices are really allowed to expose stale disk blocks
then for blkdev fallocate I think you could get away with reusing
FALLOC_FL_NO_HIDE_STALE instead of introducing a new fallocate flag.
For filesystems, I think it's reasonable to support the mode if and only if
the filesystem can guarantee that unwritten extents return zero. For instance,
in the current ext4 implementation, the provisioned extents are still marked as
unwritten, which means a read from the file would still show all zeroes (which
I think differs from the original FALLOC_FL_NO_HIDE_STALE implementation).

That might be one more reason to keep the mode separate from the regular
modes though; to drive home the point that it is only acceptable under
the above conditions and that there's more to it than just adding
blkdev_issue_provision(..) at the end of fs_fallocate().
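
From the userspace side, a caller would presumably want to try the new mode and fall back to ordinary preallocation where the kernel or filesystem rejects it. The sketch below is one possible shape for that, under assumptions: preallocate_with_provision and enum prealloc_result are hypothetical names, and the mode bit is passed in by the caller because FALLOC_FL_PROVISION is only defined by this RFC, so no numeric value is hard-coded here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

enum prealloc_result {
    PREALLOC_FAILED = -1,     /* neither attempt succeeded */
    PREALLOC_PLAIN = 0,       /* fell back to fallocate() mode 0 */
    PREALLOC_PROVISIONED = 1, /* the provision mode was accepted */
};

/*
 * Try fallocate() with `provision_mode` first. The caller supplies the
 * mode (it would be FALLOC_FL_PROVISION from the headers of a kernel
 * carrying this series). If the mode is rejected as unsupported, retry
 * with mode 0, which preallocates in the filesystem only.
 */
static enum prealloc_result
preallocate_with_provision(int fd, off_t offset, off_t len, int provision_mode)
{
    if (fallocate(fd, provision_mode, offset, len) == 0)
        return PREALLOC_PROVISIONED;

    /* Anything other than "mode not supported" is a real error. */
    if (errno != EOPNOTSUPP && errno != EINVAL)
        return PREALLOC_FAILED;

    if (fallocate(fd, 0, offset, len) == 0)
        return PREALLOC_PLAIN;

    return PREALLOC_FAILED;
}
```

Returning a three-way result rather than 0/-1 lets the caller tell whether space was reserved all the way down or only at the filesystem level, which matters for cases like the suspend-to-disk one.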

Best
Sarthak
quoted
Patch 5 wires up the loop device handling of REQ_OP_PROVISION.

Patches 6-8 cover a prototype implementation for ext4, which includes
wiring up the fallocate() implementation, introducing a filesystem
level option (called 'provision') to control the default allocation
behaviour and finally a file level override to retain current
handling, even on filesystems mounted with 'provision'.
Hmm, I'll have a look.
quoted
Testing:
--------
- A backport of this patch series was tested on ChromiumOS using a
5.10 kernel.
- File on ext4 on a thin logical volume:
fallocate(FALLOC_FL_PROVISION) : 4.6s, dd if=/dev/zero of=...: 6 mins.

TODOs:
------
1) The stacked block devices (dm-*, loop etc.) currently
unconditionally pass through provision requests. Add support for
provision, similar to how discard handling is set up (with options to
disable, passdown or passthrough requests).
2) Blktests and Xfstests for validating provisioning.
Yes....

--D
quoted
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel