Re: [PATCH 09/10] iomap: add a IOMAP_DIO_NOALLOC flag

From: Avi Kivity <hidden>
Date: 2021-01-14 10:44:30
Also in: linux-fsdevel

On 1/14/21 12:23 PM, Brian Foster wrote:

On Thu, Jan 14, 2021 at 09:49:35AM +1100, Dave Chinner wrote:

quoted

On Wed, Jan 13, 2021 at 10:32:15AM -0500, Brian Foster wrote:

quoted

On Wed, Jan 13, 2021 at 10:29:23AM +1100, Dave Chinner wrote:

quoted

On Tue, Jan 12, 2021 at 05:26:15PM +0100, Christoph Hellwig wrote:

quoted

Add a flag to request that the iomap instances do not allocate blocks
by translating it to another new IOMAP_NOALLOC flag.

Except that "no allocation" is not what XFS needs for concurrent
sub-block DIO.

We are trying to avoid external sub-block IO outside the range of
the user data IO (COW, sub-block zeroing, etc) so that we don't
trash adjacent sub-block IO in flight. This means we can't do
sub-block zeroing and that then means we can't map unwritten extents
or allocate new extents for the sub-block IO.  It also means the IO
range cannot span EOF because that triggers unconditional sub-block
zeroing in iomap_dio_rw_actor().

And because we may have to map multiple extents to fully span an IO
range, we have to guarantee that subsequent extents for the IO are
also written otherwise we have a partial write abort case. Hence we
have single extent limitations as well.

So "no allocation" really doesn't describe what we want this flag to
do at all.

If we're going to use a flag for this specific functionality, let's
call it what it is: IOMAP_DIO_UNALIGNED/IOMAP_UNALIGNED and do two
things with it.

	1. Make unaligned IO a formal part of the iomap_dio_rw()
	behaviour so it can do the common checks for things that
	need exclusive serialisation for unaligned IO (i.e. avoid IO
	spanning EOF, abort if there are cached pages over the
	range, etc).

	2. require the filesystem mapping callback to only allow
	unaligned IO into ranges that are contiguous and don't
	require mapping state changes or sub-block zeroing to be
	performed during the sub-block IO.
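
The conditions in the two points above can be read as a single predicate
over the IO range and the extent it maps to. What follows is only a
rough userspace sketch of that reading, not kernel code: struct extent,
enum ext_state and unaligned_dio_ok() are hypothetical names made up for
illustration, and the real checks would be split between iomap_dio_rw()
and the filesystem mapping callback.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; NOT the kernel's struct iomap. */
enum ext_state { EXT_HOLE, EXT_UNWRITTEN, EXT_WRITTEN };

struct extent {
	uint64_t offset;	/* file offset of the extent start, in bytes */
	uint64_t length;	/* extent length, in bytes */
	enum ext_state state;	/* allocation state of the extent */
	bool shared;		/* writing would require COW */
};

/*
 * Return true if a sub-block write to [pos, pos + len) could proceed
 * without exclusive serialisation under the rules discussed above.
 * isize is the current file size; cached_pages says whether the page
 * cache holds pages over the range.
 */
static bool unaligned_dio_ok(const struct extent *ext, uint64_t pos,
			     uint64_t len, uint64_t isize,
			     bool cached_pages)
{
	if (len == 0 || len > UINT64_MAX - pos)
		return false;		/* empty or overflowing range */
	if (pos + len > isize)
		return false;		/* spans EOF: would need zeroing */
	if (cached_pages)
		return false;		/* cached pages over the range */
	if (ext->state != EXT_WRITTEN)
		return false;		/* hole/unwritten: state change */
	if (ext->shared)
		return false;		/* COW touches the whole block */
	if (pos < ext->offset || pos + len > ext->offset + ext->length)
		return false;		/* must fit in a single extent */
	return true;
}
```

If the predicate returns false, the caller would fall back to the
existing exclusively serialised path rather than fail the IO.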

Something I hadn't thought about before is whether applications might
depend on current unaligned dio serialization for coherency and thus
break if the kernel suddenly allows concurrent unaligned dio to pass
through. Should this be something that is explicitly requested by
userspace?

If applications are relying on an undocumented, implementation
specific behaviour of a filesystem that only occurs for IOs of a
certain size for implicit data coherency between independent,
non-overlapping DIOs and/or page cache IO, then they are already
broken and need fixing because that behaviour is not guaranteed to
occur. e.g. a 512 byte block size filesystem does not provide such
serialisation, so if the app depends on 512 byte DIOs being
serialised completely by the filesystem then it already fails on 512
byte block size filesystems.

I'm not sure how the block size relates beyond just changing the
alignment requirements..?

quoted

So, no, we simply don't care about breaking broken applications that
are already broken.

I agree in general, but I'm not sure that helps us on the "don't break
userspace" front. We can call userspace broken all we want, but if some
application has such a workload that historically functions correctly
due to this serialization and all of a sudden starts to cause data
corruption because we decide to remove it, I fear we'd end up taking the
blame regardless. :/


I think it's unlikely. Application writers rarely know about such 
issues, so they can't knowingly depend on them. The sub-sub-genre of 
application writers who rely on dio/aio will be a lot more careful and 
wary of the filesystem.


In this particular case, triggering serialization also triggers blocking
in io_submit, which is the aio/dio user's worst nightmare, worse by
several orders of magnitude than the runner-up. I have code to detect
these cases and try to prevent serialization, or, when serialization is
inevitable, do the serialization in userspace so my io_submits don't get
blocked.
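
To give an idea of what serializing in userspace can look like, here is
a minimal sketch. It is not my actual code: struct file_gate and the
gate_*() helpers are hypothetical names, and it assumes the filesystem
block size is known and nonzero and that any IO not aligned to it will
be serialized by the kernel. The gate runs an unaligned IO only when
nothing else is in flight on the file, and holds back everything else
until it completes, so io_submit is never called on an IO that would
block.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-file gate; block_size must be nonzero. */
struct file_gate {
	uint32_t block_size;	/* filesystem block size, in bytes */
	unsigned inflight;	/* IOs submitted but not yet completed */
	bool exclusive;		/* an unaligned IO is currently in flight */
};

/* An IO is unaligned if its offset or length is not block aligned. */
static bool io_is_unaligned(const struct file_gate *g, uint64_t off,
			    uint64_t len)
{
	return (off % g->block_size) != 0 || (len % g->block_size) != 0;
}

/*
 * Return true if the IO may be passed to io_submit now. On false the
 * caller keeps it on a userspace queue and retries after a completion.
 */
static bool gate_try_submit(struct file_gate *g, uint64_t off, uint64_t len)
{
	if (g->exclusive)
		return false;		/* wait for the unaligned IO */
	if (io_is_unaligned(g, off, len)) {
		if (g->inflight != 0)
			return false;	/* wait for the file to drain */
		g->exclusive = true;
	}
	g->inflight++;
	return true;
}

/* Call once for every completed IO that gate_try_submit() admitted. */
static void gate_complete(struct file_gate *g)
{
	g->inflight--;
	if (g->inflight == 0)
		g->exclusive = false;
}
```

A real implementation would also need a fairness policy, since a steady
stream of aligned IO can starve a queued unaligned one in this sketch.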
