Re: [PATCH v4 6/6] io_uring: add support for zone-append


From: Damien Le Moal <hidden>
Date: 2020-07-31 10:16:56
Also in: io-uring, linux-block, linux-fsdevel, lkml

On 2020/07/31 18:41, hch@infradead.org wrote:
> On Fri, Jul 31, 2020 at 09:34:50AM +0000, Damien Le Moal wrote:
>> Sync writes are done under the inode lock, so there cannot be other writers at
>> the same time. And for the sync case, since the actual written offset is
>> necessarily equal to the file size before the write, there is no need to report
>> it (there is no system call that can report that anyway). For this sync case,
>> the only change that the use of zone append introduces compared to regular
>> writes is the potential for more short writes.
>>
>> Adding a flag for "report the actual offset for appending writes" is fine with
>> me, but do you also mean to use this flag for driving zone append write vs
>> regular writes in zonefs ?
>
> Let's keep semantics and implementation separate.  For the case
> where we report the actual offset we need a size limitation and no
> short writes.

OK. So the name of the flag confused me. The flag name should reflect "Do zone
append and report written offset", right ?

Just to clarify, here was my thinking for zonefs:
1) file open with O_APPEND/aio has RWF_APPEND: then it is OK to assume that the
application did not set the aio offset since APPEND means offset==file size. In
that case, do zone append and report back the written offset.
2) file open without O_APPEND/aio does not have RWF_APPEND: the application
specified an aio offset and we must respect it, write in that exact same order,
so use regular writes.

For regular file systems, with case (1) condition, the FS can use whatever it
wants for the implementation, and report back the written offset, which will
always be the file size at the time the aio was issued.

Your method with a new flag to switch between (1) and (2) is OK with me, but the
"no short writes" may be difficult to achieve in a regular FS, no ? I do not
think current FSes have such guarantees... Especially in the case of buffered
async writes I think.
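
Cases (1) and (2) above can be read as a small decision function. What follows
is only a sketch of that reasoning, not zonefs code: the enum, the struct and
the function name are all hypothetical, and the two booleans stand in for
"file opened with O_APPEND" and "aio carries RWF_APPEND".

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only; this is not zonefs code. */
enum write_path {
    WRITE_PATH_REGULAR,      /* honor the offset given by the application */
    WRITE_PATH_ZONE_APPEND   /* let the device pick the offset */
};

struct write_decision {
    enum write_path path;
    bool report_offset;      /* report the written offset on completion */
};

/*
 * Case (1): O_APPEND or RWF_APPEND is set, so the aio offset is assumed
 * unset; use zone append and report the written offset.
 * Case (2): neither is set, so the application chose an offset that must
 * be respected; use a regular write.
 */
struct write_decision choose_write_path(bool file_o_append, bool aio_rwf_append)
{
    struct write_decision d = { WRITE_PATH_REGULAR, false };

    if (file_o_append || aio_rwf_append) {
        d.path = WRITE_PATH_ZONE_APPEND;
        d.report_offset = true;
    }
    return d;
}
```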

> Anything with those semantics can be implemented using Zone Append
> trivially in zonefs, and we don't even need the exclusive lock in that
> case.  But even without that flag anything that has an exclusive lock can
> at least in theory be implemented using Zone Append, it just needs
> support for submitting another request from the I/O completion handler
> of the first.  I just don't think it is worth it - with the exclusive
> lock we do have access to the zone serialized so a normal write works
> just fine.  Both for the sync and async case.

We did switch to have zonefs do append writes in the sync case, always. Hmmm...
Not sure anymore it was such a good idea.

>> The fcntl or ioctl for getting the max atomic write size would be fine too.
>> Given that zonefs is very close to the underlying zoned drive, I was assuming
>> that the application can simply consult the device sysfs zone_append_max_bytes
>> queue attribute.
>
> For zonefs we can, yes.  But in many ways that is a lot more cumbersome
> than having an API that works on the fd you want to write on.

Got it. Makes sense.
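
For reference, consulting the sysfs attribute from an application could look
roughly like the sketch below. It assumes the caller already knows the kernel
name of the zoned block device backing the zonefs mount (for example
"nvme0n1", an example name only); finding that name from an open fd is the
cumbersome part mentioned above. The function name is made up for this
illustration.

```c
#include <stdio.h>

/*
 * Read /sys/block/<disk>/queue/zone_append_max_bytes.
 * `disk` is the kernel block device name supplied by the caller.
 * Returns the limit in bytes, or -1 if the attribute cannot be read.
 */
long long read_zone_append_max_bytes(const char *disk)
{
    char path[256];
    unsigned long long value;
    FILE *f;
    int n;

    n = snprintf(path, sizeof(path),
                 "/sys/block/%s/queue/zone_append_max_bytes", disk);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;           /* device name too long for the buffer */

    f = fopen(path, "r");
    if (!f)
        return -1;           /* no such device or attribute */

    n = fscanf(f, "%llu", &value);
    fclose(f);

    return n == 1 ? (long long)value : -1;
}
```

An fcntl() or ioctl() on the file descriptor, as discussed above, would remove
the need for the device-name lookup entirely.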

>> For regular file systems, this value would be used internally
>> only. I do not really see how it can be useful to applications. Furthermore, the
>> file system may have a hard time giving that information to the application
>> depending on its underlying storage configuration (e.g. erasure
>> coding/declustered RAID).
>
> File systems might have all kinds of limits of their own (e.g. extent
> sizes).  And a good API that just works everywhere and is properly
> documented is much better than heaps of cargo culted crap all over
> applications.

OK. Makes sense. That said, taking Naohiro's work on btrfs as an example, zone
append is used for every data write, no matter if it is O_APPEND/RWF_APPEND or
not. The size limitation for zone append writes is not needed at all by
applications. Maximum extent size is aligned to the max append write size
internally, and if the application issues a larger write, it loops over multiple
extents, similarly to what any regular write may do (if there is overwrite etc...).
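
The looping described above amounts to cutting one large write into pieces no
bigger than the maximum append size. The sketch below only illustrates that
arithmetic; it is not taken from the btrfs patches, and the function name and
the output-array interface are assumptions made for the example.

```c
#include <stddef.h>

/*
 * Split a write of `len` bytes into chunks of at most `max_append` bytes.
 * Up to `cap` chunk sizes are stored in `chunks`. Returns the number of
 * chunks needed, or 0 if `len` or `max_append` is 0.
 */
size_t split_into_append_chunks(size_t len, size_t max_append,
                                size_t *chunks, size_t cap)
{
    size_t n = 0;

    if (max_append == 0)
        return 0;

    while (len > 0) {
        size_t this_len = len < max_append ? len : max_append;

        if (n < cap)
            chunks[n] = this_len;   /* record only what fits in the array */
        n++;
        len -= this_len;
    }
    return n;
}
```

With a 4096-byte limit, a 10000-byte write becomes three pieces of 4096, 4096
and 1808 bytes, and the application never needs to know the limit.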

For the regular FS case, my thinking on the semantics really was: if asked to do
so, return the written offset for RWF_APPEND aios. And I think that
implementing that does not depend in any way on what the FS does internally.

But I think I am starting to see the picture you are drawing here:
1) Introduce a fcntl() to get "maximum size for atomic append writes"
2) Introduce an aio flag specifying "Do atomic append write and report written
offset"
3) For an aio specifying "Do atomic append write and report written offset", if
the aio is larger than "maximum size for atomic append writes", fail it on
submission, no short writes.
4) For any other aio, it is business as usual: aios are processed as they are now.

And the implementation is actually completely free to use zone append writes or
regular writes regardless of the "Do atomic append write and report written
offset" being used or not.
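
Points 1) to 4) can be summarized as a submission-time check. This is a
minimal sketch under stated assumptions: the flag and the limit are the
proposed, not yet existing, interfaces, the function name is hypothetical, and
-EINVAL is an arbitrary choice of error code for "fail it on submission".

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * atomic_append_report: the aio carries the proposed "Do atomic append
 *                       write and report written offset" flag (point 2).
 * len:                  size of the aio in bytes.
 * max_atomic_append:    value the proposed fcntl() would return (point 1).
 *
 * Returns 0 if the aio may be submitted, or a negative errno otherwise.
 */
int check_append_submission(bool atomic_append_report, size_t len,
                            size_t max_atomic_append)
{
    if (!atomic_append_report)
        return 0;            /* point 4: business as usual */

    if (len > max_atomic_append)
        return -EINVAL;      /* point 3: fail on submission, no short write */

    return 0;                /* accepted; written offset is reported later */
}
```

Nothing in this check says whether the write is then carried out with zone
append or with a regular write, which matches the separation of semantics and
implementation discussed above.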

Is that your thinking ? That would work for me. That actually ends up completely
unifying the interface behavior for zonefs and regular FS. Same semantics.


-- 
Damien Le Moal
Western Digital Research
