Re: [PATCH v2 0/2] zone-append support in io-uring and aio
From: Matthew Wilcox <willy@infradead.org>
Date: 2020-06-30 12:46:56
On Thu, Jun 25, 2020 at 10:45:47PM +0530, Kanchan Joshi wrote:
Zone-append completion result --->
With zone-append, the location where the write took place can only be known
after completion. So apart from the usual return value of write, an additional
means is needed to obtain the actual written location.
In aio, this is returned to application using res2 field of io_event -
struct io_event {
	__u64		data;		/* the data field from the iocb */
	__u64		obj;		/* what iocb this event came from */
	__s64		res;		/* result code for this event */
	__s64		res2;		/* secondary result */
};

Ah, now I understand.  I think you're being a little too specific by
calling this zone-append.  This is really a "write-anywhere" operation,
and the specified address is only a hint.

In io-uring, cqe->flags is repurposed for zone-append result.
struct io_uring_cqe {
	__u64	user_data;	/* sqe->data submission passed back */
	__s32	res;		/* result code for this event */
	__u32	flags;
};
Since the 32-bit flags field is not sufficient, we choose to return the
zone-relative offset in sector/512b units. This can cover the zone size
represented by chunk_sectors. Applications then have to combine this with the
zone start to get the disk-relative offset. But if more bits were obtained by
pulling from the res field, that too would compel applications to interpret the
res field differently, and it seems more painstaking than the former option.
To keep uniformity, even with aio, zone-relative offset is returned.

Urgh, no, that's dreadful.  I'm not familiar with the io_uring code.
Maybe the first 8 bytes of the user_data could be required to be the
result offset for this submission type?

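For reference, the conversion the cover letter leaves to applications could
look roughly like the sketch below. The helper name and its parameters are
made up for illustration and are not part of the patch set; it assumes the
application remembers the zone start (in bytes) it submitted with, and that
cqe->flags carries the zone-relative offset in 512-byte sectors as described
above.

```c
#include <assert.h>
#include <stdint.h>

/* 512-byte sectors, matching the "sector/512b units" in the cover letter. */
#define ZA_SECTOR_SHIFT 9

/*
 * Hypothetical helper (not in the patch set).
 *
 * zone_start_bytes: byte offset of the zone start given at submission
 * rel_sectors:      zone-relative result, as proposed for cqe->flags
 *
 * Returns the disk-relative byte offset at which the data was written.
 */
static inline uint64_t zone_append_disk_offset(uint64_t zone_start_bytes,
					       uint32_t rel_sectors)
{
	/* Widen before shifting so large sector counts do not overflow 32 bits. */
	return zone_start_bytes + ((uint64_t)rel_sectors << ZA_SECTOR_SHIFT);
}
```

Every zone-aware application would have to carry something like this and
keep the zone start around per request, which is the burden being objected
to here.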
Block IO vs File IO --->
For now, the user zone-append interface is supported only for
zoned-block-device. Regular files/block-devices are not supported. Regular
file-system (e.g. F2FS) will not need this anyway, because zone peculiarities
are abstracted within FS. At this point, ZoneFS also likes to use append
implicitly rather than explicitly. But if/when ZoneFS starts supporting
explicit/on-demand zone-append, the check allowing-only-block-device should be
changed.

But we also have O_APPEND files. And maybe we'll have other kinds of file in future for which this would make sense.

Semantics --->
Zone-append, by its nature, may perform write on a different location than what
was specified. It does not fit into POSIX, and trying to fit may just undermine

... I disagree that it doesn't fit into POSIX. As I said above, O_APPEND is a POSIX concept, so POSIX already understands that writes may not end up at the current write pointer.
its benefit. It may be better to keep semantics as close to zone-append as possible i.e. specify zone-start location, and obtain the actual-write location post completion. Towards that goal, existing async APIs seem to fit fine. Async APIs (uring, linux aio) do not work on implicit write-pointer and demand explicit write offset (which is what we need for append). Neither write-pointer is taken as input, nor it is updated on completion. And there is a clear way to get zone-append result. Zone-aware applications while using these async APIs can be fine with, for the lack of better word, zone-append semantics itself. Sync APIs work with implicit write-pointer (at least few of those), and there is no way to obtain zone-append result, making it hard for user-space zone-append.
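
The O_APPEND point raised in the reply above can be seen with plain POSIX
calls. The sketch below is illustrative only: the function name and the /tmp
path template are assumptions, and it is a single-writer case. It writes five
bytes, rewinds the file offset to 0, then writes one more byte through an
O_APPEND descriptor and reads the file offset back to find where that byte
landed.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical demo (not from the thread). Returns the byte offset at
 * which the final one-byte write landed, or -1 on any error.
 */
static long append_lands_at(void)
{
	char path[] = "/tmp/append-demo-XXXXXX";	/* assumed writable */
	int fd = mkstemp(path);
	long where = -1;
	int fl;

	if (fd < 0)
		return -1;
	unlink(path);	/* the file disappears once fd is closed */

	fl = fcntl(fd, F_GETFL);
	if (fl >= 0 &&
	    fcntl(fd, F_SETFL, fl | O_APPEND) == 0 &&
	    write(fd, "hello", 5) == 5 &&
	    lseek(fd, 0, SEEK_SET) == 0 &&	/* offset 0 is now only a hint */
	    write(fd, "!", 1) == 1) {
		/* After the write, the offset sits just past the new data. */
		off_t end = lseek(fd, 0, SEEK_CUR);

		if (end > 0)
			where = (long)end - 1;
	}
	close(fd);
	return where;
}
```

Although the offset was 0 before the last write, the byte lands at the end of
the file. Reading the offset back afterwards only identifies the written
location when there is a single writer; with concurrent appenders it is racy,
which is the gap a completion-time result would close.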