Re: [PATCH v2 10/11] iomap: Add support for zone append writes
From: Damien Le Moal <hidden>
Date: 2020-03-25 05:38:58
Also in:
linux-fsdevel, linux-scsi
On 2020/03/25 7:46, Dave Chinner wrote:
> On Wed, Mar 25, 2020 at 12:24:53AM +0900, Johannes Thumshirn wrote:
>> Use REQ_OP_ZONE_APPEND for direct I/O write BIOs, instead of REQ_OP_WRITE
>> if the file-system requests it. The file system can make this request
>> by setting the new flag IOCB_ZONE_APPEND for a direct I/O kiocb before
>> calling iomap_dio_rw(). Using this information, this function propagates
>> the zone append flag using IOMAP_ZONE_APPEND to the file system
>> iomap_begin() method.
>>
>> The BIOs submitted for the zone append DIO will be set to use the
>> REQ_OP_ZONE_APPEND operation. Since zone append operations cannot be
>> split, the iomap_apply() and iomap_dio_rw() internal loops are executed
>> only once, which may result in short writes.
>>
>> Signed-off-by: Johannes Thumshirn <redacted>
>> ---
>>  fs/iomap/direct-io.c  | 80 ++++++++++++++++++++++++++++++++++++-------
>>  include/linux/fs.h    |  1 +
>>  include/linux/iomap.h | 22 ++++++------
>>  3 files changed, 79 insertions(+), 24 deletions(-)
>>
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index 23837926c0c5..b3e2aadce72f 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -17,6 +17,7 @@
>>   * Private flags for iomap_dio, must not overlap with the public ones in
>>   * iomap.h:
>>   */
>> +#define IOMAP_DIO_ZONE_APPEND	(1 << 27)
>>  #define IOMAP_DIO_WRITE_FUA	(1 << 28)
>>  #define IOMAP_DIO_NEED_SYNC	(1 << 29)
>>  #define IOMAP_DIO_WRITE		(1 << 30)
>> @@ -39,6 +40,7 @@ struct iomap_dio {
>>  			struct task_struct	*waiter;
>>  			struct request_queue	*last_queue;
>>  			blk_qc_t		cookie;
>> +			sector_t		sector;
>>  		} submit;
>>
>>  		/* used for aio completion: */
>> @@ -151,6 +153,9 @@ static void iomap_dio_bio_end_io(struct bio *bio)
>>  	if (bio->bi_status)
>>  		iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));
>>
>> +	if (dio->flags & IOMAP_DIO_ZONE_APPEND)
>> +		dio->submit.sector = bio->bi_iter.bi_sector;
>> +
>>  	if (atomic_dec_and_test(&dio->ref)) {
>>  		if (dio->wait_for_completion) {
>>  			struct task_struct *waiter = dio->submit.waiter;
>> @@ -194,6 +199,21 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
>>  	iomap_dio_submit_bio(dio, iomap, bio);
>>  }
>>
>> +static sector_t
>> +iomap_dio_bio_sector(struct iomap_dio *dio, struct iomap *iomap, loff_t pos)
>> +{
>> +	sector_t sector = iomap_sector(iomap, pos);
>> +
>> +	/*
>> +	 * For zone append writes, the BIO needs to point at the start of the
>> +	 * zone to append to.
>> +	 */
>> +	if (dio->flags & IOMAP_DIO_ZONE_APPEND)
>> +		sector = ALIGN_DOWN(sector, bdev_zone_sectors(iomap->bdev));
>> +
>> +	return sector;
>> +}
>
> This seems to me like it should be done by the ->iomap_begin
> implementation when mapping the IO. I don't see why this needs to be
> specially handled by the iomap dio code.

Fair point. However, iomap_sector() does not simply return iomap->addr >> 9.
That means that in iomap_begin(), the mapping address and offset returned need
to account for the iomap_sector() calculation, i.e.

(iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT

which does not result in very obvious values for the iomap address and offset.
Well, I guess for zone append we simply need to set

iomap->offset = pos;

and

iomap->addr = zone start sector;

and that would then work. This however looks fragile to me.

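To make the arithmetic above concrete, here is a minimal userspace sketch (not kernel code). struct iomap_model, model_iomap_sector() and model_map_zone_append() are hypothetical stand-ins; only the formula itself is taken from the text above. The sketch assumes the mapping address is kept in bytes, so the zone start sector is shifted by SECTOR_SHIFT before it is stored.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9

/* Userspace stand-in for the two struct iomap fields used here. */
struct iomap_model {
	uint64_t addr;   /* disk offset of the mapping, in bytes */
	int64_t  offset; /* file offset of the mapping, in bytes */
};

/* Same arithmetic as the iomap_sector() calculation quoted above. */
static uint64_t model_iomap_sector(const struct iomap_model *iomap, int64_t pos)
{
	return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
}

/*
 * Hypothetical ->iomap_begin() fragment for a zone append: map the write
 * at file position pos so that the sector calculation yields the zone start.
 */
static void model_map_zone_append(struct iomap_model *iomap, int64_t pos,
				  uint64_t zone_start_sector)
{
	assert(pos >= 0);
	iomap->offset = pos;
	iomap->addr = zone_start_sector << SECTOR_SHIFT;
}
```

With this mapping, model_iomap_sector() returns the zone start sector only when it is called with the same pos that was used to build the mapping. Any later position inside the mapping yields a sector past the zone start, which is one way to see why the approach looks fragile.
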
>> +
>>  static loff_t
>>  iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>>  		struct iomap_dio *dio, struct iomap *iomap)
>> @@ -204,6 +224,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>>  	struct bio *bio;
>>  	bool need_zeroout = false;
>>  	bool use_fua = false;
>> +	bool zone_append = false;
>>  	int nr_pages, ret = 0;
>>  	size_t copied = 0;
>>  	size_t orig_count;
>> @@ -235,6 +256,9 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>>  			use_fua = true;
>>  	}
>>
>> +	if (dio->flags & IOMAP_DIO_ZONE_APPEND)
>> +		zone_append = true;
>> +
>>  	/*
>>  	 * Save the original count and trim the iter to just the extent we
>>  	 * are operating on right now. The iter will be re-expanded once
>> @@ -266,12 +290,28 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>>
>>  		bio = bio_alloc(GFP_KERNEL, nr_pages);
>>  		bio_set_dev(bio, iomap->bdev);
>> -		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
>> +		bio->bi_iter.bi_sector = iomap_dio_bio_sector(dio, iomap, pos);
>>  		bio->bi_write_hint = dio->iocb->ki_hint;
>>  		bio->bi_ioprio = dio->iocb->ki_ioprio;
>>  		bio->bi_private = dio;
>>  		bio->bi_end_io = iomap_dio_bio_end_io;
>>
>> +		if (dio->flags & IOMAP_DIO_WRITE) {
>> +			bio->bi_opf = REQ_SYNC | REQ_IDLE;
>> +			if (zone_append)
>> +				bio->bi_opf |= REQ_OP_ZONE_APPEND;
>> +			else
>> +				bio->bi_opf |= REQ_OP_WRITE;
>> +			if (use_fua)
>> +				bio->bi_opf |= REQ_FUA;
>> +			else
>> +				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
>> +		} else {
>> +			bio->bi_opf = REQ_OP_READ;
>> +			if (dio->flags & IOMAP_DIO_DIRTY)
>> +				bio_set_pages_dirty(bio);
>> +		}
>
> Why move all this code? If it's needed, please split it into a
> separate patch to separate it from the new functionality...

The BIO add page in bio_iov_iter_get_pages() needs to know that the BIO is a
zone append so that it stops adding pages before the BIO grows too large, which
would result in a split when it is submitted. So bio->bi_opf needs to be set
before calling bio_iov_iter_get_pages(), which makes this change related to the
new functionality. But I guess it is OK to do it regardless of the BIO
operation.

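The ordering constraint can be illustrated with a toy userspace model (not kernel code). Everything here is hypothetical: struct model_bio, enum model_op, model_add_pages() and the max_append_bytes limit only stand in for the real BIO, its operation field and whatever device limit applies to zone append. The sketch shows that the page-adding loop can only cap the size if the operation is already set when it runs.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE 4096u

enum model_op { MODEL_OP_WRITE, MODEL_OP_ZONE_APPEND };

struct model_bio {
	enum model_op op;   /* plays the role of bio->bi_opf */
	unsigned int size;  /* bytes added so far, like bi_iter.bi_size */
};

/*
 * Add up to nr_pages pages to the BIO. For a zone append, stop before the
 * BIO would exceed max_append_bytes (a hypothetical device limit), because
 * such a BIO cannot be split later. Returns the number of pages added.
 */
static unsigned int model_add_pages(struct model_bio *bio, unsigned int nr_pages,
				    unsigned int max_append_bytes)
{
	unsigned int added = 0;

	assert(max_append_bytes >= MODEL_PAGE_SIZE);
	while (added < nr_pages) {
		bool is_append = bio->op == MODEL_OP_ZONE_APPEND;

		/* The cap only applies if the operation is known at this point. */
		if (is_append && bio->size + MODEL_PAGE_SIZE > max_append_bytes)
			break;
		bio->size += MODEL_PAGE_SIZE;
		added++;
	}
	return added;
}
```

If the operation is set to MODEL_OP_ZONE_APPEND before calling model_add_pages(), the BIO stays within the limit. If it is still MODEL_OP_WRITE during the call and only changed afterwards, the BIO can end up larger than the limit.
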
>> +
>>  		ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
>>  		if (unlikely(ret)) {
>>  			/*
>> @@ -284,19 +324,10 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>>  			goto zero_tail;
>>  		}
>>
>> -		n = bio->bi_iter.bi_size;
>> -		if (dio->flags & IOMAP_DIO_WRITE) {
>> -			bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
>> -			if (use_fua)
>> -				bio->bi_opf |= REQ_FUA;
>> -			else
>> -				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
>> +		if (dio->flags & IOMAP_DIO_WRITE)
>>  			task_io_account_write(n);
>> -		} else {
>> -			bio->bi_opf = REQ_OP_READ;
>> -			if (dio->flags & IOMAP_DIO_DIRTY)
>> -				bio_set_pages_dirty(bio);
>> -		}
>> +
>> +		n = bio->bi_iter.bi_size;
>>
>>  		dio->size += n;
>>  		pos += n;
>> @@ -304,6 +335,15 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>>
>>  		nr_pages = iov_iter_npages(dio->submit.iter, BIO_MAX_PAGES);
>>  		iomap_dio_submit_bio(dio, iomap, bio);
>> +
>> +		/*
>> +		 * Issuing multiple BIOs for a large zone append write can
>> +		 * result in reordering of the write fragments and to data
>> +		 * corruption. So always stop after the first BIO is issued.
>> +		 */
>> +		if (zone_append)
>> +			break;
>
> I don't think this sort of functionality should be tied to "zone
> append". If there is a need for "issue a single (short) bio only" it
> should be a flag to iomap_dio_rw() set by the filesystem, which can
> then handle the short read/write that is returned.

Yes, that would be cleaner.
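
As a rough illustration of the "single (short) bio" flag idea, here is a userspace sketch (not kernel code). The flag name MODEL_DIO_SINGLE_BIO and the function model_dio_write() are hypothetical and do not correspond to any existing iomap interface. The loop knows nothing about zone append. It only honors a generic flag, and the caller deals with the short count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag, not an existing kernel name. */
#define MODEL_DIO_SINGLE_BIO (1u << 0)

/*
 * Issue a write of length bytes in chunks of at most max_bio_bytes, one
 * chunk per simulated BIO. With MODEL_DIO_SINGLE_BIO set, stop after the
 * first chunk. Returns the number of bytes issued, which may be short.
 */
static size_t model_dio_write(size_t length, size_t max_bio_bytes,
			      unsigned int flags)
{
	size_t done = 0;

	assert(max_bio_bytes > 0);
	while (done < length) {
		size_t n = length - done;

		if (n > max_bio_bytes)
			n = max_bio_bytes;
		done += n;	/* stands in for submitting one BIO */

		if (flags & MODEL_DIO_SINGLE_BIO)
			break;
	}
	return done;
}
```

A filesystem that sets the flag would compare the returned count with the requested length and decide itself whether to issue another write for the remainder.
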

>> +		/*
>> +		 * Zone append writes cannot be split and be shorted. Break
>> +		 * here to let the user know instead of sending more IOs which
>> +		 * could get reordered and corrupt the written data.
>> +		 */
>> +		if (flags & IOMAP_ZONE_APPEND)
>> +			break;
>
> ditto.
>
> Cheers,
>
> Dave.

-- 
Damien Le Moal
Western Digital Research