Re: [PATCH v2 2/3] btrfs: zoned: fix compressed writes
From: Damien Le Moal <hidden>
Date: 2021-06-10 07:45:57
On 2021/06/10 16:41, Qu Wenruo wrote:
On 2021/6/10 3:36 PM, Damien Le Moal wrote:
On 2021/06/10 16:28, Qu Wenruo wrote:
On 2021/5/18 11:40 PM, Johannes Thumshirn wrote:
When multiple processes write data to the same block group on a compressed zoned filesystem, the underlying device could report I/O errors and data corruption is possible.

This happens because on a zoned file system, compressed data writes were sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND operation. But with REQ_OP_WRITE and parallel submission it cannot be guaranteed that the data is always submitted aligned to the underlying zone's write pointer.

The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned filesystem is non-intrusive on a regular file system or when submitting to a conventional zone on a zoned filesystem, as it is guarded by btrfs_use_zone_append.

Reported-by: David Sterba <dsterba@suse.com>
Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag")
Signed-off-by: Johannes Thumshirn <redacted>

While working on compression support for subpage, I just noticed some strange code behavior. I'm not sure if it's by design or just a typo, so please correct me if I'm wrong.

[...]
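To illustrate the write pointer problem described in the commit message, here is a minimal userspace sketch, assuming a toy model of a single sequential zone. It is not kernel code: struct toy_zone, toy_zone_write() and toy_zone_append() are hypothetical names made up for this example. A regular write has to name its target sector and fails if that sector is not the current write pointer; an append lets the zone pick the location and reports it back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one sequential-write-required zone (hypothetical, not kernel code). */
struct toy_zone {
	uint64_t start;	/* first sector of the zone */
	uint64_t len;	/* zone size in sectors */
	uint64_t wp;	/* write pointer: next writable sector */
};

/*
 * Regular write: the submitter chooses the sector. The zone only accepts
 * it when the sector matches the write pointer exactly.
 */
static bool toy_zone_write(struct toy_zone *z, uint64_t sector, uint64_t nr)
{
	if (sector != z->wp || z->wp + nr > z->start + z->len)
		return false;	/* models the I/O error for an unaligned write */
	z->wp += nr;
	return true;
}

/*
 * Zone append: the submitter only names the zone. The data lands at the
 * current write pointer and the actual location is reported back.
 */
static bool toy_zone_append(struct toy_zone *z, uint64_t nr, uint64_t *written_at)
{
	if (z->wp + nr > z->start + z->len)
		return false;	/* zone full */
	*written_at = z->wp;
	z->wp += nr;
	return true;
}
```

With two writers that reserved sectors 0 and 8, the regular write for sector 8 fails if it happens to be submitted first, while two appends succeed in either order and simply report where the data ended up.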
 	bio = btrfs_bio_alloc(first_byte);
-	bio->bi_opf = REQ_OP_WRITE | write_flags;
+	bio->bi_opf = bio_op | write_flags;
 	bio->bi_private = cb;
 	bio->bi_end_io = end_compressed_bio_write;
+	if (use_append) {
+		struct extent_map *em;
+		struct map_lookup *map;
+		struct block_device *bdev;
+
+		em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE);
+		if (IS_ERR(em)) {
+			kfree(cb);
+			bio_put(bio);
+			return BLK_STS_NOTSUPP;
+		}
+
+		map = em->map_lookup;
+		/* We only support single profile for now */
+		ASSERT(map->num_stripes == 1);
+		bdev = map->stripes[0].dev->bdev;

This variable seems rather useless...

No need to bother with that; it has already been removed by a later refactor.

+
+		bio_set_dev(bio, bdev);
+		free_extent_map(em);
+	}
+

Here, for the newly created bio, we call bio_set_dev() on it (although a later patch refactors this part a little). So far so good.
 	if (blkcg_css) {
 		bio->bi_opf |= REQ_CGROUP_PUNT;
 		kthread_associate_blkcg(blkcg_css);
@@ -432,6 +458,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 	bytes_left = compressed_len;
 	for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) {
 		int submit = 0;
+		int len;
 		page = compressed_pages[pg_index];
 		page->mapping = inode->vfs_inode.i_mapping;
@@ -439,9 +466,13 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 			submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, 0);
+		if (pg_index == 0 && use_append)
+			len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
+		else
+			len = bio_add_page(bio, page, PAGE_SIZE, 0);
+
 		page->mapping = NULL;
-		if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
-		    PAGE_SIZE) {
+		if (submit || len < PAGE_SIZE) {
 			/*
 			 * inc the count before we submit the bio so
 			 * we know the end IO handler won't happen before
@@ -465,11 +496,15 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 			}
 			bio = btrfs_bio_alloc(first_byte);
-			bio->bi_opf = REQ_OP_WRITE | write_flags;
+			bio->bi_opf = bio_op | write_flags;

But here, for the newly allocated bio, we didn't call bio_set_dev() at all. Shouldn't every zoned write bio need that bio_set_dev() call?

Yep, bio->bi_bdev must be set before bio_add_zone_append_page() is called. Otherwise, there will be a crash (the first line of bio_add_zone_append_page() gets the request queue from bio->bi_bdev). I wonder why we do not see a NULL pointer oops here... Johannes ?

That's because it's really rare for a compressed extent to lie right at a stripe boundary. In most cases, the data we provide for compression tests is either:

- Too compressible
  The whole range can be compressed into just one sector, so it will never cross a stripe boundary.

- Not compressible at all
  We fall back to regular buffered write, which does its stripe boundary check correctly.

Thus it's nearly impossible to hit this in the various tests.
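The stripe boundary argument above can be sketched with a small standalone check. This is only an illustration: toy_crosses_stripe() and TOY_STRIPE_LEN are hypothetical names, and the 64 KiB stripe length is an assumption chosen for the example, not something taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed stripe length for this illustration only. */
#define TOY_STRIPE_LEN (64 * 1024ULL)

/*
 * Return true when the byte range [start, start + len) touches more than
 * one stripe, i.e. when a bio covering it would have to be split.
 */
static bool toy_crosses_stripe(uint64_t start, uint64_t len)
{
	if (len == 0)
		return false;
	/* Compare the stripe index of the first and the last byte. */
	return (start / TOY_STRIPE_LEN) != ((start + len - 1) / TOY_STRIPE_LEN);
}
```

A compressed extent of a single 4 KiB sector can never satisfy this check, whatever its start offset, as long as the start is sector aligned. Only an extent of two or more sectors that happens to start just before a stripe boundary does, which matches the "very rare" observation.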
But this is a data write, isn't it ? So in the zoned case, it means a zone append write. And adding a page for even a single sector using bio_add_zone_append_page() will oops if the bio bdev is not set, regardless of the bio size... Am I misunderstanding something here about this IO path ?
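The ordering constraint in question (device first, then append pages) can be shown with a toy userspace model. Everything here is a hypothetical stand-in: struct toy_bio, struct toy_bdev, toy_bio_init() and toy_add_zone_append_page() are invented for this sketch and are not the kernel API. In particular, the model returns -1 at the point where, per the description above, the real function would dereference bio->bi_bdev.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a block device and a bio. */
struct toy_bdev {
	int queue_id;
};

struct toy_bio {
	struct toy_bdev *bdev;	/* NULL until a device is assigned */
	unsigned int len;	/* bytes added so far */
};

/*
 * Model of "allocate a bio and, for zone append, assign its device".
 * Doing both in one helper means every allocation site gets the same setup.
 */
static void toy_bio_init(struct toy_bio *bio, struct toy_bdev *bdev, int use_append)
{
	bio->bdev = NULL;
	bio->len = 0;
	if (use_append)
		bio->bdev = bdev;
}

/*
 * Model of adding a page for a zone append. Returns the number of bytes
 * added, or -1 where real code would hit a NULL pointer dereference.
 */
static int toy_add_zone_append_page(struct toy_bio *bio, unsigned int len)
{
	if (bio->bdev == NULL)
		return -1;
	bio->len += len;
	return (int)len;
}
```

In this model a bio initialized with use_append set accepts the page, while one that skipped the device assignment is rejected regardless of how little data is added, which is the size-independence being pointed out.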
Thanks,
Qu
I guess since most compressed extents are pretty small, it's really hard to hit a case where we need to split the bio due to a stripe boundary, and thus very hard to hit anything wrong.

Anyway, since I'm reworking the compression code so that compressed writes follow the same boundary check as in extent_io.c, I can definitely refactor the bio allocation code to add the calls needed for zoned.

Thanks,
Qu
 			bio->bi_private = cb;
 			bio->bi_end_io = end_compressed_bio_write;
 			if (blkcg_css)
 				bio->bi_opf |= REQ_CGROUP_PUNT;
+			/*
+			 * Use bio_add_page() to ensure the bio has at least one
+			 * page.
+			 */
 			bio_add_page(bio, page, PAGE_SIZE, 0);
 		}
 		if (bytes_left < PAGE_SIZE) {
-- 
Damien Le Moal
Western Digital Research