Re: Behavior of btrfs_io_geometry()
From: Naohiro Aota <naohiro.aota@wdc.com>
Date: 2021-05-20 07:25:30
Thanks for the comments. On Thu, May 20, 2021 at 09:58:28AM +0300, Nikolay Borisov wrote:
On 20.05.21 г. 9:11, Naohiro Aota wrote:quoted
I have a few questions about btrfs_io_geometry()'s behavior. While I'm testing zoned btrfs on ZNS drive with 2GB zone size, I hit the following ASSERT in btrfs_submit_direct() by running fstests btrfs/017. static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap, struct bio *dio_bio, loff_t file_offset) { ... start_sector = dio_bio->bi_iter.bi_sector; submit_len = dio_bio->bi_iter.bi_size; do { logical = start_sector << 9; em = btrfs_get_chunk_map(fs_info, logical, submit_len); ... ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(dio_bio), logical, submit_len, &geom); ... ASSERT(geom.len <= INT_MAX); clone_len = min_t(int, submit_len, geom.len); ... bio = btrfs_bio_clone_partial(dio_bio, clone_offset, clone_len); On zoned btrfs, we create a SINGLE block group whose size is equal to the device zone size, so we have a 2 GB SINGLE block group on a 2 GB zone size drive. Then, on a SINGLE single block group, btrfs_io_geometry() returns the remaining length from @logical to the end of the block group regardless of the @len argument. Thus, with @logical == 0, we get geom->len = 2 GB, which is larger than INT_MAX, hitting the ASSERT. I'm confusing because I'm not sure what the ASSERT wants to do. It might want to guard btrfs_bio_clone_partial() (and bio_trim()) below? But, since bio_trim() takes sector-based values, and the passed "clone_offset" and "clone_len" is byte-based, we can technically allow larger bytes than INT_MAX. (well, we might never build such large bio, though). And, it looks meaningless to check geom->len here. Since, it can be much larger than bio size on a SINGLE block group. So, in summary, below are my questions. 1. About btrfs_bio_clone_partial() 1.1 What is the meaning of geom->len? - Length from @logical to the stripe boundary? or - Length [logical, logical+len] can go without crossing the boundary?So think of it as splitting a block group into 64k chunks (those are the stripes) then finding in which such 64k chunk/stripe_nr logical falls within, so len would be the length of an io you could do without crossing this 64k boundary. This process is described in https://github.com/btrfs/btrfs-dev-docs/blob/master/bio-submission.txt under section 'Bio mapping'. With this in mind I'd say it's [logical, logical+len] without crossing a 64k boundary. !!!HOWEVER!!!! This applies to bg types other than SINGLE , that's the if (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) check in the function. We do this because for raid levels we'd like the io granularity to be 64k, whereas for SINGLE we simply issue a single write, i.e len is calculated as: len = em->len - offset; which is why you are getting such the size of the block group.
OK, it's the current behavior. So, we should expect geom->len vary the meaning depending on the block group profile type?
quoted
1.2 @len is currently unused in btrfs_bio_clone_partial(), is this correct?It's used to calculate @clone_len parameter so it's indirectly involved in btrfs_bio_clone_partial.
Oops, I made a mistake here. I meant btrfs_get_io_geometry() here. The same for the bullet "1." above. In btrfs_get_io_geometry(), we don't use @len.
quoted
1.3 What should we fill into geom->len when the block group is SINGLE profile?See above.quoted
2. About the ASSERT 2.1 Shouldn't we check submit_len (= bio's length) instead of geom->len ?The assert got added quite a while back so I'm not entirely sure it's pertinent.quoted
2.2 Can't it be larger than INT_MAX? e.g., INT_MAX << SECTOR_SHIFT? 3. About btrfs_bio_clone_partial() 3.1 We can change "int" to "u64" maybe?Can we though? So if size in btrfs_bio_clone_partial is switched to u64 then in that fuction we do size >> 9 potentially having a value of at most 55 bits, passing it to bio_trim, which in turn takes an int. So you'd be truncating values at the time you call bio_trim, no ?
Yes, but we can still safely allow the value up to "INT_MAX << 9", can't we? We may need to add another ASSERT here in this case...