Re: [PATCH 10/21] block: Add fops atomic write support
From: Ming Lei <hidden>
Date: 2023-12-05 01:46:12
Also in:
linux-block, linux-fsdevel, linux-nvme, linux-xfs, lkml
On Mon, Dec 04, 2023 at 01:13:55PM +0000, John Garry wrote:
quoted
quoted
I added this here (as opposed to the caller), as I was not really worried about speeding up the failure path. Are you saying to call even earlier in submission path?atomic_write_unit_min is one hardware property, and it should be checked in blk_queue_atomic_write_unit_min_sectors() from beginning, then you can avoid this check every other where.ok, but we still need to ensure in the submission path that the block device actually supports atomic writes - this was the initial check.
Then you may add one helper bdev_support_atomic_write().
quoted
quoted
quoted
quoted
+ if (pos % atomic_write_unit_min_bytes) + return false; + if (iov_iter_count(iter) % atomic_write_unit_min_bytes) + return false; + if (!is_power_of_2(iov_iter_count(iter))) + return false; + if (iov_iter_count(iter) > atomic_write_unit_max_bytes) + return false; + if (pos % iov_iter_count(iter)) + return false;I am a bit confused about relation between atomic_write_unit_max_bytes and atomic_write_max_bytes.I think that naming could be improved. Or even just drop merging (and atomic_write_max_bytes concept) until we show it to improve performance. So generally atomic_write_unit_max_bytes will be same as atomic_write_max_bytes, however it could be different if: a. request queue nr hw segments or other request queue limits needs to restrict atomic_write_unit_max_bytes b. atomic_write_unit_max_bytes does not need to be a power-of-2 and atomic_write_max_bytes does. So essentially: atomic_write_unit_max_bytes = rounddown_pow_of_2(atomic_write_max_bytes)plug merge often improves sequential IO perf, so if the hardware supports this way, I think 'atomic_write_max_bytes' should be supported from the beginning, such as: - user space submits sequential N * (4k, 8k, 16k, ...) atomic writes, all can be merged to single IO request, which is issued to driver. Or - user space submits sequential 4k, 4k, 8k, 16K, 32k, 64k atomic writes, all can be merged to single IO request, which is issued to driver.Right, we do expect userspace to use a fixed block size, but we give scope in the API to use variable size.
Maybe it is enough to just take atomic_write_unit_min_bytes only, and allow length to be N * atomic_write_unit_min_bytes. But it may violate atomic write boundary?
quoted
The hardware should recognize unit size by start LBA, and check if length is valid, so probably the interface might be relaxed to: 1) start lba is unit aligned, and this unit is in the supported unit range(power_2 in [unit_min, unit_max]) 2) length needs to be: - N * this_unit_size - <= atomic_write_max_bytesPlease note that we also need to consider: - any atomic write boundary (from NVMe)
Can you provide actual NVMe boundary value?
Firstly natural aligned write won't cross boundary, so boundary should
be >= write_unit_max, see blow code from patch 10/21:
+static bool bio_straddles_atomic_write_boundary(loff_t bi_sector,
+ unsigned int bi_size,
+ unsigned int boundary)
+{
+ loff_t start = bi_sector << SECTOR_SHIFT;
+ loff_t end = start + bi_size;
+ loff_t start_mod = start % boundary;
+ loff_t end_mod = end % boundary;
+
+ if (end - start > boundary)
+ return true;
+ if ((start_mod > end_mod) && (start_mod && end_mod))
+ return true;
+
+ return false;
+}
+
Then if the WRITE size is <= boundary, the above function should return
false, right? Looks like it is power_of(2) & aligned atomic_write_max_bytes?
- virt boundary (from NVMe)
virt boundary is applied on bv_offset and bv_len, and NVMe's virt bounary is (4k - 1), it shouldn't be one issue in reality.
And, as I mentioned elsewhere, I am still not 100% comfortable that we don't pay attention to regular max_sectors_kb...
max_sectors_kb should be bigger than atomic_write_max_bytes actually, then what is your concern? Thanks, Ming